<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<rfc ipr="trust200902" category="info" docName="draft-comcast-flair-crypto-discovery-00">
  <front>
    <title abbrev="FLAIR Cryptography Discovery">FLAIR: Framework for Language-Agnostic Discovery of Cryptographic Elements using Semantic Graphs</title>
    <author surname="Kayas" fullname="Golam Kayas">
      <organization>Comcast Corporation</organization>
      <address>
        <email>golam_kayas@comcast.com</email>
      </address>
    </author>
     <author surname="Jagaputhran Sampathkumar" fullname="Jagaputhran Sampathkumar">
      <organization>Comcast Corporation</organization>
      <address>
        <email>jagaputhran.s@gmail.com</email>
      </address>
    </author>
        <author surname="Sekar" fullname="Barath Sekar">
      <organization>Comcast Corporation</organization>
      <address>
        <email>barath_sekar@comcast.com</email>
      </address>
    </author>
        <author surname="B" fullname="Sri Surya Gayatri B">
      <organization>Comcast Corporation</organization>
      <address>
        <email>srisuryagayatri_b@comcast.com</email>
      </address>
    </author>
    <author surname="Dev" fullname="Jayati Dev">
      <organization>Comcast Corporation</organization>
      <address>
        <email>jayati_dev@comcast.com</email>
      </address>
    </author>
    <date day="27" month="April" year="2026" />
    <abstract>
      <t>
        This document introduces FLAIR (Framework for Language-Agnostic Intermediate Representation), a novel approach for discovering cryptographic elements in source code. By detecting the cryptographic components present in code, FLAIR addresses challenges such as identifying deprecated and misconfigured implementations across diverse programming languages. Unlike language-specific tools, FLAIR uses semantic graphs to represent each detected instance of cryptography. This enables identification of cryptographic algorithm usage and associated parameters (keys, nonces, salts), as well as flagging of compliance status against established cryptographic standards. The semantic graph approach makes the representation of cryptographic algorithms and related elements independent of the underlying programming language.
      </t>
    </abstract>
  </front>

  <middle>
    <section anchor="problem" title="Problem Statement">
      <t>
        Modern enterprise and cloud infrastructure increasingly relies on encrypted communication to protect sensitive data in transit. As this reliance deepens, more network outages and performance degradations are being traced to undetected cryptographic vulnerabilities in code <xref target="ref1"/>. Organizations face growing security risks and operational challenges from:
      </t>
      <ol>
        <li>Deprecated cryptographic algorithms that no longer meet security standards</li>
        <li>Misconfigured cryptographic implementations causing network outages</li>
        <li>Incompatible cryptographic implementations across heterogeneous systems</li>
        <li>Lack of visibility into cryptographic usage across enterprise codebases that impacts migration efforts to newer encryption standards, like Post-Quantum Cryptography (PQC)</li>
        <li>The difficulty of achieving comprehensive coverage without developing language-specific solutions</li>
      </ol>
      <t>
         The proliferation of programming languages, cryptographic libraries, and implementation patterns in source code significantly impedes building a holistic visibility tool capable of detecting and analyzing cryptographic usage across diverse network applications. Consequently, organizations may struggle to identify and remediate cryptographic vulnerabilities before they lead to service disruptions.
      </t>
    
    </section>

    <section anchor="existing" title="Alternate Approaches to Cryptographic Discovery in Code">
     <t> 
     Current open-source cryptographic detection solutions fall into two categories, each with distinct limitations: (1) general-purpose static analyzers (e.g., Bandit, Snyk, CodeQL, and Joern), and (2) language-specific tools (e.g., CryptoGuard, CogniCrypt, and Cryptolation).
     </t>
      <section anchor="general-analyzers" title="General-Purpose Static Analyzers">
        <t>
          Tools such as Bandit, Snyk, CodeQL, and Joern/CPG <xref target="ref2"/> are examples of general-purpose static analyzers. They are typically used to detect problems such as security bugs, vulnerable libraries, and policy violations before code is deployed. Static analyzers have wide language coverage: a tool exists for nearly every major programming language. Cryptographic discovery is possible with these tools, which can decompose code into graphs and apply rules over them to detect specific cryptographic usage. However, such detections can have a high false-positive rate, and they require many rules across different languages to achieve traceability of cryptographic materials (e.g., keys, nonces, salts, digests).
        </t>
      </section>
      <section anchor="lang-specific" title="Language-Specific Tools">
        <t>
          Tools such as CryptoGuard, CogniCrypt, and Cryptolation are examples of language-specific tools. These open-source tools provide high precision within their supported languages. In many cases, they go beyond pattern matching to provide semantic understanding of cryptographic usage using machine learning models. However, each is designed to work well for a single programming language. Enterprises with polyglot codebases therefore have limited visibility during cryptographic discovery: although each tool yields high-precision results, multiple toolchains must be maintained to cover the entire codebase. This leads to significant development and maintenance overhead.
        </t>
      </section>
    </section>  
    <section anchor="definitions" title="Definitions">
        <dl newline="true">
        <dt>Semantic graphs:</dt>
        <dd>Graphs that represent the semantic meaning of code elements and their relationships, enabling deeper understanding of program structure and behavior.</dd>
        <dt>Code property graphs:</dt>
        <dd>Graphs that represent code elements and their properties, including control flow, data flow, and inter-element relationships.</dd>
        <dt>Cryptographic elements:</dt>
        <dd>Components involved in cryptographic operations, such as algorithms, keys, nonces, salts, and message digests.</dd>
        <dt>Language-agnostic:</dt>
        <dd>An approach that is independent of any specific programming language, allowing for broad applicability across diverse codebases.</dd>
        <dt>Polyglot codebases:</dt>
        <dd>Codebases that contain source code written in multiple programming languages.</dd>  
        <dt>Deprecated cryptographic algorithms:</dt>
        <dd>Cryptographic algorithms that are no longer considered secure or recommended for use, such as MD5 and SHA1.</dd>
        <dt>Misconfigured cryptographic implementations:</dt>
        <dd>Cryptographic implementations that do not adhere to best practices or security standards, leading to vulnerabilities.</dd>
        <dt>NIST:</dt>
        <dd>National Institute of Standards and Technology, a U.S. federal agency that develops and promotes measurement standards, including cryptographic standards.</dd>
        <dt>Post-Quantum Cryptography (PQC):</dt>
        <dd>Cryptographic algorithms designed to be secure against attacks from quantum computers, such as lattice-based or hash-based algorithms which have been standardized by NIST.</dd>
        <dt>Alias analysis:</dt>
        <dd>A technique used in program analysis to determine if two expressions in a program may refer to the same memory location.</dd>
      </dl>
    </section>
    <section anchor="solution" title="Proposed Solution: FLAIR Framework">  
      <t>
        This draft proposes a new framework for cryptographic discovery that combines the strengths of both existing approaches: the precision of language-specific tools and the coverage of general-purpose static analyzers. There is a critical need for a language-agnostic approach that can (1) identify the usage of cryptographic libraries, (2) determine the cryptographic parameters supporting their usage (keys, nonces, salts), and (3) specify which cryptographic algorithms are being invoked. FLAIR takes a novel semantic approach that detects cryptographic libraries, how the algorithm methods in these libraries are invoked throughout the code, and the associated cryptographic parameters. It overcomes the limitations of existing solutions by producing a unified JSON representation across diverse languages (Python, Java, JavaScript, etc.) to support cross-language rule development.
      </t>
         <section anchor="design-principles" title="Design Principles">
        <t>
          To the authors' knowledge, FLAIR is the first cryptographic detection tool to achieve language independence through semantic graphs, an approach that has demonstrated effectiveness in other domains such as the biological sciences <xref target="ref3"/>. It makes cryptographic discovery independent of function nomenclature and variable naming conventions: the detection of relationships between libraries, keys, and function calls is unaffected by variable renaming or library substitution. This is expected to reduce false positives compared to pattern-matching solutions. Additionally, FLAIR can detect keys propagated across multiple function calls throughout an entire codebase.
        </t>
        <t>
          FLAIR detects and stores only cryptography-relevant information. By restricting the graph to crypto-relevant nodes and edges, FLAIR remains lightweight and efficient even on large codebases.
        </t>
      </section>
      <section anchor="architecture" title="Technical Components and Architecture">
              <t>
          FLAIR consists of two main components:
        </t>
        <ol>
          <li>Feature Extraction and Normalization: This component processes source code from multiple programming languages to extract cryptography-relevant information and normalize it into a unified intermediate representation (IR).</li>
          <li>Crypto Semantic Graph (CSG) Construction and Backtracking: This component constructs a Crypto Semantic Graph (CSG) from the normalized IR, enabling multi-hop backtracking to identify relationships between cryptographic elements and their usage in code.</li>
        </ol>
        <t>
          The following diagram illustrates the overall FLAIR technical flow and components.
        </t>
        <figure anchor="fig-flair" title="FLAIR process flow diagram.">
          <artwork type="ascii-art"><![CDATA[
                   +---------------------------+
                   |   Multi-Language Sources  |
                   |---------------------------|
                   | Python | Java | JS | Rust |
                   | C/C++ | ... (etc.)        |
                   +------------+--------------+
                                |
                                v
      +-----------------------------------------------------+
      |    Step 1: Feature Extraction and Normalization     |
      |-----------------------------------------------------|
      |     Parser & Normalizer (Tree-sitter / ANTLR)       |
      |                         |                           |
      |                         v                           |
      |       FLAIR IR Builder (Crypto-specific JSON)       |
      |                         |                           |
      |                         v                           |
      |         IR Semantic Nodes (CALL, ENT, SYM)          |
      +-------------------------+---------------------------+
                                |
                                v
      +------------------------------------------------------+
      | Step 2: Crypto Semantic Graph (CSG) and Backtracking |
      |------------------------------------------------------|
      |           Crypto Semantic Graph (CSG)                |
      |                         |                            |
      |                         v                            |
      |               Multi-hop Backtracker                  | 
      |                         |                            |
      |                         v                            |
      |             Compliance / Rules Engine                |
      |(as needed to flag weak keys, deprecated algos, etc.) |
      +-------------------------+----------------------------+
                                |
                                v
                      +------------------+
                      |     Database     |
                      +------------------+
          ]]></artwork>
        </figure>
      </section>
   
    </section>

    <section anchor="technical" title="Technical Details">
      <section anchor="step1" title="Step 1: Feature Extraction and Normalization">
        <t>
          The first step uses a general-purpose parser (such as Tree-sitter or ANTLR) to parse cryptographic library usage and related code constructs. Language-specific components are then generalized to their core algorithms and specifications (e.g., "ECDSA" or "RSA", with parameters such as a key size of "256"). The JSON produced by the parser is converted into a FLAIR Intermediate Representation (IR) graph, also serialized as JSON. The IR contains only mappings between detected libraries and their usage in code. Its graph structure consists of nodes, which represent cryptographic information with contextual metadata, and edges, which represent relationships between that information and its usage in code.
        </t>
        <t>
          Four sub-processes execute sequentially in this step:
        </t>
        <ol>
          <li>Library Detection: Identifies cryptographic libraries through import statements and source references (e.g., OpenSSL). FLAIR uses a partial parser (e.g., a Tree-sitter or ANTLR grammar) for each language. A per-language adapter walks the AST to find calls, imports, and literal constructions. Only crypto-relevant code is analyzed.</li>
          <li>Normalized Crypto API Call Detection: Detects API calls to cryptographic functions and normalizes them to primitive representations (e.g., CryptoJS.AES.encrypt() becomes AES.encrypt). A central table maps specific API patterns or strings to normalized names. For example, all patterns matching cryptography.*.ec.ECDSA normalize to ECDSA, and the argument "SHA-384" normalizes to SHA384. This unifies library calls (OpenSSL, CryptoJS, WebCrypto) under common primitives.</li>
          <li>Cryptographic Entity Creation Identification: Detects generation of key pairs, nonces, message digests, and related cryptographic materials.</li>
          <li>Cipher Normalization: Establishes unified, language-independent naming for detected ciphers (e.g., all cryptography.*.ec.ECDSA combinations and patterns map to ECDSA).</li>
        </ol>
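        <t>
          The central normalization table described above could be sketched as an ordered list of regular-expression rules. This is a minimal illustration, not FLAIR's actual table: the names NORMALIZATION_RULES and normalize are hypothetical, and the three rules only cover the examples given in this section.
        </t>
        <figure>
          <artwork type="python"><![CDATA[
import re

# Hypothetical normalization table (an assumption for illustration):
# each rule maps a regex over a qualified API name or string argument
# to a normalized primitive name. Backreferences keep the operation.
NORMALIZATION_RULES = [
    (re.compile(r"^cryptography\..*\.ec\.ECDSA$"), "ECDSA"),
    (re.compile(r"^CryptoJS\.AES\.(encrypt|decrypt)$"), r"AES.\1"),
    (re.compile(r"^SHA-?(\d+)$", re.IGNORECASE), r"SHA\1"),
]


def normalize(name):
    """Return the normalized primitive for name, or None if no rule matches."""
    for pattern, replacement in NORMALIZATION_RULES:
        match = pattern.match(name)
        if match:
            # expand() substitutes captured groups into the replacement
            return match.expand(replacement)
    return None
]]></artwork>
        </figure>
        <t>
          Under these assumed rules, normalize("SHA-384") yields SHA384 and normalize("CryptoJS.AES.encrypt") yields AES.encrypt, while names that match no rule are treated as not crypto-relevant.
        </t>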
        <t>
          Detected ciphers, API calls, and entity information are stored in JSON format for downstream processing. FLAIR abstracts crypto-related constructs into three node types: (1) CALL represents a crypto API invocation (e.g., AES.new() or MessageDigest.getInstance("SHA1")), (2) ENT represents data entities (keys, IVs, nonces, constants), and (3) SYM represents symbolic identifiers (variables, parameters, aliases). This reduces language-specific syntax to a uniform, crypto-specific IR in JSON.
        </t>
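        <t>
          One way to picture the three node types is as a small typed record serialized to JSON. The sketch below is illustrative only: this document does not define the IR schema, so the field names (node_id, kind, name, file, line, meta) and the sample values are assumptions.
        </t>
        <figure>
          <artwork type="python"><![CDATA[
import json
from dataclasses import dataclass, field, asdict


@dataclass
class IRNode:
    """Hypothetical IR node; field names are assumptions, not FLAIR's schema."""
    node_id: str
    kind: str    # "CALL", "ENT", or "SYM"
    name: str    # normalized primitive, entity role, or identifier
    file: str
    line: int
    meta: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject anything outside the three node types named in the text
        if self.kind not in ("CALL", "ENT", "SYM"):
            raise ValueError(f"unknown node kind: {self.kind}")


def to_ir_json(nodes):
    """Serialize a list of IR nodes into a crypto-specific JSON document."""
    return json.dumps({"nodes": [asdict(n) for n in nodes]}, indent=2)


# Illustrative nodes for: session_key = <256-bit key>; AES.new(session_key)
nodes = [
    IRNode("n1", "CALL", "AES.new", "app/enc.py", 12, {"mode": "GCM"}),
    IRNode("n2", "ENT", "key", "app/enc.py", 10, {"size_bits": 256}),
    IRNode("n3", "SYM", "session_key", "app/enc.py", 10),
]
ir_json = to_ir_json(nodes)
]]></artwork>
        </figure>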

 
      </section>
      <section anchor="step2" title="Step 2: Crypto Semantic Graph (CSG) and Backtracking">
        <t>
          Step 2 adds contextual information to the three node types produced in Step 1. The IR JSON graph is transformed into a Crypto Semantic Graph (CSG): a semantic graph whose nodes store rich entity information and whose edges represent relationships between nodes.
        </t>
        <t>
          The CSG construction process is as follows:
        </t>
        <ol>
          <li>Node typing: Each node carries its type (CALL, ENT, or SYM) and the type-specific metadata recorded in the IR.</li>
          <li>Edge creation: Edges show connections between elements (e.g., an API function calling into a specific library). Within a single function (intra-procedural), a def-use analysis on the IR graph tracks where variables are defined and where they are used in function calls. Across functions (inter-procedural), alias analysis follows dependencies beyond function boundaries. Mappings of variable aliases are maintained so that related operations remain connected even if variable names change. For example, if a key is generated (key = genKey()) and later passed to another call (signer.init(key)), these operations can be linked regardless of renaming.</li>
          <li>Multi-hop backtracking: Determines relationship directionality and traces the origins of cryptographic elements. When a relationship can be established between functions, an edge is added to the CSG. Otherwise, the edge can be temporarily marked as unresolved and optionally retried with deeper alias refinement. Edges capture the following semantic flows: (1) flows_to: data dependency (key passed to AES call), (2) produces: function output (AES.new → cipher object), (3) alias_of: variable alias tracking (key = masterKey), (4) taints: external/untrusted source flows (IV from user input). The result is a CSG of cryptographic interactions, not just isolated API calls to libraries.</li>
        </ol>
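        <t>
          The edge-creation step can be illustrated with a simplified def-use pass over one function. This is a sketch under stated assumptions rather than FLAIR's implementation: the statement tuples ("def", "alias", "use") are a hypothetical stand-in for the IR, and only the produces, alias_of, and flows_to edge types plus the unresolved marker from the list above are modeled.
        </t>
        <figure>
          <artwork type="python"><![CDATA[
def build_edges(statements):
    """Build (source, target, edge_type) triples from hypothetical IR statements.

    Each statement is one of:
      ("def", var, call)       var = call(...)
      ("alias", var, source)   var = source
      ("use", call, [args])    call(arg, ...)
    """
    edges = []
    defined = set()
    for stmt in statements:
        op = stmt[0]
        if op == "def":
            _, var, call = stmt
            edges.append((call, var, "produces"))
            defined.add(var)
        elif op == "alias":
            _, var, source = stmt
            edges.append((var, source, "alias_of"))
            defined.add(var)
        elif op == "use":
            _, call, args = stmt
            for arg in args:
                # A use with no known definition is kept but marked unresolved
                kind = "flows_to" if arg in defined else "unresolved"
                edges.append((arg, call, kind))
    return edges


def resolve_alias(var, edges):
    """Follow alias_of edges from var back to the name it ultimately refers to."""
    aliases = {src: dst for src, dst, kind in edges if kind == "alias_of"}
    seen = set()
    while var in aliases and var not in seen:   # seen guards against cycles
        seen.add(var)
        var = aliases[var]
    return var


# key = genKey(); k2 = key; signer.init(k2)
edges = build_edges([
    ("def", "key", "genKey"),
    ("alias", "k2", "key"),
    ("use", "signer.init", ["k2"]),
])
]]></artwork>
        </figure>
        <t>
          In this example, resolve_alias("k2", edges) returns "key", so the call to signer.init can be linked to genKey even though the variable was renamed.
        </t>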
        <t>
          Multi-hop backtracking traverses the Crypto Semantic Graph (CSG) backwards from a cryptographic sink (e.g., AES.encrypt, ECDSA.sign) to reconstruct the full provenance of all security-critical parameters (keys, nonces, salts, libraries, and related cryptographic materials). Each hop corresponds to a semantically meaningful dependency edge in the CSG.
        </t>
       
         <figure>
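          <artwork type="python"><![CDATA[
from collections import deque

def backtrace_entities(graph, start_call,
                       target_roles=("key", "iv", "nonce", "salt"),
                       max_hops=50):
    # Assumes graph.in_edges(node) yields (src, dst, edge_type) triples
    # and that entity nodes expose a `role` attribute.
    queue = deque([(start_call, 0)])
    visited = set()
    lineage = {role: [] for role in target_roles}
    while queue:
        node, hops = queue.popleft()
        if node in visited:               # skip nodes already expanded
            continue
        visited.add(node)
        for src, _dst, edge_type in graph.in_edges(node):
            if edge_type != "flows_to":   # follow data dependencies only
                continue
            ent = src
            if getattr(ent, "role", None) in target_roles:
                lineage[ent.role].append(ent)
            if hops >= max_hops:          # depth bound reached: stop expanding
                continue
            for producer, _ent, producer_type in graph.in_edges(ent):
                if producer_type == "produces":   # call that created ent
                    queue.append((producer, hops + 1))
    return lineage
]]></artwork>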
        </figure>
        <t> 
          This procedure finds all key, IV, nonce, and salt ancestors reachable from the sink within a bounded hop count. It can detect, for example, that an IV entity originates from a hard-coded literal. In practice, traversal depth is limited (e.g., to fewer than 50 hops) and traversal stops at function or class boundaries.
        </t>
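        <t>
          As a concrete illustration of the hard-coded IV case, the following sketch walks a tiny in-memory graph backwards from a sink and tags the finding when an IV's origin is a literal. The graph fragment, the metadata keys (role, literal), and the function names are assumptions made for this example; only the STATIC_IV tag name and the 50-hop bound come from this document.
        </t>
        <figure>
          <artwork type="python"><![CDATA[
from collections import deque

# Hypothetical CSG fragment: node -> list of (source, edge_type) in-edges.
IN_EDGES = {
    "AES.encrypt": [("iv_sym", "flows_to"), ("key_sym", "flows_to")],
    "iv_sym": [("iv_literal", "flows_to")],
    "key_sym": [("os.urandom", "produces")],
}

# Hypothetical node metadata: entity role, and whether a node is a literal.
NODE_META = {
    "iv_sym": {"role": "iv"},
    "key_sym": {"role": "key"},
    "iv_literal": {"literal": True},
}


def origins(node, max_hops=50):
    """Return the root nodes reached by walking in-edges, up to max_hops deep."""
    roots, seen = set(), {node}
    queue = deque([(node, 0)])
    while queue:
        current, depth = queue.popleft()
        parents = IN_EDGES.get(current, [])
        if not parents or depth == max_hops:
            roots.add(current)          # no ancestors, or depth bound reached
            continue
        for source, _edge_type in parents:
            if source not in seen:
                seen.add(source)
                queue.append((source, depth + 1))
    return roots


def risk_tags(sink):
    """Tag a sink with STATIC_IV when an IV flowing into it is a literal."""
    tags = set()
    for source, edge_type in IN_EDGES.get(sink, []):
        role = NODE_META.get(source, {}).get("role")
        if edge_type == "flows_to" and role == "iv":
            if any(NODE_META.get(o, {}).get("literal") for o in origins(source)):
                tags.add("STATIC_IV")
    return sorted(tags)
]]></artwork>
        </figure>
        <t>
          Here risk_tags("AES.encrypt") reports STATIC_IV because the IV symbol traces back to a literal, whereas the key traces back to a call and is not tagged.
        </t>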
      </section>
      <section anchor="example-output" title="Example Use Case Output">
        <t>
          Consider a developer using cryptographic libraries to encrypt API traffic to web servers. The application must import cryptographic libraries, invoke algorithm-specific functions, and generate the required keys. FLAIR detects the called algorithms and records them in the semantic graph JSON output. A post-processing step assembles per-file graphs into a repository-level CSG. Findings are emitted in JSON or SARIF format. Each finding includes the file, line, primitive, hash, OID, confidence, tags (e.g., STATIC_IV), and a key/IV lineage summary. An example finding follows:
        </t>
        <figure>
          <artwork type="json">
          {
            "file": "src/crypto/sign.py",
            "line": 42,
            "primitive": "ECDSA",
            "hash": "SHA256",
            "risk_tags": [],
            "lineage": {
              "key": [{"origin":"gen_keypair","curve":"secp256r1"}],
              "iv": []
            },
            "explanation": "ECDSA(SHA256) with EC P-256 key; lineage consistent."
          }
          </artwork>
        </figure>
        <t>
          The finding reveals the source file, the algorithms used, and the associated key pairs. If the function used SHA1 instead of SHA256, FLAIR would flag the use of a deprecated algorithm. Because the approach is language-independent, cryptographic discovery is unaffected by function nomenclature or library switching, which makes it more robust than traditional pattern-matching approaches. The output can be stored in a database for further analysis or reporting, and can be extended to flag compliance violations based on organizational policies (e.g., deprecated algorithms or weak key sizes).
        </t>
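        <t>
          A policy check over findings of this shape might look like the following sketch. It is not a normative rule set: the tag names DEPRECATED_HASH and WEAK_KEY, the "bits" lineage field, and the 2048-bit RSA threshold are assumptions chosen for illustration, while MD5 and SHA1 are the deprecated algorithms named in the Definitions section.
        </t>
        <figure>
          <artwork type="python"><![CDATA[
# Hypothetical policy: names and thresholds are illustrative, not normative.
DEPRECATED_HASHES = {"MD5", "SHA1"}
MIN_RSA_KEY_BITS = 2048


def evaluate(finding):
    """Return a sorted list of policy tags for one finding dictionary."""
    tags = set(finding.get("risk_tags", []))
    if finding.get("hash") in DEPRECATED_HASHES:
        tags.add("DEPRECATED_HASH")
    if finding.get("primitive") == "RSA":
        for key in finding.get("lineage", {}).get("key", []):
            # "bits" is an assumed field; keys without it are not flagged
            if key.get("bits", MIN_RSA_KEY_BITS) < MIN_RSA_KEY_BITS:
                tags.add("WEAK_KEY")
    return sorted(tags)


# The example finding from this section, reduced to the fields used here
finding = {
    "file": "src/crypto/sign.py",
    "line": 42,
    "primitive": "ECDSA",
    "hash": "SHA256",
    "risk_tags": [],
    "lineage": {
        "key": [{"origin": "gen_keypair", "curve": "secp256r1"}],
        "iv": [],
    },
}
]]></artwork>
        </figure>
        <t>
          With this assumed policy, the example finding produces no tags, while the same finding with "hash" set to "SHA1" is tagged DEPRECATED_HASH.
        </t>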
        <t>
        Organizations implementing FLAIR will need to write custom rules for their cryptographic policies. Additional operational considerations include:
        </t>
      <ul>
        <li>Establishment of cryptographic standards and algorithm lifecycle policies</li>
        <li>Integration of FLAIR output with existing security scanning and policy enforcement pipelines</li>
        <li>Training for security and development teams on interpreting semantic graph outputs</li>
        <li>Periodic updates to detection rules for emerging cryptographic algorithms and threats</li>
      </ul>
      </section>
    </section>

    

    <section anchor="security" title="Security Considerations">
      <t>
        The framework does not itself process sensitive cryptographic material; it analyzes code patterns and dependencies. Users should ensure appropriate access controls on source code repositories and generated reports containing cryptographic dependency information.
      </t>
    </section>

    <section anchor="conclusion" title="Conclusion">
      <t>
        FLAIR advances cryptographic element discovery by providing language-agnostic, semantic graph-based analysis that addresses limitations of existing general-purpose and language-specific tools. By enabling comprehensive identification of cryptographic usage across polyglot enterprise environments, FLAIR facilitates compliance with cryptographic standards, detection of policy violations, and proactive migration to quantum-safe cryptography.
      </t>
      <t>
        The framework's restriction of the graph to crypto-relevant nodes and edges is intended to support scalability across large codebases while maintaining the precision necessary for accurate security analysis and policy enforcement.
      </t>
    </section>
  </middle>

  <back>
    <references title="References">
      <reference anchor="ref1" target="https://dl.acm.org/doi/full/10.1145/3507682">
        <front>
          <title>Industrial Experience of Finding Cryptographic Vulnerabilities in Large-scale Codebases</title>
          <author fullname="Ya Xiao" initials="Y." surname="Xiao"/>
          <author fullname="Yang Zhao" initials="Y." surname="Zhao"/>
          <author fullname="Nicholas Allen" initials="N." surname="Allen"/>
          <author fullname="Nathan Keynes" initials="N." surname="Keynes"/>
          <author fullname="Danfeng (Daphne) Yao" initials="D." surname="Yao"/>
          <author fullname="Cristina Cifuentes" initials="C." surname="Cifuentes"/>
          <author>
            <organization>ACM Digital Library</organization>
          </author>
          <date year="2023" />
        </front>
      </reference>
      <reference anchor="ref2" target="https://www.splunk.com/en_us/blog/learn/static-code-analysis.html">
        <front>
          <title>Static Code Analysis: The Complete Guide to Getting Started with SCA</title>
          <author fullname="Shanika Wickramasinghe" initials="S." surname="Wickramasinghe"/>
          <author>
            <organization>Splunk</organization>
          </author>
          <date year="2025" />
        </front>
      </reference>
      <reference anchor="ref3" target="https://pmc.ncbi.nlm.nih.gov/articles/PMC5860058/">
          <front>
              <title>Neuro-symbolic Representation Learning on Biological Knowledge Graphs</title>
              <author fullname="Mona Alshahrani" initials="M." surname="Alshahrani"/>
              <author fullname="Mohammad Asif Khan" initials="M." surname="Khan"/>
              <author fullname="Omar Maddouri" initials="O." surname="Maddouri"/>
              <author fullname="Akira R Kinjo" initials="A." surname="Kinjo"/>
              <author fullname="Núria Queralt-Rosinach" initials="N." surname="Queralt-Rosinach"/>
              <author fullname="Robert Hoehndorf" initials="R." surname="Hoehndorf"/>
              <date year="2017"/>
          </front>
        </reference> 
    </references>
  </back>
</rfc>