We started working on Code-Analyser with a very simple idea: How can we parse large GitHub repositories more efficiently?
Initially, the goal was not to build an agent or a complex system. The motivation came from a typical developer problem: spending hours on a GitHub repository, trying to figure out which function is helpful for the task at hand. As the project evolved, we gradually added better parsing strategies, metadata handling, and incremental analysis. Over time, this application evolved into what we now call Code-Analyser, although that was never the original plan.
Giving this application a LangGraph-based backend unlocked several advantages, which we will discuss throughout this article, including:
- How we handle large repositories without parsing everything
- How we retrieve only relevant information from user queries
- Why parsing only when the user asks is a big win
- How the LangGraph state schema and nodes are designed
Don’t worry if some of these ideas feel unfamiliar right now; they are explained step by step in the sections below.
What is Code Analyser?
Code-Analyser is a LangGraph-based code analysis engine that uses Gemini 3 as the LLM for reasoning and final responses. It is built using the nodes, edges, and shared state concepts provided by the LangGraph framework.
At a high level, Code-Analyser allows you to load a GitHub repository and then ask natural-language questions about the codebase, such as:
- “Where is the data ingestion logic implemented?”
- “Which file handles authentication?”
- “Explain how the inference pipeline works.”
Instead of manually searching through files, the system analyzes the repository structure, parses only the relevant files as needed, and produces clear answers.
However, this was not the original goal. The project began with a much simpler objective: To avoid wasting time navigating large repositories to understand how something works. Once we added efficient repository parsing, metadata tracking, and incremental parsing, the first usable version of Code-Analyser was ready.
What is LangGraph?
Choosing the proper orchestration framework was a critical decision because it would serve as the backbone of Code-Analyser. The system required more than simple sequential execution.
Code-Analyser needed something that could perform operations like:
- Conditional execution paths
- Looping logic
- Persistent state across multiple steps
- Clean coordination between independent tasks
Directed Acyclic Graph
This is where LangGraph fits perfectly. LangGraph is an orchestration framework built around an adaptive Directed Acyclic Graph (the one shown in the image above). Instead of writing a single large script, we define small nodes and connect them via edges, all while sharing a central state.
Why LangGraph fits Code-Analyser
Node-level logic: Each node in LangGraph is responsible for one specific task, for example, fetching metadata, analyzing a query, parsing files, or summarizing results. This makes the system modular and easier to reason about.
Shared state: LangGraph uses a shared, mutable state that is passed between nodes. Each node reads from this state and returns only the information it updates. LangGraph then merges these updates back into the global state.
Clean orchestration flow: The general execution pattern using LangGraph looks like this:
- A node receives the current shared state
- It performs a small, focused operation
- It returns a minimal state update
- LangGraph merges the update (or state delta) into the shared state and moves to the next node
A basic implementation of the LangGraph nodes using Python is shown below:
from langgraph.graph import StateGraph, MessagesState, START, END

def mock_llm(state: MessagesState):
    # Stand-in for a real LLM call: return a single AI message.
    return {"messages": [{"role": "ai", "content": "hello world"}]}

# One node wired between START and END, sharing the MessagesState schema.
graph = StateGraph(MessagesState)
graph.add_node(mock_llm)  # the node name defaults to the function name
graph.add_edge(START, "mock_llm")
graph.add_edge("mock_llm", END)
graph = graph.compile()

graph.invoke({"messages": [{"role": "user", "content": "hi!"}]})
If you want a thorough overview of creating agents for tasks like browser automation and self-correcting code, visit our other articles at LearnOpenCV.
Why Code-Analyser is better at Codebase Understanding
Code-Analyser is designed to do more than scan a repository for files. It builds a structured understanding of the entire codebase, enabling natural language queries that feel conversational and context-aware.
Once a repository is loaded, Code-Analyser crawls the project directory and builds an initial state that captures key structural elements. Rather than parsing every file up front, it uses an on-demand approach: when a user asks a question, the system identifies the parts of the codebase relevant to that query and parses only those files. Parsed files are then cached, and previous conversations are retained. This allows follow-up questions to be answered faster and with better continuity.
Code-Analyser focuses on query intent, not just keyword matching. It tries to understand what the user is actually looking for, such as a pipeline, a module, or a specific responsibility, and selects files accordingly.
In practice, these capabilities of Code-Analyser yield an interactive experience that lets us explore complex codebases naturally. Instead of manually searching through hundreds of files, we can ask questions such as “Where is the data ingestion logic?” or “Explain how authentication works,” and Code-Analyser will locate the relevant code, summarize it in human-friendly terms, and recall related topics as the conversation progresses.
**Download Code** To easily follow along with this tutorial, please download the code by clicking on the button below. It’s FREE!

Installation
git clone https://github.com/bhomik749/Code-Analyser
cd Code-Analyser
pip install -r requirements.txt
Set environment variables in a .env file, as sketched after this list:
- GitHub token
- LLM keys
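For reference, a minimal .env might look like the snippet below. The variable names here are placeholders; use whichever names your GitHub client and Gemini configuration actually expect:
# .env (placeholder values)
GITHUB_TOKEN=ghp_your_personal_access_token
GOOGLE_API_KEY=your_gemini_api_key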
State Schema design in Code-Analyser
The state schema serves as the shared memory layer across all LangGraph nodes and enables the system to coordinate repository indexing, incremental parsing, and multi-turn conversational reasoning.
The state schema that we have designed for Code-Analyser is provided in the code block below:
from typing import Annotated, Any, Dict, List, Sequence, TypedDict, Union
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class Agent_State(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]  # chat history, appended via reducer
    url: Union[str, None]                  # GitHub repository URL
    repo_tree: Dict[str, Any]              # lightweight file-tree metadata
    global_context: Union[str, None]       # LLM-generated repository blueprint
    selected_files: List[Dict[str, Any]]   # files relevant to the current query
    unselected_files: List[str]            # files skipped because already parsed
    parsed_files: List[Dict[str, str]]     # persistent per-file parse cache
    intent: str                            # structured signals from query_analyser_node
    keywords: List[str]
    targets: Dict[str, Any]
    summary: str                           # final LLM answer for the query
    llm: LLM                               # chat-model handle (Gemini)
Incremental Parsing and File-Level Caching
The parsed_files field acts as a persistent cache that stores the content of every file that has already been processed. When a new user query arrives, the system first checks this cache to determine whether the required files have already been parsed. If so, parsing is skipped entirely, avoiding redundant computation and unnecessary LLM calls. This is what we call incremental parsing.
The unselected_files field is maintained purely for traceability and observability. It records which files were skipped during a query because they already existed in the parsed_files state variable. This field makes debugging and performance analysis easier.
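As a rough illustration of this cache check (not the exact Code-Analyser implementation, and assuming each file entry carries a path field), the selection step can be thought of as splitting candidate files into a to-parse list and an already-cached list:
def split_cached_and_new(selected_files, parsed_files):
    """Separate files that still need parsing from files already in the cache."""
    cached_paths = {entry["path"] for entry in parsed_files}
    to_parse, already_cached = [], []
    for f in selected_files:
        if f["path"] in cached_paths:
            already_cached.append(f["path"])  # later recorded in unselected_files
        else:
            to_parse.append(f)                # will be fetched and parsed
    return to_parse, already_cached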
Intent-Driven Execution Flow
To keep execution focused, the schema explicitly separates query understanding from code analysis. The fields intent, keywords, and targets store the structured output of the query_analyser_node. This intermediate representation maps the user’s request to the information the system needs to find. As a result, file selection and parsing decisions become both faster and more accurate.
To get a closer look at how query_analyser_node actually extracts essential details from the query, refer to the following code snippet implemented in Python:
import re

INTENT_PATTERNS = {
    "function_usage": [
        r"where.*function",
        r"usage of",
        r"where is .* used",
        r"who calls",
        r"find usages",
    ],
    "type_lookup": [
        r"type of",
        r"what.*type",
        r"datatype of",
    ],
    "pipeline_flow": [
        r"pipeline",
        r"flow",
        r"process flow",
        r"execution flow",
        r"data flow",
    ],
    "directory_question": [
        r"what'?s inside",
        r"what is inside",
        r"show.*directory",
        r"explain.*directory",
        r"what does .* folder",
    ],
    "architecture_summary": [
        r"architecture",
        r"overall structure",
        r"design",
    ],
}
# ... helper functions detect_intent, extract_keywords, and extract_targets omitted ...

def query_analyser_node(state):
    # Pull the most recent human message out of the conversation history.
    user_query = ""
    messages = state.get("messages", [])
    for msg in reversed(messages):
        if isinstance(msg, HumanMessage):
            user_query = msg.content
            break

    # Convert the raw question into structured signals for downstream nodes.
    intent = detect_intent(user_query)
    keywords = extract_keywords(user_query)
    targets = extract_targets(user_query)

    return {
        "intent": intent,
        "keywords": keywords,
        "targets": targets,
    }
Conversation Management with Reducers
Supporting multi-turn conversations requires handling conversational state efficiently. In a typical state machine, if a node returns a state field, it overwrites the previous value. This behavior is problematic for chat history, as it would erase prior messages on every iteration.
To address this, Code-Analyser declares the messages state variable with the Annotated type and add_messages as a reducer function, as seen in the state schema code block above. The reducer instructs LangGraph to append new messages to the existing message history rather than replacing it, preserving conversational context across turns.
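The snippet below is a small standalone illustration of this behavior; it is not part of Code-Analyser, but shows how the add_messages reducer merges an update into existing history instead of overwriting it:
from langchain_core.messages import AIMessage, HumanMessage
from langgraph.graph.message import add_messages

history = [HumanMessage(content="Where is the data ingestion logic?")]
update = [AIMessage(content="It is implemented in the ingestion module.")]

# The reducer appends the new message instead of replacing the history.
merged = add_messages(history, update)
print(len(merged))  # 2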
Contextualizing LLM Reasoning
The global_context field stores a high-level blueprint of the entire repository. It is intentionally lightweight (far smaller than parsing the whole codebase) but still sufficient to describe what major files and modules do and how they fit together.
The human prompt and the system_msg prompt provided to Gemini to generate a global-context summary of the entire codebase are shown in the code snippet below:
prompt = f"""
You are an expert software architect.
Below is a summary of a GitHub repository structure and small snippets from key files.
### File Structure (first 60 files):
{tree_summ}
### Key File Headers:
{headers if headers else 'No key files found.'}
Please describe in 5–8 sentences:
1. The overall purpose of this repository.
2. The main components or modules and their likely roles.
3. How these modules might interact logically (e.g., data → model → evaluation).
4. Which parts appear to be core, supporting, or documentation.
"""
system_msg = SystemMessage(
    content="""
    You are an expert GitHub repository summarizer.
    Provide insights on what functions and modules
    are present in the repository and how they are connected
    to each other, helping the user understand
    the repository in layman's terms where possible.
    """)
The summary field then captures the final LLM-generated response. This output is produced by utilizing the user query, the global context, and the parsed contents of the selected files.
End-To-End Code-Analyser Pipeline
In this section, we walk through the complete Code-Analyser workflow and explain how the system operates at the node level. The application is designed around two distinct responsibilities: repository indexing and multi-turn conversational analysis, as shown in the code block below. To support these responsibilities efficiently, Code-Analyser is implemented using two separate LangGraph workflows, each optimized for a specific interaction phase.
This separation allows the system to avoid redundant computation, reduce latency, and progressively improve performance as the conversation evolves.
Why Two Workflows?
index_app = indexing_workflow.compile()
qa_app = qa_workflow.compile()
__all__ = ["index_app", "qa_app"]
The first workflow, the Repository Initialization / Indexing Workflow, is executed only once, i.e., when a GitHub repository is loaded for the first time. Its purpose is to establish a high-level understanding of the codebase by capturing static metadata and generating a global overview. By doing this up front, the system avoids recalculating the repository structure and global summaries for every user query.
The second workflow, the Conversational Q&A Workflow, is executed for every user question. It is responsible for interpreting user intent, selecting relevant files, performing incremental parsing when needed, and generating responses. This workflow builds directly on the indexed data produced by the initialization phase, enabling faster, more context-aware queries over time.
Together, these two workflows allow Code-Analyser to separate one-time structural analysis from repeated interactive reasoning, which is essential for scalability.
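To make the separation concrete, the sketch below shows one plausible way the two StateGraph workflows could be wired together. The node names match the nodes described in the sections that follow (stubbed here so the snippet stands alone), but the state stand-in, the edges, and the construction details are our assumptions rather than the repository's verbatim code:
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict, total=False):  # stand-in for the Agent_State schema above
    summary: str

# Stub nodes so the sketch compiles; the real implementations are described below.
def fetch_repo_metadata_node(state): return {}
def global_context_node(state): return {}
def query_analyser_node(state): return {}
def analyze_repo_node(state): return {}
def fetch_and_parse_node(state): return {}
def summarize_repo_node(state): return {}

# One-time indexing workflow: structure first, then a global overview.
indexing_workflow = StateGraph(AgentState)
indexing_workflow.add_node("fetch_repo_metadata_node", fetch_repo_metadata_node)
indexing_workflow.add_node("global_context_node", global_context_node)
indexing_workflow.add_edge(START, "fetch_repo_metadata_node")
indexing_workflow.add_edge("fetch_repo_metadata_node", "global_context_node")
indexing_workflow.add_edge("global_context_node", END)

# Per-question Q&A workflow: intent -> file selection -> parsing -> answer.
qa_workflow = StateGraph(AgentState)
qa_workflow.add_node("query_analyser_node", query_analyser_node)
qa_workflow.add_node("analyze_repo_node", analyze_repo_node)
qa_workflow.add_node("fetch_and_parse_node", fetch_and_parse_node)
qa_workflow.add_node("summarize_repo_node", summarize_repo_node)
qa_workflow.add_edge(START, "query_analyser_node")
qa_workflow.add_edge("query_analyser_node", "analyze_repo_node")
qa_workflow.add_edge("analyze_repo_node", "fetch_and_parse_node")
qa_workflow.add_edge("fetch_and_parse_node", "summarize_repo_node")
qa_workflow.add_edge("summarize_repo_node", END)

index_app = indexing_workflow.compile()
qa_app = qa_workflow.compile()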
Repo Initializer: Indexing Workflow
GitHub repository indexing pipeline
The indexing workflow creates the foundational knowledge required to understand the repository. It focuses on structure rather than deep file-level parsing and prepares the shared state on which subsequent analysis depends.
fetch_repo_metadata_node
The primary responsibility of this node is to construct the repository’s structural representation. It fetches the repository tree using the GitHub API without downloading file contents. This produces a lightweight repo tree object that records file paths, sizes, and extensions.
By avoiding full-file downloads at this stage, Code-Analyser can make high-level decisions such as filtering out binary files, identifying documentation (e.g., README), or prioritizing configuration and entry-point files without incurring unnecessary costs.
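A simplified sketch of this step is shown below. It uses the GitHub git/trees endpoint with recursive=1 to list every blob in one call; the helper name and the exact fields kept are assumptions for illustration, not the node's actual code:
import os
import requests

def fetch_repo_tree(owner: str, repo: str, branch: str = "main") -> list:
    """Fetch file paths, sizes, and extensions without downloading file contents."""
    url = f"https://api.github.com/repos/{owner}/{repo}/git/trees/{branch}?recursive=1"
    headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    tree = []
    for item in resp.json()["tree"]:
        if item["type"] != "blob":  # skip sub-trees, keep only files
            continue
        tree.append({
            "path": item["path"],
            "size": item.get("size", 0),
            "ext": os.path.splitext(item["path"])[1],
        })
    return tree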
global_context_node
This node builds a high-level semantic understanding of the repository. Using the previously constructed repo tree object, it selectively fetches and analyzes only a small set of key files, such as README files, setup scripts, main entry points, configuration files, and core application modules.
The output of this node is stored in the global_context state variable as a concise LLM-generated summary. This summary acts as a blueprint of the repository, describing its purpose, principal components, and overall architecture. The presence of this global context ensures that even highly specific questions about individual files are answered with awareness of the project’s broader goal (for example, understanding that a function belongs to a data ingestion pipeline rather than treating it in isolation).
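The filename heuristics below are a rough sketch of how such key files could be picked out of the repo tree (treating its keys as file paths); the actual rules used by global_context_node may differ:
KEY_FILE_HINTS = (
    "readme", "setup.py", "pyproject.toml", "requirements",
    "main.py", "app.py", "config",
)

def select_key_files(repo_tree: dict, limit: int = 15) -> list:
    """Pick a small set of files likely to describe the project's purpose."""
    key_paths = []
    for path in repo_tree:  # repo_tree maps file paths to lightweight metadata
        if any(hint in path.lower() for hint in KEY_FILE_HINTS):
            key_paths.append(path)
    return key_paths[:limit]  # keep the summarization context small and cheap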
Conversational Q&A: QA Workflow
Multi-chat conversational pipeline
The QA workflow handles all user interactions and leverages the indexed data from the initialization workflow. Each node in this workflow plays a specific role in transforming a natural-language query into a focused, context-aware answer.
query_analyser_node
This node interprets the user’s query by extracting structured signals such as intent, keywords, and targets. Rather than passing raw user input downstream, the system converts the question into a form that downstream nodes can reason over deterministically.
This step maps the user’s request to the system’s task, reducing unnecessary LLM overhead and preventing broad, unfocused analysis.
analyze_repo_node
The analyze_repo_node is a query-aware decision-making node. It determines which files are relevant to the current question by combining the extracted intent signals with the global context. Based on this analysis, it populates selected_files with only the files that need further inspection.
At the same time, the node checks against parsed_files to identify files that have already been processed but are not relevant to the current query. These are recorded in unselected_files, which avoids redundant parsing and exists purely for explainability and debugging purposes. This node is critical for keeping the LLM context window small and focused.
The code block below demonstrates the basic workflow of analyze_repo_node (the actual implementation differs from this simplified version):
def analyze_repo_node(state):
    # Keep only the files whose paths mention any of the extracted keywords.
    selected = []
    for file in state["repo_tree"]:
        if any(keyword in file for keyword in state["keywords"]):
            selected.append(file)
    return {
        "selected_files": selected
    }
fetch_and_parse_node
This node implements the core idea of incremental parsing. It iterates over the selected files, fetches file contents only when necessary, and parses each file using tools appropriate to its extension (.py, .ipynb, .md, etc.). The resulting structured representations are appended to the parsed_files cache.
The incremental nature of this step ensures that each file is parsed at most once per session. As more queries are asked, the system accumulates useful parsed knowledge, making subsequent queries faster and cheaper.
The code block below shows a basic workflow for fetch_and_parse_node:
# Tool registry: map each supported file extension to its parser.
PARSERS = {
    ".py": parse_python,
    ".md": parse_markdown,
    ".txt": parse_markdown,
    ".json": parse_json_yaml,
    ".yaml": parse_json_yaml,
    ".yml": parse_json_yaml,
    ".ipynb": parse_notebook,
}

# ... for each selected file that is not already cached, raw_content, path,
# and ext are fetched before the parsing step below ...

parser_fn = PARSERS.get(ext, None)
if parser_fn:
    try:
        parsed = parser_fn(raw_content)
    except Exception as e:
        parsed = f"<Error parsing file {path}: {e}>"
else:
    parsed = raw_content[:5000]  # token-safe limit for unsupported extensions

new_pf.append({
    "path": path,
    "ext": ext,
    "parsed": parsed,
})

# After the loop, merge the freshly parsed files with the existing cache.
updated_pf = new_pf + parsed_files
print(f"Total parsed files: {len(updated_pf)} files.")

return {
    "parsed_files": updated_pf,
    "messages": state.get("messages", []) + [
        SystemMessage(content=f"Fetched & parsed {len(new_pf)} files.")
    ],
}
summarize_repo_node
The final node generates the answer. It combines the user query, the global context, and the relevant entries from parsed files to generate a coherent, multi-turn response. The output is stored in the summary state variable and returned to the user. This is the stage at which the LLM’s reasoning capability is most heavily used, but it operates only within a curated context.
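A rough sketch of what this final step can look like is shown below. The prompt wording, the path-based filtering of parsed files, and the llm.invoke call on the state's model handle are assumptions for illustration; the actual node differs:
from langchain_core.messages import HumanMessage, SystemMessage

def summarize_repo_node(state):
    # Keep only parsed entries belonging to the files selected for this query.
    selected_paths = {f.get("path") for f in state.get("selected_files", [])}
    relevant = [
        pf for pf in state.get("parsed_files", [])
        if pf["path"] in selected_paths
    ]
    file_context = "\n\n".join(f"### {pf['path']}\n{pf['parsed']}" for pf in relevant)

    # Recover the latest user question from the conversation history.
    user_query = ""
    for msg in reversed(state.get("messages", [])):
        if isinstance(msg, HumanMessage):
            user_query = msg.content
            break

    prompt = (
        f"Repository overview:\n{state.get('global_context', '')}\n\n"
        f"Relevant files:\n{file_context}\n\n"
        f"Question: {user_query}"
    )
    response = state["llm"].invoke([
        SystemMessage(content="You answer questions about this codebase."),
        HumanMessage(content=prompt),
    ])
    return {"summary": response.content, "messages": [response]}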
State Evolution Across Queries
The benefits of this design become clearer when observing how the state evolves over multiple queries. During the first question, the system may rely heavily on the global context and trigger parsing for a small number of files, which are then stored in the parsed_files state variable. On a follow-up question, the system again extracts intent but finds that many required files already exist in the cache, so it records them in the unselected_files state variable instead of re-parsing them. By the third or fourth deep-dive query, the system often has enough accumulated file-level knowledge to respond quickly, reusing previously parsed content while fetching new files only when strictly necessary.
Why This Design Scales to Large Repositories
Code-Analyser scales effectively by aggressively controlling both context size and compute cost. Rather than feeding the entire repository into the LLM, the system relies on lightweight structural metadata, a compact global context, and query-driven file selection. This design makes Code-Analyser suitable not only for small projects but also for large, real-world repositories with hundreds or thousands of files.
Failure Modes and Trade-offs
Like any query-driven system, Code-Analyser has trade-offs. The primary failure mode occurs when a critical dependency file is not selected during analysis, such as a shared utility module or a configuration file that implicitly influences behavior. Additionally, heuristic-based intent extraction is fast and cost-effective, but it can struggle with highly ambiguous queries compared to embedding-based retrieval; for example, when a user asks “tell me more about this file”, the system may not be able to resolve what “this” refers to.
Future iterations of Code-Analyser could introduce hybrid retrieval strategies, such as combining heuristic selection with embedding-based fallback, to further improve robustness without sacrificing efficiency.
Summary
By orchestrating two complementary LangGraph workflows and maintaining a carefully designed shared state, Code-Analyser achieves an efficient balance between performance, scalability, and depth of understanding. The system avoids redundant computation, protects the LLM context window, and progressively builds knowledge as conversations evolve, making it a practical and extensible foundation for large-scale code analysis.
References
Code-Analyser: https://github.com/bhomik749/Code-Analyser
Why doesn’t Code-Analyser parse the entire repository upfront?
Parsing an entire repository can be slow, expensive, and unnecessary, especially for large projects. Code-Analyser uses an on-demand parsing approach, meaning files are parsed only when the user’s query requires them. This keeps the system fast, reduces token usage, and prevents the LLM from being overwhelmed by irrelevant information.
Why is selected_files not stored using a reducer?
selected_files represents files relevant to the current query only. Using a reducer would cause selections from previous queries to accumulate, potentially adding more redundant context and reducing accuracy. For this reason, selected_files is intentionally reset on every query.
How does the system handle multi-turn conversations?
Conversation history is stored in the messages state variable using a reducer. This ensures that new messages are appended rather than replacing existing ones. As a result, Code-Analyser can understand follow-up questions and maintain context throughout a session.
Can Code-Analyser miss important files during selection?
Yes, in some cases. Since query intent drives file selection, particular dependency or utility files may be missed. However, this is often corrected naturally through follow-up questions. Future versions can also introduce fallback or hybrid retrieval strategies to reduce this risk.
Why not use embeddings or vector search from the start?
Embedding-based retrieval is robust but can be expensive and more complex to control. Code-Analyser starts with a lightweight, deterministic file selection process to keep behavior predictable. Embeddings can be added later as an optional enhancement or fallback.