Leveraging Codebase Knowledge Graphs for Agentic Code Generation

An agent generating code with language models can face several challenges without sufficient context about the codebase:

  • Duplicating functionality due to lack of awareness of existing functions and classes

  • Violating design patterns and architectures, such as directly querying the database instead of using a repository pattern

  • Incorrectly using newer libraries and frameworks due to gaps in the model's knowledge

  • Misaligning with business requirements and failing to fulfill intended use cases due to lack of domain context

  • Failing to integrate with existing data models and schemas, leading to duplicate fields and data inconsistencies

This article presents an approach for providing complete codebase context to a ReAct agent by:

  1. Mapping the function call graph into a graph database and detecting entry points into the server (a minimal sketch follows this list)

  2. Fetching relevant code at runtime along with file paths

  3. Generating function-level and entry-point-level explanations of the flow behind every entry point

  4. Serving the explanations and metadata through a vector database that the agent can query with natural language
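
To make step 1 concrete, here is a minimal sketch of persisting call edges in Neo4j with the official Python driver. The `Function` label, `CALLS` relationship, and `is_entry_point` property are illustrative assumptions rather than momentum-core's exact schema.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_call_edge(tx, caller: dict, callee: dict):
    # MERGE keeps the graph idempotent if the repository is parsed again.
    tx.run(
        """
        MERGE (a:Function {id: $caller_id})
          SET a.file_path = $caller_file, a.is_entry_point = $is_entry
        MERGE (b:Function {id: $callee_id})
          SET b.file_path = $callee_file
        MERGE (a)-[:CALLS]->(b)
        """,
        caller_id=caller["id"], caller_file=caller["file"],
        is_entry=caller.get("is_entry_point", False),
        callee_id=callee["id"], callee_file=callee["file"],
    )

with driver.session() as session:
    # Hypothetical example: an API handler calling into a service layer.
    session.execute_write(
        add_call_edge,
        {"id": "api.create_document", "file": "routes/document.py", "is_entry_point": True},
        {"id": "services.document.create", "file": "services/document.py"},
    )
driver.close()
```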

An alternative system we experimented with:

  • Iteratively mapping function → class → file relationships so that the RAG system has data about every function, class, and file, and can provide much more specific answers.
    The cons associated with this approach were that:
    - It was slow
    - It was expensive
    - It caused us to hit rate limits on larger repositories

    For a use case that only involves understanding existing flows to derive context from them, moving to a system that generates a flow-level inference, covering both the functions constituting a flow and the flow's overall intent, provided the right trade-off between time, cost, and quality (sketched below).
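
A minimal sketch of this two-level inference, assuming an OpenAI chat model accessed through Langchain; the prompts and model choice are illustrative, not the project's exact ones.

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def explain_function(code: str) -> str:
    # First level: a natural language explanation per function in the flow.
    prompt = (
        "Explain this function's inputs, response, business logic and "
        "exceptions in plain English:\n\n" + code
    )
    return llm.invoke(prompt).content

def infer_intent(endpoint: str, explanations: list[str]) -> str:
    # Second level: one line of intent for the whole flow.
    prompt = (
        f"These are explanations of the functions behind {endpoint}. "
        "Summarize the endpoint's purpose in a single line:\n\n"
        + "\n\n".join(explanations)
    )
    return llm.invoke(prompt).content
```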

System Architecture:

  • Key components:

    • Graph database (Neo4j): stores nodes representing functions and edges representing call relationships (see the traversal sketch after this list)

    • Postgres database:

      • Endpoints table with API paths and identifiers

      • Inferences table with intent extracted from explanations

      • Explanations table with natural language descriptions of functions

      • Pydantic table with data model definitions

    • Relevant Python services:
      These services are part of the broader getmomentum/momentum-core project, a code behaviour auditor that aims to analyse code behaviour on every git push.

      • flow.py: Traverses the full project graph to generate explanations for each entry point upfront

      • knowledge_graph.py: Loads Postgres data into a vector database and provides a method to query it.

      • tools.py: Defines Langchain tools to fetch code, fetch Pydantic definitions, and query the knowledge graph. Code generation agents use these tools to gather pointed context.
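
As a rough illustration of the traversal flow.py performs, the following sketch collects every function reachable from an entry point, using the same assumed `Function`/`CALLS` schema as the earlier Neo4j example.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def functions_in_flow(tx, entry_id: str) -> list[dict]:
    # The variable-length pattern *0.. walks every CALLS edge reachable
    # from the entry point and includes the entry point function itself.
    result = tx.run(
        """
        MATCH (entry:Function {id: $entry_id})-[:CALLS*0..]->(fn:Function)
        RETURN DISTINCT fn.id AS id, fn.file_path AS file_path
        """,
        entry_id=entry_id,
    )
    return [record.data() for record in result]

with driver.session() as session:
    # Hypothetical entry point id from the earlier sketch.
    flow = session.execute_read(functions_in_flow, "api.create_document")
driver.close()
```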

Flow for Generating Complete Code Context

  1. In a preprocessing step, flow.py is triggered for the entire project codebase right after parsing and endpoint detection are complete.

  2. It captures all the functions, classes, and entry points defined in the system.

  3. For each entry point:

    • It fetches the code for all functions in the flow from GitHub (we do not want to store code in our DB).

    • It generates a detailed natural language explanation of each function in the flow, capturing inputs, responses, business logic, and exceptions.

    • It summarizes the overall intent of the endpoint from the function explanations. This is a single-line statement of the purpose of each API.

  4. The endpoint paths, inferred intents, function explanations, and Pydantic models are loaded into the vector DB. We are using Weaviate as our vector DB of choice.

  5. To load this data into Weaviate and query it, we would ideally chunk, embed, set overlap, define a query algorithm, and so on, but for simplicity we use Embedchain to build the RAG tool.

  6. When a user requests code generation for a specific entry point:

    • The ReAct agent uses tools (sketched after this list) to:

      • Directly fetch relevant code and Pydantic definitions for that entry point

      • Query the vector DB with natural language to retrieve explanations and metadata for that endpoint and its functions

  7. Equipped with granular code and high-level intent, the agent generates the requested code.
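
Steps 4-6 might look roughly like the sketch below, assuming Embedchain's default pipeline (it can be configured to use Weaviate as its vector store) and Langchain's `@tool` decorator. The loaded text and the `PYDANTIC_DEFS` lookup are illustrative stand-ins for the real Postgres-backed data.

```python
from embedchain import App
from langchain.tools import tool

rag = App()  # RAG app over the vector DB holding intents and explanations

# Steps 4-5: load an endpoint's inferred intent and function explanations.
# The content here is a made-up example, not real project data.
rag.add(
    "POST /document/ : creates a document. The handler validates the payload "
    "and persists it through the repository layer.",
    data_type="text",
)

# Illustrative stand-in for a lookup against the Postgres Pydantic table.
PYDANTIC_DEFS = {"POST /document/": "class DocumentCreate(BaseModel): ..."}

@tool
def get_pydantic_definitions(endpoint: str) -> str:
    """Return the Pydantic model definitions used by an endpoint's flow."""
    return PYDANTIC_DEFS.get(endpoint, "no definitions found")

@tool
def ask_knowledge_graph(question: str) -> str:
    """Answer a natural-language question about the codebase's flows."""
    return rag.query(question)
```

The two decorated functions can then be passed in the tool list when constructing the ReAct agent.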

Benefits

  • Generating explanations for the full codebase upfront allows context to be served rapidly for any entry point.

  • Vector search enables dynamically retrieving only the most relevant explanations for the entry point of interest

  • Having context on other flows helps generate better results. A question like "How can I create a document?" can be answered with a reference to the POST /document/ API and the file where it lives. Queries like this help agents with tasks such as writing a test for a DELETE /document/{id} endpoint, which might need to insert a document first (see the sketch below).
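
For instance, the cross-flow lookup from the last bullet might look like this, reusing the illustrative Embedchain app from the earlier sketch.

```python
from embedchain import App

rag = App()  # assumes the flow explanations were loaded as sketched above

# While preparing a test for DELETE /document/{id}, the agent first asks how
# documents are created so the test can insert one as setup.
answer = rag.query("How can I create a document?")
print(answer)  # expected to point at POST /document/ and its file path
```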

Conclusion

  • Mapping full codebases through the lens of entry points into vector databases provides a powerful tool for AI code generation agents

  • This approach equips language models with broad yet accessible context to generate code that respects the structure and intent of the full system

  • Techniques like this will be key to enabling agents to program by understanding entire codebases rather than just snippets in isolation