Natural language developer queries sound like a toy until you watch someone spend ten minutes answering a question the platform already knows:
- "Who owns the checkout service?"
- "Where is the Terraform for staging Redis?"
- "What changed before the payments incident?"
- "Which services still point at the old Kafka cluster?"
- "Where is the runbook for rotating this credential?"
Most engineering organizations already have the data. The problem is that the data is scattered across GitHub, Backstage, Terraform, CI logs, deployment systems, incident tools, docs, and chat history. Developers do not want another dashboard. They want the answer, the source, and the confidence level.
That is where a LangChain-based prototype can be useful. Not because LangChain is magic, and certainly not because a language model should be allowed to invent infrastructure facts. The useful version is more boring: retrieve the relevant developer metadata, pass only that context to a model, return a grounded answer, and show the links that justify it.
In this article, we will build the shape of a full-stack prototype for natural language developer queries. The goal is not to ship a production platform in one blog post. The goal is to build a credible first slice that teaches the right architecture: ingest, normalize, index, retrieve, answer, cite, observe, and evaluate.
If you are thinking about this in the context of an internal developer portal, you may also want to read Beyond Git: Using LLMs to Power Your Internal Developer Portals. That article talks more about the portal strategy. This one stays closer to the prototype implementation.
What We Are Building
We are going to build a small developer metadata question-answering service:
- A JSON data source that represents service metadata.
- An ingestion script that turns service records into searchable documents.
- A Chroma vector store backed by OpenAI embeddings.
- A LangChain retrieval chain that answers questions using retrieved context.
- A FastAPI endpoint that exposes the query engine.
- A few practical guardrails so the prototype does not lie with confidence.
The example questions are intentionally operational:
Who owns checkout-service?
Where is the Terraform for the staging database?
Which services deployed in the last 24 hours?
What runbook should I use for payment retries?
The important design constraint is this: the model should not be the database. It should be the language layer over data you control.
Why Not Just Use SQL?
If all the data lived in a clean relational model, SQL would be better. For questions like "which services deployed in the last 24 hours," a structured query against a deployment table beats semantic search every time.
Real developer metadata is messier than that:
- Service ownership may live in Backstage YAML.
- Runbooks may live in Markdown.
- Terraform module names may live in code.
- Deployment events may come from CI/CD systems.
- Incident summaries may live in ticketing systems.
- Engineers may use three names for the same service.
Natural language querying helps when the developer does not know where to look or what exact field name to search. It is less useful when the question is already well-structured. A good system should eventually combine both: semantic retrieval for fuzzy discovery, structured queries for facts that need precision.
That distinction matters because it keeps the prototype honest.
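One way to picture the split is a small router in front of the two paths. The sketch below is an illustration under assumptions, not part of the prototype: `route`, `deployed_since`, and the `STRUCTURED_HINTS` keyword list are hypothetical names, and the records are assumed to carry an ISO 8601 `last_deploy` timestamp like the sample data used later in this article.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical phrases that suggest a precise, structured question.
STRUCTURED_HINTS = ("deployed in the last", "deployed since")


def parse_timestamp(value: str) -> datetime:
    # Normalize a trailing "Z" so fromisoformat accepts it on older Pythons.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def deployed_since(services: list[dict], now: datetime, hours: int = 24) -> list[str]:
    """Structured path: filter records by deployment time, no model involved."""
    cutoff = now - timedelta(hours=hours)
    return [
        service["name"]
        for service in services
        if parse_timestamp(service["last_deploy"]) >= cutoff
    ]


def route(question: str) -> str:
    """Pick a path for the question: 'structured' or 'semantic'."""
    lowered = question.lower()
    if any(hint in lowered for hint in STRUCTURED_HINTS):
        return "structured"
    return "semantic"
```

A keyword router like this is crude, and a real system would likely need something better. It is still enough to make the point: time-window questions never need to touch the vector index.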
Project Layout
Here is a small but realistic layout:
dev-query/
  app/
    api.py
    ingest.py
    query.py
    settings.py
  data/
    services.json
  storage/
    chroma/
  pyproject.toml
For a prototype, keep the pieces boring. Do not start with Slack, a browser UI, single sign-on, and five data sources. Start with one data source and prove that the answer quality is worth more investment.
Install The Dependencies
LangChain has moved toward separate integration packages. For this prototype, install the core packages, OpenAI integration, Chroma integration, FastAPI, and Uvicorn:
python -m venv .venv
source .venv/bin/activate
pip install \
  langchain \
  langchain-openai \
  langchain-chroma \
  chromadb \
  fastapi \
  uvicorn \
  pydantic-settings
Then set the model provider key:
export OPENAI_API_KEY="..."
Do not bake API keys into config files, Docker images, source examples, or screenshots. Internal developer tools have a bad habit of becoming production tools after everyone has forgotten the prototype shortcuts.
Create A Small Metadata Source
Start with data/services.json:
[
  {
    "name": "checkout-service",
    "owner": "team-payments",
    "slack": "#team-payments",
    "pagerduty": "payments-primary",
    "lifecycle": "production",
    "repo": "github.com/example/checkout-service",
    "docs": "https://internal.example.com/runbooks/checkout",
    "infra": "terraform/services/checkout/rds.tf",
    "last_deploy": "2026-06-08T16:22:00Z",
    "summary": "Handles cart checkout, payment authorization, and order handoff."
  },
  {
    "name": "catalog-api",
    "owner": "team-commerce-platform",
    "slack": "#commerce-platform",
    "pagerduty": "commerce-platform-primary",
    "lifecycle": "production",
    "repo": "github.com/example/catalog-api",
    "docs": "https://internal.example.com/runbooks/catalog",
    "infra": "terraform/services/catalog/opensearch.tf",
    "last_deploy": "2026-06-07T21:10:00Z",
    "summary": "Serves product catalog search and detail APIs."
  }
]
This is deliberately simple, but notice what it includes:
- Ownership.
- Communication channel.
- On-call reference.
- Repository.
- Runbook URL.
- Infrastructure path.
- Deployment timestamp.
- Human-readable summary.
Those fields are the difference between a useful developer assistant and a parlor trick. If the source data is thin, the answer will be thin.
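Because thin records produce thin answers, it can help to check the catalog before indexing it. The following is a minimal sketch, assuming the field names from the sample JSON above; `find_metadata_gaps` and the `PLACEHOLDERS` set are hypothetical names chosen for illustration.

```python
# Fields taken from the sample service records in this article.
REQUIRED_FIELDS = (
    "name", "owner", "slack", "pagerduty", "lifecycle",
    "repo", "docs", "infra", "last_deploy", "summary",
)

# Assumed placeholder values that should count as missing.
PLACEHOLDERS = {"", "unknown", "tbd", "n/a"}


def find_metadata_gaps(service: dict) -> list[str]:
    """Return the required fields that are absent or hold a placeholder."""
    gaps = []
    for field in REQUIRED_FIELDS:
        value = str(service.get(field, "")).strip().lower()
        if value in PLACEHOLDERS:
            gaps.append(field)
    return gaps
```

Running this over every record before ingestion gives a quick list of which services will produce weak answers, and which teams to ask for better metadata.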
Settings
Create app/settings.py:
from pathlib import Path

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    data_path: Path = Path("data/services.json")
    persist_directory: Path = Path("storage/chroma")
    collection_name: str = "developer_services"
    embedding_model: str = "text-embedding-3-small"
    chat_model: str = "gpt-4.1-mini"


settings = Settings()
For a real internal system, configuration should include environment-specific data paths, auth settings, logging configuration, and model routing. For a prototype, this is enough.
Ingest The Data
Create app/ingest.py:
import json

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

from app.settings import settings


def service_to_document(service: dict) -> Document:
    # Flatten one catalog record into the text that gets embedded.
    text = f"""
Service: {service["name"]}
Owner: {service["owner"]}
Slack: {service["slack"]}
PagerDuty: {service["pagerduty"]}
Lifecycle: {service["lifecycle"]}
Repository: {service["repo"]}
Documentation: {service["docs"]}
Infrastructure: {service["infra"]}
Last deploy: {service["last_deploy"]}
Summary: {service["summary"]}
""".strip()

    # Keep the fields needed for citations as metadata.
    return Document(
        page_content=text,
        metadata={
            "service": service["name"],
            "owner": service["owner"],
            "repo": service["repo"],
            "docs": service["docs"],
            "source_type": "service_catalog",
        },
    )


def rebuild_index() -> None:
    services = json.loads(settings.data_path.read_text())
    documents = [service_to_document(service) for service in services]
    embeddings = OpenAIEmbeddings(model=settings.embedding_model)

    # Stable IDs let a re-run overwrite existing entries instead of
    # appending duplicates to the persisted collection.
    Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        ids=[service["name"] for service in services],
        collection_name=settings.collection_name,
        persist_directory=str(settings.persist_directory),
    )


if __name__ == "__main__":
    rebuild_index()
    print(f"Indexed services from {settings.data_path}")
Run it:
python -m app.ingest
This builds a local Chroma index. In production, you would probably ingest from Backstage, GitHub, CI, Terraform state, docs, and incident tooling on a schedule. But do not start there. Start with one source, learn what answer quality looks like, then add sources intentionally.
Build The Query Chain
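Older LangChain examples often use RetrievalQA. That pattern still appears in plenty of tutorials, but the current direction is to compose retrieval and generation more explicitly. That is good. You want the prompt, retriever, and answer behavior to be visible. The helper functions imported below ship in the 0.x releases of the langchain package; if your installed release no longer provides langchain.chains, pin an earlier release or check where the legacy chain helpers moved.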
Create app/query.py:
from functools import lru_cache

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from app.settings import settings

SYSTEM_PROMPT = """
You are an internal developer metadata assistant.

Answer only from the retrieved context. If the context does not contain the
answer, say that you do not know and suggest the next system to check.

Return:
- answer: concise direct answer
- sources: service names, docs URLs, repos, or infra paths used
- confidence: high, medium, or low

Do not invent owners, on-call rotations, deployment times, repositories, or
infrastructure paths.

Context:
{context}
"""


@lru_cache(maxsize=1)
def build_chain():
    # Build the chain once per process instead of on every question.
    embeddings = OpenAIEmbeddings(model=settings.embedding_model)
    vector_store = Chroma(
        collection_name=settings.collection_name,
        persist_directory=str(settings.persist_directory),
        embedding_function=embeddings,
    )
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})

    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", SYSTEM_PROMPT),
            ("human", "{input}"),
        ]
    )
    llm = ChatOpenAI(model=settings.chat_model, temperature=0)

    # Stuff the retrieved documents into {context}, then answer.
    document_chain = create_stuff_documents_chain(llm, prompt)
    return create_retrieval_chain(retriever, document_chain)


def ask(question: str) -> dict:
    chain = build_chain()
    result = chain.invoke({"input": question})
    context = result.get("context", [])

    return {
        "question": question,
        "answer": result["answer"],
        "context_count": len(context),
        # Return document metadata so callers can show where the answer came from.
        "sources": [document.metadata for document in context],
    }
The prompt does a few important things:
- It tells the model to answer only from retrieved context.
- It makes uncertainty acceptable.
- It asks for sources.
- It forbids invented operational facts.
That will not eliminate hallucinations, but it makes the desired behavior explicit and reviewable. This is the same basic discipline behind writing secure prompts for developer workflows: scope the task, constrain the output, and make failure modes visible.
Add A CLI For Fast Testing
Before building an API, add the cheapest possible interface. Create cli.py in the project root:
from app.query import ask

while True:
    question = input("dev-query> ").strip()
    if question in {"exit", "quit"}:
        break
    if not question:
        continue

    result = ask(question)
    print()
    print(result["answer"])
    print()
    print("Sources:")
    for source in result["sources"]:
        print(f"- {source}")
    print()
Run it:
python cli.py
Try:
dev-query> Who owns checkout-service?
dev-query> Where is the checkout infrastructure?
dev-query> What is the on-call policy for search?
That last question should probably fail or return low confidence unless your context actually includes the answer. A useful prototype is not one that answers everything. A useful prototype knows when the indexed data is insufficient.
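The prompt alone cannot enforce that behavior, so one option is to abstain before the model is ever called when retrieval comes back weak. The sketch below assumes you have (document, score) pairs with higher scores meaning more relevant, for example from a vector store method such as `similarity_search_with_relevance_scores`. The names `select_context` and `answer_or_abstain` and the 0.3 threshold are illustrative assumptions, not tuned values.

```python
from typing import Callable


def select_context(
    scored_docs: list[tuple[str, float]],
    min_score: float = 0.3,  # arbitrary starting point, tune against your data
    limit: int = 4,
) -> list[str]:
    """Keep only documents whose relevance score clears the threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:limit]]


def answer_or_abstain(
    scored_docs: list[tuple[str, float]],
    generate: Callable[[list[str]], dict],
) -> dict:
    """Call the model only when there is usable context; otherwise abstain."""
    context = select_context(scored_docs)
    if not context:
        return {
            "answer": "I do not know. The indexed data does not cover this.",
            "confidence": "low",
            "sources": [],
        }
    return generate(context)
```

Here `generate` stands in for whatever produces the final answer from the selected context. Keeping it as a parameter makes the abstain logic testable without a model call.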
Expose It With FastAPI
Once the CLI works, add app/api.py:
from fastapi import FastAPI
from pydantic import BaseModel, Field

from app.query import ask


class QueryRequest(BaseModel):
    question: str = Field(min_length=3, max_length=500)


class QueryResponse(BaseModel):
    question: str
    answer: str
    context_count: int
    sources: list[dict]


app = FastAPI(title="Developer Metadata Query API")


@app.post("/ask", response_model=QueryResponse)
def ask_question(request: QueryRequest) -> dict:
    return ask(request.question)
Run it:
uvicorn app.api:app --reload
Then call it:
curl -s http://127.0.0.1:8000/ask \
  -H 'content-type: application/json' \
  -d '{"question":"Who owns checkout-service and where are its docs?"}' | jq
At this point, you have a real prototype boundary:
- Ingestion can run separately.
- Querying can be tested independently.
- The API has request and response schemas.
- A future Slack bot or internal portal can call the same backend.
That is already better than stuffing a model call into a random bot handler and hoping nobody asks where the data came from.
What To Do About Slack
Slack is a good interface and a bad first architecture.
It is a good interface because developers already ask operational questions there. It is a bad first architecture because Slack-specific concerns can quickly bury the retrieval problem: permissions, retries, event signatures, ephemeral messages, slash command UX, rate limits, and response timeouts.
Build the query API first. Then wire Slack to the API.
A slash command should do roughly this:
- Acknowledge the command quickly.
- Send the question to your internal query API.
- Return a concise answer.
- Include links to sources.
- Avoid posting sensitive answers into public channels.
That last point is not optional. Developer metadata can expose service topology, incident history, internal repositories, and team responsibilities. Treat it as internal data, not public trivia.
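As a rough illustration of the last three steps, here is a sketch of turning the query API's response into a slash command reply. It assumes the response shape returned by `ask` earlier in this article and uses Slack's `response_type` field set to `ephemeral` so the reply is visible only to the person who asked. `format_slack_reply` is a hypothetical helper; request verification and the quick acknowledgement are left out.

```python
def format_slack_reply(result: dict, max_sources: int = 3) -> dict:
    """Build a slash command response body from a query API result."""
    lines = [result["answer"]]

    # Collect unique docs and repo links from the source metadata.
    links: list[str] = []
    for source in result.get("sources", []):
        for key in ("docs", "repo"):
            value = source.get(key)
            if value and value not in links:
                links.append(value)

    if links:
        lines.append("Sources:")
        lines.extend(f"- {link}" for link in links[:max_sources])

    # "ephemeral" keeps the answer out of the public channel.
    return {"response_type": "ephemeral", "text": "\n".join(lines)}
```

Defaulting to an ephemeral reply is the conservative choice. Letting a user explicitly share an answer with the channel can be added later as a deliberate action.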
Evaluation: The Part Everyone Skips
The prototype is not done when it returns a pretty answer. It is done when you have some idea whether the answer is right.
Create a small evaluation set:
[
  {
    "question": "Who owns checkout-service?",
    "must_include": ["team-payments"],
    "must_not_include": ["team-commerce-platform"]
  },
  {
    "question": "Where is the checkout Terraform?",
    "must_include": ["terraform/services/checkout/rds.tf"]
  },
  {
    "question": "Who owns a service named pricing-v3?",
    "must_include": ["do not know"]
  }
]
Then run it every time you change the prompt, model, chunking strategy, or data source. You do not need a fancy evaluation platform on day one. A small script that catches obvious regressions is enough to keep you honest.
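A small script along those lines might look like the following sketch. It assumes the evaluation cases use the `must_include` and `must_not_include` keys shown above and that the function under test returns a dict with an `answer` string, as `ask` does in this article. `run_eval` is a hypothetical name, and the substring matching is deliberately simple.

```python
from typing import Callable


def run_eval(cases: list[dict], ask: Callable[[str], dict]) -> list[dict]:
    """Run every case through `ask` and return the ones that fail."""
    failures = []
    for case in cases:
        answer = ask(case["question"])["answer"].lower()

        # Case-insensitive substring checks against the answer text.
        missing = [
            term for term in case.get("must_include", [])
            if term.lower() not in answer
        ]
        forbidden = [
            term for term in case.get("must_not_include", [])
            if term.lower() in answer
        ]

        if missing or forbidden:
            failures.append(
                {
                    "question": case["question"],
                    "missing": missing,
                    "forbidden": forbidden,
                }
            )
    return failures
```

Passing `ask` in as a parameter means the same runner works against the real chain or against a stub, which keeps the check cheap enough to run on every change.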
Later, you can add:
- Relevance scoring for retrieved documents.
- Human review of low-confidence answers.
- Trace logging with LangSmith or another observability tool.
- Per-source freshness checks.
- Feedback buttons in Slack or the portal UI.
Without evaluation, you are just vibe-testing infrastructure answers. That is not a great career strategy.
Production Considerations
There are several gaps between this prototype and a production internal developer assistant.
Permissions
The prototype retrieves from one local dataset. A real system needs per-user authorization. If a developer cannot see a repository, incident, document, or deployment record directly, the assistant should not reveal it indirectly.
This is the part teams underestimate. Retrieval systems can become permission laundering systems if you index everything into one bucket and forget who is allowed to see what.
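To make the idea concrete, here is a minimal sketch of filtering retrieved documents against what the caller may see before any of them reach the model. Documents are represented as plain dicts with a `metadata` key as a stand-in for retrieved documents, and `allowed_services` is assumed to come from your authorization system. Both that set and the `filter_visible` helper are hypothetical.

```python
def filter_visible(documents: list[dict], allowed_services: set[str]) -> list[dict]:
    """Drop retrieved documents the caller is not allowed to see."""
    visible = []
    for document in documents:
        service = document.get("metadata", {}).get("service")
        # Deny by default: documents without a service tag stay hidden.
        if service is not None and service in allowed_services:
            visible.append(document)
    return visible
```

Filtering after retrieval is the simplest version. Pushing the same constraint into the vector store query is usually preferable, since restricted content then never leaves the index for that user.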
Freshness
Developer metadata goes stale quickly. Ownership changes. Services move. Runbooks get replaced. Deployment state changes hourly.
For structured data, prefer live API calls when freshness matters. For docs and runbooks, scheduled re-indexing may be fine. For deployment status, a vector index may be the wrong tool entirely.
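One way to act on this is to record when each document was indexed and apply a per-source age limit at query time. The sketch below assumes an `indexed_at` metadata field holding an ISO 8601 timestamp with a UTC offset, which the prototype's ingestion script does not currently write. The `MAX_AGE` values and the `is_stale` helper are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumed freshness budgets per source type; adjust to your own tolerance.
MAX_AGE = {
    "service_catalog": timedelta(days=1),
    "runbook": timedelta(days=7),
}
DEFAULT_MAX_AGE = timedelta(days=1)


def is_stale(metadata: dict, now: datetime) -> bool:
    """Return True when a document is older than its source type allows."""
    indexed_at = metadata.get("indexed_at")
    if indexed_at is None:
        # Unknown age is treated as stale rather than trusted.
        return True
    max_age = MAX_AGE.get(metadata.get("source_type"), DEFAULT_MAX_AGE)
    return now - datetime.fromisoformat(indexed_at) > max_age
```

Stale documents do not have to be dropped. Lowering the reported confidence or adding a note about the document's age may be enough for runbooks, while deployment facts are better served by a live lookup.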
Source Quality
The model cannot rescue bad metadata. If half your service catalog has owner: unknown, your assistant will mostly automate disappointment.
Use the prototype to expose metadata quality problems. That may be the biggest organizational value of the project.
Observability
Log the question, retrieved document IDs, model version, prompt version, latency, and whether the answer was rated useful. Do not log secrets. Do not log raw private content unless you have a clear retention policy.
Prompt and model changes should be versioned. Otherwise, when the assistant starts giving worse answers, you will have no idea why.
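A structured log line per query is usually enough to start. The following sketch builds one JSON record with the fields listed above. `build_query_log` and its parameter names are hypothetical, and it records document IDs rather than document content to stay on the safe side of the retention concern.

```python
import json


def build_query_log(
    question: str,
    document_ids: list[str],
    model: str,
    prompt_version: str,
    started_at: float,
    finished_at: float,
) -> str:
    """Serialize one query's trace data as a single JSON log line."""
    record = {
        "question": question,
        # IDs only: raw document content stays out of the logs.
        "document_ids": document_ids,
        "model": model,
        "prompt_version": prompt_version,
        "latency_ms": round((finished_at - started_at) * 1000),
        # Filled in later if the user rates the answer.
        "rated_useful": None,
    }
    return json.dumps(record, sort_keys=True)
```

The timestamps are assumed to come from something like `time.monotonic()` captured before and after the chain call, and the prompt version is whatever label you assign when the system prompt changes.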
Where LangChain Helps
LangChain is useful here because it gives you a common vocabulary for prompts, retrievers, models, documents, chains, and integrations. It also makes it easier to swap pieces while you are learning.
But the framework is not the product. The product is the engineering workflow:
- Can developers find the right owner faster?
- Can new hires discover systems without interrupting three people?
- Can incident responders find the relevant runbook under pressure?
- Can platform teams see where metadata is missing?
If LangChain helps you answer those questions faster, use it. If a smaller custom service would do the job better, use that. The architecture should serve the workflow, not the other way around.
Conclusion
A natural language developer query tool is worth building when it is grounded in real metadata, honest about uncertainty, and connected to the systems engineers already use.
The prototype in this article is intentionally modest: JSON metadata, Chroma, OpenAI embeddings, a LangChain retrieval chain, and FastAPI. That is enough to learn the important lessons without burying yourself in platform work.
Start small. Keep sources visible. Add evaluation earlier than feels necessary. Respect permissions. Do not let the model invent operational facts. If the first version mostly teaches your organization that its metadata is incomplete, that is still useful information.
The goal is not an all-knowing AI teammate. The goal is a practical developer tool that turns scattered engineering metadata into answers people can verify.
For more practical engineering and developer tooling notes, visit Slaptijack.