Building a Full-Stack LangChain Prototype for Natural Language Developer Queries

Posted in Programming

Natural language developer queries sound like a toy until you watch someone spend ten minutes answering a question the platform already knows:

  • "Who owns the checkout service?"
  • "Where is the Terraform for staging Redis?"
  • "What changed before the payments incident?"
  • "Which services still point at the old Kafka cluster?"
  • "Where is the runbook for rotating this credential?"

Most engineering organizations already have the data. The problem is that the data is scattered across GitHub, Backstage, Terraform, CI logs, deployment systems, incident tools, docs, and chat history. Developers do not want another dashboard. They want the answer, the source, and the confidence level.

That is where a LangChain-based prototype can be useful. Not because LangChain is magic, and certainly not because a language model should be allowed to invent infrastructure facts. The useful version is more boring: retrieve the relevant developer metadata, pass only that context to a model, return a grounded answer, and show the links that justify it.

In this article, we will build the shape of a full-stack prototype for natural language developer queries. The goal is not to ship a production platform in one blog post. The goal is to build a credible first slice that teaches the right architecture: ingest, normalize, index, retrieve, answer, cite, observe, and evaluate.

If you are thinking about this in the context of an internal developer portal, you may also want to read "Beyond Git: Using LLMs to Power Your Internal Developer Portals." That article talks more about the portal strategy. This one stays closer to the prototype implementation.

What We Are Building

We are going to build a small developer metadata question-answering service:

  1. A JSON data source that represents service metadata.
  2. An ingestion script that turns service records into searchable documents.
  3. A Chroma vector store backed by OpenAI embeddings.
  4. A LangChain retrieval chain that answers questions using retrieved context.
  5. A FastAPI endpoint that exposes the query engine.
  6. A few practical guardrails so the prototype does not lie with confidence.

The example questions are intentionally operational:

Who owns checkout-service?
Where is the Terraform for the staging database?
Which services deployed in the last 24 hours?
What runbook should I use for payment retries?

The important design constraint is this: the model should not be the database. It should be the language layer over data you control.

Why Not Just Use SQL?

If all the data lived in a clean relational model, SQL would be better. For questions like "which services deployed in the last 24 hours," a structured query against a deployment table beats semantic search every time.

Real developer metadata is messier than that:

  • Service ownership may live in Backstage YAML.
  • Runbooks may live in Markdown.
  • Terraform module names may live in code.
  • Deployment events may come from CI/CD systems.
  • Incident summaries may live in ticketing systems.
  • Engineers may use three names for the same service.

Natural language querying helps when the developer does not know where to look or what exact field name to search. It is less useful when the question is already well-structured. A good system should eventually combine both: semantic retrieval for fuzzy discovery, structured queries for facts that need precision.

That distinction matters because it keeps the prototype honest.
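
One way to keep that split visible is a small router in front of the retrieval chain. The sketch below is a rough heuristic and not part of LangChain: the route_question function and the keyword patterns are assumptions made for illustration, and a real system would likely need something more robust than regular expressions.

```python
import re

# Hypothetical patterns that suggest a precise, time-bounded or countable
# question. They are illustrative assumptions, not a tested taxonomy.
STRUCTURED_PATTERNS = [
    r"\blast \d+ (minutes?|hours?|days?|weeks?)\b",
    r"\b(since|before|after) (yesterday|today|\d{4}-\d{2}-\d{2})\b",
    r"\bhow many\b",
    r"\bdeployed\b",
]


def route_question(question: str) -> str:
    """Return "structured" for precise questions, "semantic" otherwise."""
    lowered = question.lower()
    for pattern in STRUCTURED_PATTERNS:
        if re.search(pattern, lowered):
            return "structured"
    return "semantic"
```

With these patterns, "Which services deployed in the last 24 hours?" goes to the structured path and "Who owns checkout-service?" goes to semantic retrieval. The rest of this article only builds the semantic path.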

Project Layout

Here is a small but realistic layout:

dev-query/
  app/
    api.py
    ingest.py
    query.py
    settings.py
  data/
    services.json
  storage/
    chroma/
  pyproject.toml

For a prototype, keep the pieces boring. Do not start with Slack, a browser UI, single sign-on, and five data sources. Start with one data source and prove that the answer quality is worth more investment.

Install The Dependencies

LangChain has moved toward separate integration packages. For this prototype, install the core packages, OpenAI integration, Chroma integration, FastAPI, and Uvicorn:

python -m venv .venv
source .venv/bin/activate

pip install \
  langchain \
  langchain-openai \
  langchain-chroma \
  chromadb \
  fastapi \
  uvicorn \
  pydantic-settings

Then set the model provider key:

export OPENAI_API_KEY="..."

Do not bake API keys into config files, Docker images, source examples, or screenshots. Internal developer tools have a bad habit of becoming production tools after everyone has forgotten the prototype shortcuts.

Create A Small Metadata Source

Start with data/services.json:

[
  {
    "name": "checkout-service",
    "owner": "team-payments",
    "slack": "#team-payments",
    "pagerduty": "payments-primary",
    "lifecycle": "production",
    "repo": "github.com/example/checkout-service",
    "docs": "https://internal.example.com/runbooks/checkout",
    "infra": "terraform/services/checkout/rds.tf",
    "last_deploy": "2026-06-08T16:22:00Z",
    "summary": "Handles cart checkout, payment authorization, and order handoff."
  },
  {
    "name": "catalog-api",
    "owner": "team-commerce-platform",
    "slack": "#commerce-platform",
    "pagerduty": "commerce-platform-primary",
    "lifecycle": "production",
    "repo": "github.com/example/catalog-api",
    "docs": "https://internal.example.com/runbooks/catalog",
    "infra": "terraform/services/catalog/opensearch.tf",
    "last_deploy": "2026-06-07T21:10:00Z",
    "summary": "Serves product catalog search and detail APIs."
  }
]

This is deliberately simple, but notice what it includes:

  • Ownership.
  • Communication channel.
  • On-call reference.
  • Repository.
  • Runbook URL.
  • Infrastructure path.
  • Deployment timestamp.
  • Human-readable summary.

Those fields are the difference between a useful developer assistant and a parlor trick. If the source data is thin, the answer will be thin.

Settings

Create app/settings.py:

from pathlib import Path

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    data_path: Path = Path("data/services.json")
    persist_directory: Path = Path("storage/chroma")
    collection_name: str = "developer_services"
    embedding_model: str = "text-embedding-3-small"
    chat_model: str = "gpt-4.1-mini"


settings = Settings()

For a real internal system, configuration should include environment-specific data paths, auth settings, logging configuration, and model routing. For a prototype, this is enough.

Ingest The Data

Create app/ingest.py:

import json

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

from app.settings import settings


def service_to_document(service: dict) -> Document:
    text = f"""
Service: {service["name"]}
Owner: {service["owner"]}
Slack: {service["slack"]}
PagerDuty: {service["pagerduty"]}
Lifecycle: {service["lifecycle"]}
Repository: {service["repo"]}
Documentation: {service["docs"]}
Infrastructure: {service["infra"]}
Last deploy: {service["last_deploy"]}
Summary: {service["summary"]}
""".strip()

    return Document(
        page_content=text,
        metadata={
            "service": service["name"],
            "owner": service["owner"],
            "repo": service["repo"],
            "docs": service["docs"],
            "source_type": "service_catalog",
        },
    )


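def rebuild_index() -> None:
    services = json.loads(settings.data_path.read_text())
    documents = [service_to_document(service) for service in services]

    embeddings = OpenAIEmbeddings(model=settings.embedding_model)

    # Stable IDs let repeated runs update records in place. Without them,
    # each run appends a fresh copy of every document to the collection.
    # Services removed from the JSON file are not deleted by this call.
    Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        ids=[service["name"] for service in services],
        collection_name=settings.collection_name,
        persist_directory=str(settings.persist_directory),
    )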


if __name__ == "__main__":
    rebuild_index()
    print(f"Indexed services from {settings.data_path}")

Run it:

python -m app.ingest

This builds a local Chroma index. In production, you would probably ingest from Backstage, GitHub, CI, Terraform state, docs, and incident tooling on a schedule. But do not start there. Start with one source, learn what answer quality looks like, then add sources intentionally.
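
Once ingestion runs on a schedule, re-embedding every record on every run gets wasteful. A minimal sketch of one way to avoid that, assuming you keep a mapping of document ID to content hash from the previous run: the document_id, content_hash, and changed_services helpers are hypothetical names, and where you store the previous hashes is left open.

```python
import hashlib
import json


def document_id(service: dict) -> str:
    """Stable ID derived from the source type and the service name."""
    return f"service_catalog:{service['name']}"


def content_hash(service: dict) -> str:
    """Hash the record with sorted keys so key order does not matter."""
    canonical = json.dumps(service, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_services(services: list[dict], previous: dict[str, str]) -> list[dict]:
    """Return records whose hash differs from the previously stored hash."""
    return [
        service
        for service in services
        if previous.get(document_id(service)) != content_hash(service)
    ]
```

Only the records returned by changed_services need new embeddings. New services show up automatically because they have no previous hash.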

Build The Query Chain

Older LangChain examples often use RetrievalQA. That pattern still appears in plenty of tutorials, but the current direction is to compose retrieval and generation more explicitly. That is good. You want the prompt, retriever, and answer behavior to be visible.

Create app/query.py:

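# These import paths match LangChain 0.2/0.3. If they fail on a newer release,
# check whether the chain helpers have moved to the langchain-classic package.
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain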
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from app.settings import settings


SYSTEM_PROMPT = """
You are an internal developer metadata assistant.

Answer only from the retrieved context. If the context does not contain the
answer, say that you do not know and suggest the next system to check.

Return:
- answer: concise direct answer
- sources: service names, docs URLs, repos, or infra paths used
- confidence: high, medium, or low

Do not invent owners, on-call rotations, deployment times, repositories, or
infrastructure paths.

Context:
{context}
"""


def build_chain():
    embeddings = OpenAIEmbeddings(model=settings.embedding_model)
    vector_store = Chroma(
        collection_name=settings.collection_name,
        persist_directory=str(settings.persist_directory),
        embedding_function=embeddings,
    )

    retriever = vector_store.as_retriever(search_kwargs={"k": 4})

    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", SYSTEM_PROMPT),
            ("human", "{input}"),
        ]
    )

    llm = ChatOpenAI(model=settings.chat_model, temperature=0)
    document_chain = create_stuff_documents_chain(llm, prompt)
    return create_retrieval_chain(retriever, document_chain)


def ask(question: str) -> dict:
    chain = build_chain()
    result = chain.invoke({"input": question})
    return {
        "question": question,
        "answer": result["answer"],
        "context_count": len(result.get("context", [])),
        "sources": [
            document.metadata for document in result.get("context", [])
        ],
    }

The prompt does a few important things:

  • It tells the model to answer only from retrieved context.
  • It makes uncertainty acceptable.
  • It asks for sources.
  • It forbids invented operational facts.

That will not eliminate hallucinations, but it makes the desired behavior explicit and reviewable. This is the same basic discipline behind writing secure prompts for developer workflows: scope the task, constrain the output, and make failure modes visible.
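
The prompt asks for a confidence label, but the chain returns the reply as free text. If you want to act on that label, one hedged option is to parse it out and treat anything unparseable as low confidence. The extract_confidence helper below is a hypothetical sketch; it assumes the model writes a line such as "confidence: high", which the prompt requests but cannot guarantee.

```python
import re

VALID_CONFIDENCE = {"high", "medium", "low"}


def extract_confidence(answer_text: str) -> str:
    """Pull the confidence label out of the model's reply.

    Falls back to "low" when the label is missing or unrecognized, so a
    malformed reply is never treated as a confident one.
    """
    match = re.search(r"confidence\s*:\s*\**\s*(\w+)", answer_text, re.IGNORECASE)
    if not match:
        return "low"
    label = match.group(1).lower()
    return label if label in VALID_CONFIDENCE else "low"
```

Defaulting to "low" is a deliberate choice: a reply that ignores the requested format is a reason for less trust, not more.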

Add A CLI For Fast Testing

Before building an API, add the cheapest possible interface. Save this as cli.py in the project root, next to the app/ directory:

from app.query import ask


while True:
    question = input("dev-query> ").strip()
    if question in {"exit", "quit"}:
        break
    if not question:
        continue

    result = ask(question)
    print()
    print(result["answer"])
    print()
    print("Sources:")
    for source in result["sources"]:
        print(f"- {source}")
    print()

Run it:

python cli.py

Try:

dev-query> Who owns checkout-service?
dev-query> Where is the checkout infrastructure?
dev-query> What is the on-call policy for search?

That last question should probably fail or return low confidence unless your context actually includes the answer. A useful prototype is not one that answers everything. A useful prototype knows when the indexed data is insufficient.
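
You can also make "insufficient data" a code path instead of relying only on the prompt. The sketch below assumes you have (document metadata, score) pairs with scores normalized so that higher means more relevant. LangChain vector stores expose similarity_search_with_relevance_scores for this, though how meaningful the scores are depends on the store and its distance metric. MIN_SCORE and both function names are hypothetical.

```python
MIN_SCORE = 0.35  # Hypothetical cutoff; tune it against your evaluation set.


def filter_by_score(
    scored_docs: list[tuple[dict, float]], min_score: float = MIN_SCORE
) -> list[dict]:
    """Keep only documents whose relevance score meets the cutoff."""
    return [doc for doc, score in scored_docs if score >= min_score]


def answer_or_refuse(scored_docs: list[tuple[dict, float]]) -> dict:
    """Decide whether there is enough context to call the model at all."""
    relevant = filter_by_score(scored_docs)
    if not relevant:
        return {"status": "insufficient_context", "documents": []}
    return {"status": "ok", "documents": relevant}
```

When the status is "insufficient_context", the service can return "I do not know" without spending a model call, and the refusal is easy to count in logs.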

Expose It With FastAPI

Once the CLI works, add app/api.py:

from fastapi import FastAPI
from pydantic import BaseModel, Field

from app.query import ask


class QueryRequest(BaseModel):
    question: str = Field(min_length=3, max_length=500)


class QueryResponse(BaseModel):
    question: str
    answer: str
    context_count: int
    sources: list[dict]


app = FastAPI(title="Developer Metadata Query API")


@app.post("/ask", response_model=QueryResponse)
def ask_question(request: QueryRequest) -> dict:
    return ask(request.question)

Run it:

uvicorn app.api:app --reload

Then call it:

curl -s http://127.0.0.1:8000/ask \
  -H 'content-type: application/json' \
  -d '{"question":"Who owns checkout-service and where are its docs?"}' | jq

At this point, you have a real prototype boundary:

  • Ingestion can run separately.
  • Querying can be tested independently.
  • The API has request and response schemas.
  • A future Slack bot or internal portal can call the same backend.

That is already better than stuffing a model call into a random bot handler and hoping nobody asks where the data came from.

What To Do About Slack

Slack is a good interface and a bad first architecture.

It is a good interface because developers already ask operational questions there. It is a bad first architecture because Slack-specific concerns can quickly bury the retrieval problem: permissions, retries, event signatures, ephemeral messages, slash command UX, rate limits, and response timeouts.

Build the query API first. Then wire Slack to the API.

A slash command should do roughly this:

  1. Acknowledge the command quickly.
  2. Send the question to your internal query API.
  3. Return a concise answer.
  4. Include links to sources.
  5. Avoid posting sensitive answers into public channels.

That last point is not optional. Developer metadata can expose service topology, incident history, internal repositories, and team responsibilities. Treat it as internal data, not public trivia.
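
As a rough illustration of steps 3 to 5, here is how the reply payload for a slash command might be assembled. The build_slack_response helper is hypothetical, and it assumes each source is a metadata dict with an optional "docs" URL, as produced by the ingestion script. Signature verification and the quick acknowledgement are left out.

```python
def build_slack_response(answer: str, sources: list[dict], max_sources: int = 3) -> dict:
    """Build an ephemeral slash-command reply with the answer and source links."""
    lines = [answer.strip()]
    links = [source["docs"] for source in sources if source.get("docs")][:max_sources]
    if links:
        lines.append("Sources:")
        lines.extend(f"- {link}" for link in links)
    return {
        # "ephemeral" replies are visible only to the user who ran the command.
        "response_type": "ephemeral",
        "text": "\n".join(lines),
    }
```

Defaulting to an ephemeral response means a sensitive answer has to be shared deliberately rather than landing in a public channel by accident.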

Evaluation: The Part Everyone Skips

The prototype is not done when it returns a pretty answer. It is done when you have some idea whether the answer is right.

Create a small evaluation set:

[
  {
    "question": "Who owns checkout-service?",
    "must_include": ["team-payments"],
    "must_not_include": ["team-commerce-platform"]
  },
  {
    "question": "Where is the checkout Terraform?",
    "must_include": ["terraform/services/checkout/rds.tf"]
  },
  {
    "question": "Who owns a service named pricing-v3?",
    "must_include": ["do not know"]
  }
]

Then run it every time you change the prompt, model, chunking strategy, or data source. You do not need a fancy evaluation platform on day one. A small script that catches obvious regressions is enough to keep you honest.
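
A minimal sketch of such a script, assuming the evaluation set has been loaded into a list of dicts and that you pass in a function with the same shape as ask from app/query.py. The run_eval name is hypothetical, and the check is plain case-insensitive substring matching.

```python
from typing import Callable


def run_eval(cases: list[dict], ask_fn: Callable[[str], dict]) -> list[dict]:
    """Run each case through ask_fn and return a list of failures."""
    failures = []
    for case in cases:
        answer = ask_fn(case["question"])["answer"].lower()
        missing = [
            text for text in case.get("must_include", []) if text.lower() not in answer
        ]
        forbidden = [
            text for text in case.get("must_not_include", []) if text.lower() in answer
        ]
        if missing or forbidden:
            failures.append(
                {"question": case["question"], "missing": missing, "forbidden": forbidden}
            )
    return failures
```

Taking ask_fn as a parameter lets you test the runner with a stub before pointing it at the real chain. Substring matching is brittle: a reply of "I don't know" would fail a "do not know" check, so keep the expected phrases aligned with the wording the prompt asks for.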

Later, you can add:

  • Relevance scoring for retrieved documents.
  • Human review of low-confidence answers.
  • Trace logging with LangSmith or another observability tool.
  • Per-source freshness checks.
  • Feedback buttons in Slack or the portal UI.

Without evaluation, you are just vibe-testing infrastructure answers. That is not a great career strategy.

Production Considerations

There are several gaps between this prototype and a production internal developer assistant.

Permissions

The prototype retrieves from one local dataset. A real system needs per-user authorization. If a developer cannot see a repository, incident, document, or deployment record directly, the assistant should not reveal it indirectly.

This is the part teams underestimate. Retrieval systems can become permission laundering systems if you index everything into one bucket and forget who is allowed to see what.
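
A minimal sketch of the idea, assuming each retrieved document carries a "repo" value in its metadata (as the ingestion script sets) and that you can look up the set of repositories a user may read. The filter_by_access helper and the allowed_repos lookup are hypothetical.

```python
def filter_by_access(documents: list[dict], allowed_repos: set[str]) -> list[dict]:
    """Drop retrieved documents whose repo the user cannot read.

    Documents without a repo value are dropped too: deny by default.
    """
    return [doc for doc in documents if doc.get("repo") in allowed_repos]
```

The filter has to run before the context reaches the model, not after the answer is written. Filtering after retrieval can also leave you with fewer documents than you asked for, so pushing the same condition into the vector store query as a metadata filter is likely the better long-term shape.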

Freshness

Developer metadata goes stale quickly. Ownership changes. Services move. Runbooks get replaced. Deployment state changes hourly.

For structured data, prefer live API calls when freshness matters. For docs and runbooks, scheduled re-indexing may be fine. For deployment status, a vector index may be the wrong tool entirely.
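
For indexed content, a simple staleness check can at least flag answers built on old data. The sketch below assumes each document records when it was indexed in a hypothetical indexed_at metadata field, using the same timestamp format as last_deploy in the sample data. The function names and the max_age policy are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-06-08T16:22:00Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stale(indexed_at: str, max_age: timedelta, now: Optional[datetime] = None) -> bool:
    """Return True when a document was indexed longer ago than max_age."""
    current = now or datetime.now(timezone.utc)
    return current - parse_timestamp(indexed_at) > max_age
```

A reasonable policy might use a short max_age for ownership and on-call data and a longer one for runbooks, then lower the reported confidence or add a warning when a source is stale.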

Source Quality

The model cannot rescue bad metadata. If half your service catalog has owner: unknown, your assistant will mostly automate disappointment.

Use the prototype to expose metadata quality problems. That may be the biggest organizational value of the project.
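
A small report over the catalog is often enough to start that conversation. This is a sketch under stated assumptions: REQUIRED_FIELDS mirrors the sample services.json, while the PLACEHOLDERS values and the function names are hypothetical and should be adjusted to whatever your catalog actually contains.

```python
REQUIRED_FIELDS = ["owner", "slack", "pagerduty", "repo", "docs", "infra"]
# Hypothetical values that count as "not really filled in".
PLACEHOLDERS = {"", "unknown", "tbd", "n/a", "none"}


def missing_fields(service: dict) -> list[str]:
    """Return required fields that are absent or hold a placeholder value."""
    return [
        field
        for field in REQUIRED_FIELDS
        if str(service.get(field, "")).strip().lower() in PLACEHOLDERS
    ]


def quality_report(services: list[dict]) -> dict[str, list[str]]:
    """Map each incomplete service name to the fields it is missing."""
    report = {}
    for service in services:
        gaps = missing_fields(service)
        if gaps:
            report[service["name"]] = gaps
    return report
```

Running this during ingestion gives platform teams a concrete list of gaps per service instead of a vague sense that the catalog is incomplete.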

Observability

Log the question, retrieved document IDs, model version, prompt version, latency, and whether the answer was rated useful. Do not log secrets. Do not log raw private content unless you have a clear retention policy.

Prompt and model changes should be versioned. Otherwise, when the assistant starts giving worse answers, you will have no idea why.
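
One lightweight way to version the prompt is to log a short hash of its text with every request. The sketch below is a hypothetical shape for such a log record: the field names and both helpers are assumptions, and it deliberately leaves out the answer text and the retrieved content.

```python
import hashlib
import json


def prompt_version(prompt_text: str) -> str:
    """Short content hash that changes whenever the prompt text changes."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def build_log_record(
    question: str,
    document_ids: list[str],
    model: str,
    prompt_text: str,
    started_at: float,
    finished_at: float,
) -> str:
    """Serialize one query as a JSON log line, without raw document content."""
    record = {
        "question": question,
        "document_ids": document_ids,
        "model": model,
        "prompt_version": prompt_version(prompt_text),
        "latency_ms": round((finished_at - started_at) * 1000),
    }
    return json.dumps(record, sort_keys=True)
```

The timestamps are expected to come from a monotonic clock such as time.perf_counter(). If questions themselves can contain sensitive text in your environment, drop or redact that field as well.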

Where LangChain Helps

LangChain is useful here because it gives you a common vocabulary for prompts, retrievers, models, documents, chains, and integrations. It also makes it easier to swap pieces while you are learning.

But the framework is not the product. The product is the engineering workflow:

  • Can developers find the right owner faster?
  • Can new hires discover systems without interrupting three people?
  • Can incident responders find the relevant runbook under pressure?
  • Can platform teams see where metadata is missing?

If LangChain helps you answer those questions faster, use it. If a smaller custom service would do the job better, use that. The architecture should serve the workflow, not the other way around.

Conclusion

A natural language developer query tool is worth building when it is grounded in real metadata, honest about uncertainty, and connected to the systems engineers already use.

The prototype in this article is intentionally modest: JSON metadata, Chroma, OpenAI embeddings, a LangChain retrieval chain, and FastAPI. That is enough to learn the important lessons without burying yourself in platform work.

Start small. Keep sources visible. Add evaluation earlier than feels necessary. Respect permissions. Do not let the model invent operational facts. If the first version mostly teaches your organization that its metadata is incomplete, that is still useful information.

The goal is not an all-knowing AI teammate. The goal is a practical developer tool that turns scattered engineering metadata into answers people can verify.

For more practical engineering and developer tooling notes, visit Slaptijack.
