
Teaching AI to Query – Part 2: From Knowledge Graph to Dynamic SQL with Claude, Neo4j, and a Dash of Prompt Engineering

Once the idea was clear – use a knowledge graph to teach an LLM about your schema – the next step was execution.

So let’s get technical.


Making the LLM Endpoint a Variable

Instead of hardcoding my LLM choice, I made the endpoint dynamic. Why? Because experimentation is inevitable — and when you’re iterating across Claude, GPT-4, Mistral, or your own fine-tuned models, you don’t want to rip apart your backend every time.

I created an LLM_ENDPOINT variable, something like:

```python
import os

# Resolve the model endpoint from the environment so it stays configurable
LLM_ENDPOINT = os.getenv("ENDPOINT_SQL_GENERATOR")
```

Now switching providers is as simple as updating a .env value or flipping a dropdown in the Streamlit UI.
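Downstream code then just calls whatever the variable points at. Here's a minimal sketch of that dispatch, assuming a plain JSON-over-HTTP interface; the MODEL_SQL_GENERATOR variable, payload, and response shape are illustrative, since every provider's API differs:

```python
import os
import requests

LLM_ENDPOINT = os.getenv("ENDPOINT_SQL_GENERATOR")
# Hypothetical companion variable so the model name is also config, not code
LLM_MODEL = os.getenv("MODEL_SQL_GENERATOR", "claude-3-5-sonnet-latest")

def generate_sql(prompt: str) -> str:
    """Send a composed prompt to whichever LLM the environment points at."""
    response = requests.post(
        LLM_ENDPOINT,
        json={"model": LLM_MODEL, "prompt": prompt},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["text"]  # response shape varies by provider
```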

For this phase, I used Claude 3.5 Sonnet: it’s fast, sharp, and especially good at structured reasoning.


Walking the Knowledge Graph with Neo4j

Before generating SQL, the LLM needs to understand the question in the context of the schema.

Here’s how I made that happen:

  1. User prompt (e.g., “Show me average revenue by region over the last 6 months”) comes in.

  2. I query Neo4j to get (sketched in code after this list):

    • Relevant table nodes

    • Relationships (foreign keys, join paths)

    • Field properties (names, types, descriptions, tags, display_name)

  3. I package that into a graph context block — a structured format the LLM can parse cleanly.
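Step 2, the graph walk itself, might look like the sketch below; the records it returns get packaged into the context block shown next. The node labels (Table, Field) and relationship types (HAS_FIELD, FOREIGN_KEY) are assumptions about the graph model, not its actual shape:

```python
from neo4j import GraphDatabase

# Assumed graph model: (:Table)-[:HAS_FIELD]->(:Field),
# with (:Field)-[:FOREIGN_KEY]->(:Field) marking join paths
GRAPH_CONTEXT_QUERY = """
MATCH (t:Table)-[:HAS_FIELD]->(f:Field)
WHERE t.name IN $tables
OPTIONAL MATCH (f)-[:FOREIGN_KEY]->(fk:Field)
RETURN t.name AS table_name,
       collect({name: f.name, type: f.type,
                display_name: f.display_name,
                joins_to: fk.name}) AS fields
"""

def fetch_graph_context(driver, tables):
    """Pull the tables, fields, and join paths relevant to the question."""
    with driver.session() as session:
        result = session.run(GRAPH_CONTEXT_QUERY, tables=tables)
        return [record.data() for record in result]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
context = fetch_graph_context(driver, ["sales_data", "regions"])
```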

Example snippet of the graph context passed in:

```json
{
  "tables": ["sales_data", "regions"],
  "relationships": [
    { "from": "sales_data.region_id", "to": "regions.region_id" }
  ],
  "fields": {
    "sales_data.revenue": { "type": "decimal", "display_name": "Revenue" },
    "sales_data.date": { "type": "date", "display_name": "Sale Date" },
    "regions.name": { "type": "string", "display_name": "Region Name" }
  }
}
```

This isn’t just metadata; it’s contextual fuel.


The Prompt Engineering Layer

Next came the trickiest part: getting consistently clean SQL from the LLM.

I learned this fast: raw prompting = unpredictable results. You need structure.

Here’s the flow I use now (stitched together in code right after the list):

  1. Instruction Layer – teach the LLM how to behave

    ```text
    You are a SQL analyst. Use only the tables and fields listed below. Optimize for clarity and correctness.
    ```

  2. Context Layer – the graph walk output

  3. Prompt Layer – inject the user’s question

    ```text
    User prompt: “Show me average revenue by region over the last 6 months”
    ```

  4. Constraints Layer – reinforce best practices

    ```text
    - Use table aliases
    - Format joins explicitly
    - Include `WHERE` clauses for temporal filtering
    - Output a single SQL query only
    ```
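Stitched together, the composition might look like this minimal sketch (the build_prompt helper and exact layer strings are illustrative):

```python
def build_prompt(graph_context: str, user_question: str) -> str:
    """Compose instruction, context, prompt, and constraints layers in order."""
    instruction = (
        "You are a SQL analyst. Use only the tables and fields listed below. "
        "Optimize for clarity and correctness."
    )
    constraints = "\n".join([
        "- Use table aliases",
        "- Format joins explicitly",
        "- Include WHERE clauses for temporal filtering",
        "- Output a single SQL query only",
    ])
    return (
        f"{instruction}\n\n"
        f"Schema context:\n{graph_context}\n\n"
        f"User prompt: {user_question}\n\n"
        f"Constraints:\n{constraints}"
    )
```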

With the prompt composed this way, Claude produced clean, explainable SQL more than 90% of the time. Here’s an example of what it returned:

```sql
SELECT
    r.name AS region_name,
    AVG(s.revenue) AS avg_revenue
FROM sales_data s
JOIN regions r ON s.region_id = r.region_id
WHERE s.date >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY r.name
ORDER BY avg_revenue DESC;
```

And if it failed or was missing something? I had fallback logic to auto-retry with minor tweaks, including more schema hints or adjusted date logic.
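That retry loop is simple in spirit. Here's a sketch reusing the generate_sql and build_prompt helpers from above; validate_sql and expand_context are hypothetical stand-ins for the real checks and hint expansion:

```python
def generate_with_retries(question: str, graph_context: str, max_attempts: int = 3) -> str:
    """Regenerate with richer schema hints until the SQL passes validation."""
    context = graph_context
    for _ in range(max_attempts):
        sql = generate_sql(build_prompt(context, question))
        if validate_sql(sql):  # hypothetical: parseable, known tables/fields only
            return sql
        context = expand_context(context)  # hypothetical: add more schema hints
    raise RuntimeError("LLM did not produce valid SQL after retries")
```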


What’s Next

In Part 3, I’ll show how I closed the loop:

  • Sending that LLM-generated SQL to Databricks

  • Streaming the result back to the app

  • Asking the LLM to explain the data in plain English, drawing from the same field definitions in the graph

This isn’t just about querying faster.

It’s about building an analyst that thinks with your data — and can talk to you about what it finds.

Part 1: Teaching AI to Query: Building a Smarter Analyst with Knowledge Graphs, LLMs, and SQL
