I loaded a dataset into both a graph database and a SQL database, then used various large language models (LLMs) to answer questions about the data through a retrieval-augmented generation (RAG) approach. By using the same dataset and questions across both systems, I evaluated which database paradigm delivers more accurate and insightful results.
Retrieval-Augmented Generation (RAG) is an AI framework that enhances large language models (LLMs) by letting them retrieve relevant external information before generating an answer. Instead of relying solely on what the model was trained on, RAG dynamically queries a knowledge source (in this article, a SQL or graph database) and integrates those results into its response. An introduction to RAG can be found here.
SQL databases organize data into tables made up of rows and columns. Each row represents a record, and each column represents an attribute. Relationships between tables are defined using keys and joins, and all data follows a fixed schema. SQL databases are ideal for structured, transactional data where consistency and precision are important — for example, finance, inventory, or patient records.
Graph databases store data as nodes (entities) and edges (relationships) with optional properties attached to both. Instead of joining tables, they directly represent relationships, allowing for fast traversal across connected data. Graph databases are ideal for modelling networks and relationships — such as social graphs, knowledge graphs, or molecular interaction maps — where connections are as important as the entities themselves.
Data
The dataset I used to compare the performance of RAGs contains Formula 1 results from 1950 to 2024. It includes detailed race results of drivers and constructors (teams), covering qualifying, sprint races, the main race, and even lap times and pit stop times. The drivers’ and constructors’ championship standings after every race are also included.
SQL Schema
This dataset is already structured in tables with keys so that a SQL database can be easily set up. The database’s schema is shown below:
SQL Database Design
*Races* is the central table, which is linked to all types of results as well as additional information like seasons and circuits. The results tables are also linked to the *Drivers* and *Constructors* tables to record each driver's and constructor's result at each race. The championship standings after each race are stored in the *Driver_standings* and *Constructor_standings* tables.
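To make the keyed links concrete, here is a hedged sketch of a query that follows them by hand, from *Races* through *Driver_standings* to *Drivers*. The `raceId`, `driverId`, `year`, and `name` columns appear in the example query later in the article; the `position` and `points` columns of the standings table are assumptions based on the schema diagram, not confirmed by the text.

```python
from langchain_community.utilities import SQLDatabase
from config import DATABASE_PATH

# Connect to the same SQLite database used throughout the article.
db = SQLDatabase.from_uri(f"sqlite:///{DATABASE_PATH}")

# Championship leader after the 1992 Belgian Grand Prix.
# Column names `position` and `points` are assumed; table names follow the
# schema diagram (SQLite treats table names case-insensitively).
query = """
SELECT d.forename, d.surname, ds.points
FROM driver_standings ds
JOIN races ra ON ra.raceId = ds.raceId
JOIN drivers d ON d.driverId = ds.driverId
WHERE ra.year = 1992 AND ra.name = 'Belgian Grand Prix' AND ds.position = 1;
"""
print(db.run(query))
```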
Graph Schema
The schema of the graph database is shown below:
Graph Database Design
As graph databases can store information in both nodes and relationships, this schema requires only six node types compared to the 14 tables of the SQL database. The Car node is an intermediate node used to model that a driver drove a car of a constructor at a particular race. Since driver–constructor pairings change over time, this relationship needs to be defined for each race. The race results are stored in the relationships, e.g. :RACED between *Car* and *Race*, while the :STOOD_AFTER relationships contain the driver and constructor championship standings after each race.
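To make the traversal concrete, here is a hedged sketch of how a race winner could be retrieved from such a graph. The article does not name the graph engine, so Neo4j is assumed purely for illustration; the :RACED relationship and the Driver/Car/Race labels follow the schema above, while the :DROVE relationship and the property names (year, name, position, forename, surname) are assumptions.

```python
from langchain_community.graphs import Neo4jGraph

# Placeholder connection details for an assumed Neo4j instance.
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")

# Winner of the 1992 Belgian Grand Prix, traversing the intermediate Car node.
cypher = """
MATCH (d:Driver)-[:DROVE]->(c:Car)-[r:RACED]->(ra:Race)
WHERE ra.year = 1992 AND ra.name = 'Belgian Grand Prix' AND r.position = 1
RETURN d.forename, d.surname
"""
print(graph.query(cypher))
```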
Querying the Database
I used LangChain to build a RAG chain for both database types that generates a query based on a user question, runs the query, and converts the query result into an answer for the user. The code can be found in this repo. I defined a generic system prompt that could be used to generate queries for any SQL or graph database. The only data-specific information was the auto-generated database schema, which was inserted into the prompt. The system prompts can be found here.
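The chain's three steps mirror the output keys shown in the example below (write_query, execute_query, generate_answer). The following is only a minimal sketch of what these steps could look like for the SQL case, using standard LangChain building blocks; the actual GraphQAChain in the repo may be implemented differently.

```python
from langchain_core.prompts import ChatPromptTemplate

def write_query(llm, db, question: str) -> str:
    # Generic prompt: the only data-specific part is the auto-generated schema.
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You write queries for a database with this schema:\n{schema}\n"
                   "Return only the query."),
        ("human", "{question}"),
    ])
    return (prompt | llm).invoke(
        {"schema": db.get_table_info(), "question": question}
    ).content

def execute_query(db, query: str) -> str:
    # Run the generated query against the database.
    return db.run(query)

def generate_answer(llm, question: str, result: str) -> str:
    # Turn the raw query result into a natural-language answer.
    msg = llm.invoke(
        f"Question: {question}\nQuery result: {result}\nAnswer the question concisely."
    )
    return msg.content
```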
Here is an example of how to initialize the model chain and ask the question: “What driver won the 92 Grand Prix in Belgium?”
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from qa_chain import GraphQAChain
from config import DATABASE_PATH
# connect to database
connection_string = f"sqlite:///{DATABASE_PATH}"
db = SQLDatabase.from_uri(connection_string)
# initialize LLM
llm = ChatOpenAI(temperature=0, model="gpt-5")
# initialize qa chain
chain = GraphQAChain(llm, db, db_type='SQL', verbose=True)
# ask a question
chain.invoke("What driver won the 92 Grand Prix in Belgium?")
Which returns:
{'write_query': {'query': "SELECT d.forename, d.surname
FROM results r
JOIN races ra ON ra.raceId = r.raceId
JOIN drivers d ON d.driverId = r.driverId
WHERE ra.year = 1992
AND ra.name = 'Belgian Grand Prix'
AND r.positionOrder = 1
LIMIT 10;"}}
{'execute_query': {'result': "[('Michael', 'Schumacher')]"}}
{'generate_answer': {'answer': 'Michael Schumacher'}}
The SQL query joins the Results, Races, and Drivers tables, filters for the 1992 Belgian Grand Prix, and selects the driver who finished first. The LLM converted the year 92 to 1992 and the race name from “Grand Prix in Belgium” to “Belgian Grand Prix”. It derived these conversions from the database schema, which included three sample rows from each table. The query result is “Michael Schumacher”, which the LLM returned as the answer.
Evaluation
Now the question I want to answer is whether an LLM is better at querying the SQL or the graph database. I defined three difficulty levels (easy, medium, and hard): easy questions could be answered by querying data from only one table or node, medium questions required one or two links among tables or nodes, and hard questions required more links or subqueries. For each difficulty level I defined five questions. Additionally, I defined five questions that could not be answered with data from the database.
I answered each question with three LLM models (GPT-5, GPT-4, and GPT-3.5-turbo) to analyze whether the most advanced models are needed or whether older and cheaper models can also produce satisfactory results. If a model gave the correct answer, it got 1 point; if it replied that it could not answer the question, it got 0 points; and if it gave a wrong answer, it got -1 point. All questions and answers are listed here. Below are the scores of all models and database types:
| Model | Graph DB | SQL DB |
| --- | --- | --- |
| GPT-3.5-turbo | -2 | 4 |
| GPT-4 | 7 | 9 |
| GPT-5 | 18 | 18 |
Model – Database Evaluation Scores
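As a quick sanity check of the scoring scheme (a sketch, not the evaluation code from the repo): with 20 questions in total, a model that answers all but one correctly and gets that one wrong ends up with 18 points, matching the GPT-5 scores above.

```python
# +1 for a correct answer, 0 for "cannot answer", -1 for a wrong answer.
POINTS = {"correct": 1, "no_answer": 0, "wrong": -1}

def score(outcomes):
    return sum(POINTS[o] for o in outcomes)

# GPT-5 answered 19 of the 20 questions correctly and got 1 wrong with either database.
print(score(["correct"] * 19 + ["wrong"]))  # 18
```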
It is remarkable how more advanced models outperform simpler ones: GPT-3.5-turbo got about half of the questions wrong, GPT-4 got 2 to 3 questions wrong but could not answer 6 to 7 questions, and GPT-5 got all except one question correct. Simpler models seem to perform better with the SQL than with the graph database, while GPT-5 achieved the same score with either database.
The only question GPT-5 got wrong using the SQL database was “Which driver won the most world championships?”. The answer “Lewis Hamilton, with 7 world championships” is not correct because both Lewis Hamilton and Michael Schumacher won 7 world championships. The generated SQL query aggregated the number of championships by driver, sorted them in descending order, and selected only the first row, even though the driver in the second row had the same number of championships.
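One way to make such a query tie-aware is to keep every driver whose title count equals the maximum instead of taking only the first row. The sketch below illustrates the pattern; it assumes championships can be counted from the Driver_standings rows after each season's final race, and the `position` and `round` columns are assumptions not shown in the article.

```python
# Tie-aware variant: return all drivers who share the highest title count,
# rather than ORDER BY ... LIMIT 1, which silently drops ties.
tie_aware_query = """
WITH titles_per_driver AS (
    SELECT ds.driverId, COUNT(*) AS titles
    FROM driver_standings ds
    JOIN races ra ON ra.raceId = ds.raceId
    WHERE ds.position = 1
      AND ra.round = (SELECT MAX(r2.round) FROM races r2 WHERE r2.year = ra.year)
    GROUP BY ds.driverId
)
SELECT d.forename, d.surname, t.titles
FROM titles_per_driver t
JOIN drivers d ON d.driverId = t.driverId
WHERE t.titles = (SELECT MAX(titles) FROM titles_per_driver);
"""
print(db.run(tie_aware_query))  # db is the SQLDatabase connection from earlier
```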
Using the graph database, the only question GPT-5 got wrong was “Who won the Formula 2 championship in 2017?”, which it answered with “Lewis Hamilton” (Lewis Hamilton won the Formula 1 but not the Formula 2 championship that year). This is a tricky question because the database only contains Formula 1 results, not Formula 2 results. The expected answer would have been to reply that this question could not be answered based on the provided data. However, considering that the system prompt did not contain any specific information about the dataset, it is understandable that this question was not answered correctly.
Interestingly, using the SQL database, GPT-5 gave the correct answer “Charles Leclerc”. The generated SQL query only searched the drivers table for the name “Charles Leclerc”. Here the LLM must have recognized that the database does not contain Formula 2 results and answered the question from its general knowledge. Although this led to the correct answer in this case, it can be dangerous when the LLM does not use the provided data to answer questions. One way to reduce this risk could be to explicitly state in the system prompt that the database must be the only source used to answer questions, for example with a sentence like the one sketched below.
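A possible wording for such an instruction (illustrative only, not the prompt used in the repo) could be appended to the generic system prompt:

```python
# Illustrative grounding instruction to append to the system prompt.
GROUNDING_RULE = (
    "Use only the results of queries against the provided database to answer. "
    "If the database does not contain the required information, state that the "
    "question cannot be answered from the available data."
)
```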
Conclusion
This comparison of RAG performance using a Formula 1 results dataset shows that the latest LLMs perform exceptionally well, producing highly accurate and contextually aware answers without any additional prompt engineering. While simpler models struggle, newer ones like GPT-5 handle complex queries with near-perfect precision. Importantly, there was no significant difference in performance between the graph and SQL database approaches – users can simply choose the database paradigm that best fits the structure of their data.
The dataset used here serves only as an illustrative example; results may differ when using other datasets, especially those that require specialized domain knowledge or access to non-public data sources. Overall, these findings highlight how far retrieval-augmented LLMs have advanced in integrating structured data with natural language reasoning.
If not stated otherwise, all images were created by the author.