Would you like to use a Large Language Model (LLM) to extract information from your Knowledge Graph (KG), but your graph contains sensitive data? That’s usually a problem, especially if you rely on third-party LLM APIs. In this post we present a privacy-aware query generation approach that identifies sensitive information in the graph and masks it before sending anything to the LLM. Our experiments indicate that this preserves query quality while preventing sensitive data from leaving your system.
Background
Querying a KG usually requires writing SPARQL or Cypher, which demands both domain knowledge and familiarity with the graph structure. LLMs can simplify this by generating the query directly from a natural language question. However, when the graph contains private information, sending raw data to external LLM services can be unethical or even illegal. Running your own LLM isn’t always feasible either. Until now, this forced users to rely on pre-LLM alternatives with much lower performance.
Privacy-Aware Knowledge Graph Q&A
Sensitive data may leak to an LLM in two ways:
- through the context sent to it (often containing KG values), and
- through the user’s own question. For instance, “Other than Bad Boys, which movies did Will Smith and Martin Lawrence co-star in?” already reveals a possible collaboration between the two actors, which could itself be sensitive information.
Let’s see how our privacy-aware query generation approach avoids these two risks.
Using only graph structure as context
Most KGs are built around an ontology or predefined structure. This structure describes the graph’s logic without exposing sensitive values. We simply send this structure to the LLM, not the actual instances of those concepts.
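To make this concrete, here is a minimal sketch of how a schema-only context could be collected from a Neo4j graph. The connection handling and the `get_graph_structure` helper are illustrative assumptions, not the exact BAF implementation; it relies only on standard Neo4j procedures (`db.labels()`, `db.relationshipTypes()`, `db.propertyKeys()`), so no instance values ever enter the prompt.

```python
# Illustrative sketch: gather only structural metadata from the KG.
# The helper name and connection handling are assumptions for this post.
from neo4j import GraphDatabase

def get_graph_structure(uri: str, user: str, password: str) -> str:
    """Return a textual schema description containing no instance data."""
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        labels = [r["label"] for r in session.run("CALL db.labels()")]
        rel_types = [r["relationshipType"] for r in session.run("CALL db.relationshipTypes()")]
        props = [r["propertyKey"] for r in session.run("CALL db.propertyKeys()")]
    driver.close()
    return (
        f"Node labels: {', '.join(labels)}\n"
        f"Relationship types: {', '.join(rel_types)}\n"
        f"Property keys: {', '.join(props)}"
    )
```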
Masking user questions
User questions may still include sensitive terms. To prevent leaks and help the LLM understand the KG vocabulary, we detect:
- Sensitive values that cannot be sent, and
- Key entities such as node labels, relations, properties, and their synonyms.
Then, **we mask them** and substitute the synonyms of the key entities with their original values in the KG, as seen below.
This matching is currently done via case-insensitive token comparison with KG labels or predefined synonyms. We also allow users to explicitly mark sensitive values by wrapping them in brackets.
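As an illustration of this masking step (a sketch, not the exact BAF code: the `mask_question` helper, the placeholder format, and the synonym map are assumptions), bracketed values are replaced with placeholders and recognized synonyms are normalized to the KG vocabulary:

```python
# Illustrative sketch of the masking step; placeholder format and synonym
# map are assumptions for this post.
import re

def mask_question(question: str, synonyms: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Mask bracketed sensitive values and normalize synonyms to KG terms."""
    masks: dict[str, str] = {}

    def _mask(match: re.Match) -> str:
        placeholder = f"MASKED_{len(masks)}"
        masks[placeholder] = match.group(1)
        return placeholder

    # 1. Sensitive values explicitly marked by the user, e.g. "[Will Smith]"
    masked = re.sub(r"\[([^\]]+)\]", _mask, question)

    # 2. Case-insensitive substitution of synonyms with the KG vocabulary
    for synonym, kg_term in synonyms.items():
        masked = re.sub(rf"\b{re.escape(synonym)}\b", kg_term, masked, flags=re.IGNORECASE)

    return masked, masks

masked, masks = mask_question(
    "Other than [Bad Boys], which films did [Will Smith] and [Martin Lawrence] co-star in?",
    {"films": "Movie", "co-star in": "ACTED_IN"},
)
# masked -> "Other than MASKED_0, which Movie did MASKED_1 and MASKED_2 ACTED_IN?"
# masks  -> {"MASKED_0": "Bad Boys", "MASKED_1": "Will Smith", "MASKED_2": "Martin Lawrence"}
```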
Privacy-aware LLM prompt and reply
After extracting the graph structure and masking the question, we send the LLM a zero-shot prompt containing the KG structure and the masked question. Once the LLM generates a Cypher query, we replace the masked tokens with their original values and execute the query.
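A sketch of how this could look in code, under the same assumptions as the snippets above (the prompt wording and helper names are illustrative, and the LLM call itself is omitted):

```python
# Illustrative sketch: zero-shot prompt construction and local un-masking.
def build_prompt(graph_structure: str, masked_question: str) -> str:
    """Zero-shot prompt containing only the schema and the masked question."""
    return (
        "You are a Cypher expert. The graph has the following structure:\n"
        f"{graph_structure}\n\n"
        "Write a Cypher query that answers the question below. Treat tokens "
        "such as MASKED_0 as opaque string literals.\n"
        f"Question: {masked_question}\n"
        "Return only the Cypher query."
    )

def unmask_query(cypher: str, masks: dict[str, str]) -> str:
    """Restore the original values locally, after the LLM has replied."""
    for placeholder, value in masks.items():
        cypher = cypher.replace(placeholder, value)
    return cypher

# cypher = call_llm(build_prompt(structure, masked))   # hypothetical LLM call
# final_query = unmask_query(cypher, masks)            # executed against the local KG
```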
In practice
We have implemented this approach using the BESSER Agentic Framework (BAF) and compared query quality with and without masking. The results were almost identical: GPT-5 achieved 83.1% accuracy without masking and 82.5% with masking. In other words, in this use case there is no meaningful trade-off between privacy and query quality.
For more details, check out the full paper and the complete implementation on GitHub.
Happy to discuss and hear your thoughts!