With the final season of Stranger Things arriving, Neo4j has done something that may sound frivolous but turns out to be an interesting use case. It built HopperGraph, a system that analyzes fan theories from Reddit to predict which are most likely to prove accurate. The project draws on 150,000 posts, organized into 234,000 nodes and 1.5 million relationships – data points and the connections between them – and the methodology required to make it work has direct parallels to problems enterprises face with their own data.
Stephen Chin, Vice President of Developer Relations at Neo4j, explains that the project was designed to showcase graph database technology – databases built specifically to store and query relationships between data – in an accessible way. But the challenges his team encountered, and how they solved them, speak to broader questions about how organizations should architect AI systems when simple similarity search isn’t enough.
The super-user problem
The goal was to identify which Reddit communities had historically made accurate predictions about the show, then surface their theories about the final season. Neo4j used community detection algorithms to find clusters of related discussions, combined with a Large Language Model (LLM) to analyze whether comments represented predictions and whether those predictions were made before the relevant episode aired.
An early implementation anchored the analysis on individual Reddit users. The result was predictable in hindsight: super-users who comment on every thread dominated the results. These high-volume participants weren’t necessarily better predictors – they just showed up everywhere.
Chin explains that the team remodeled the approach to anchor on threads rather than users:
The dialogue and conversation in a thread between multiple users is what we’re using as a predictor, rather than having it be weighted by one user who just pops in every thread and says a few comments.
This is not a problem unique to fan theories. Chin draws a parallel to financial data, where companies merge, get acquired, and change over time. Tracking individual corporate entities gives a fragmented picture. Tracking the underlying assets as they move between entities over longer periods produces a stronger predictor of behavior. The same principle applies: anchor your analysis on the relationships that matter, not the nodes that happen to be most visible.
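To make the modeling shift concrete, here is a minimal sketch of what thread-anchored scoring might look like via the Neo4j Python driver. The labels, relationship types, and properties (Thread, Comment, Prediction, confirmed, made_at, episode_air_date) are hypothetical stand-ins rather than HopperGraph's actual schema; the point is that accuracy aggregates per thread, so a single prolific user cannot dominate the ranking.

```python
from neo4j import GraphDatabase

# Hypothetical schema: (:Thread)<-[:IN_THREAD]-(:Comment)-[:MAKES]->(:Prediction)
# Each Prediction carries made_at, episode_air_date, and a boolean confirmed.
THREAD_ACCURACY = """
MATCH (t:Thread)<-[:IN_THREAD]-(c:Comment)-[:MAKES]->(p:Prediction)
WHERE p.made_at < p.episode_air_date          // only predictions made before the episode aired
WITH t, count(p) AS total,
     sum(CASE WHEN p.confirmed THEN 1 ELSE 0 END) AS correct
WHERE total >= $min_predictions               // ignore threads with too little signal
RETURN t.title AS thread, toFloat(correct) / total AS accuracy
ORDER BY accuracy DESC
LIMIT 10
"""

def top_threads(uri, user, password, min_predictions=5):
    # Accuracy is computed per thread, so the ranking surfaces productive
    # conversations rather than the most prolific individual posters.
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            records = session.run(THREAD_ACCURACY, min_predictions=min_predictions)
            return [r.data() for r in records]
```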
Why vector databases aren’t enough
The comparison between graph databases and relational databases is well established – multi-level joins across relational tables are slower than equivalent hops across a graph. But Chin focuses on a more current question: where graph databases outperform vector databases, which convert text into numerical representations for similarity matching and have become popular infrastructure for LLM applications.
Vector embeddings, Chin notes, amount to similarity search. They work well when a query can be summarized with a single match or relationship – finding products similar to a search term, for instance, or retrieving documents that mention a specific topic. Where they struggle is with queries that involve multiple relationships and constraints.
He points to shopping agents as an example of the limitation. After several years of LLM availability, finding a product with specific characteristics and comparing options should be straightforward. In practice, agents are good at surfacing what other people have recommended but fall apart on specifics. Chin is blunt:
When you actually get to details, like you want dimensions or sizes or specifications, they’re useless, right? They don’t actually have the data categorized in a meaningful way to take actions.
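A hedged sketch of the contrast helps here. The first query below uses Neo4j's vector index procedure and can only rank catalog entries by resemblance; the second walks explicit relationships and spec properties, which is the kind of dimensional question Chin says agents fumble. Every label, property, index name, and parameter in both queries is invented for the example.

```python
# Similarity search: "what resembles this query?"
# 'product_embeddings' is an assumed vector index name.
SIMILARITY_ONLY = """
CALL db.index.vector.queryNodes('product_embeddings', 10, $query_embedding)
YIELD node, score
RETURN node.name AS product, score
"""

# Relationship-and-constraint search over a hypothetical product catalog:
# compatibility, weight, screen size, and local stock are explicit structure
# the query can traverse, not text to be matched approximately.
SPEC_MATCH = """
MATCH (p:Product)-[:MADE_BY]->(b:Brand),
      (p)-[:COMPATIBLE_WITH]->(:Accessory {name: $accessory}),
      (p)-[:STOCKED_AT]->(s:Store {city: $city})
WHERE p.weight_kg <= $max_weight
  AND p.screen_inches <= $max_screen
RETURN p.name AS product, b.name AS brand, s.name AS store
ORDER BY p.price ASC
"""
```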
He offers another example from a customer working with public Request for Proposal (RFP) records. A vector database can reliably answer simple questions – are there any projects that might involve painting? But ask a more complex question – "are there projects available in a certain time window that match the services a catering company provides, cross-referenced against its menu and capacity?" – and the system struggles. It lacks the structure to match against multiple databases, handle time-span information, and navigate unstructured project documents simultaneously. Chin observes:
LLMs are not magical. They hit the same limitation with large data sets. If you can’t expose that structure in a meaningful way – and graph databases are a great way of doing this – then the LLM will struggle to give back good results.
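The RFP scenario translates into the same pattern. Below is a hedged sketch, with invented labels (Project, Service, Company) and properties, of how the time window, service match, and capacity constraint become a single traversal once they exist as structure rather than free text.

```python
# Hypothetical RFP-matching query: all labels, properties, and parameters
# are assumptions for illustration, not a customer's actual schema.
RFP_MATCH = """
MATCH (c:Company {name: $company})-[:OFFERS]->(s:Service)<-[:REQUIRES]-(p:Project)
WHERE p.start_date >= date($window_start)     // projects inside the time window
  AND p.end_date   <= date($window_end)
  AND p.expected_headcount <= c.max_capacity  // within the caterer's capacity
RETURN p.title AS project,
       p.start_date AS starts,
       collect(s.name) AS matched_services
ORDER BY starts
"""
```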
The end of easy use cases
Chin is candid about the trajectory of AI implementation. When LLMs arrived a few years ago, he recalls:
...the expectations of what’s possible were just huge. Suddenly, LLMs are doing our kids’ homework. They’re going to be able to solve all these big enterprise challenges.
The reality has been more complicated. He summarizes where things stand:
We’re running out of the easy use cases, where we just throw a large set of unstructured data on it and magic comes out the other side. Now we’re hitting the useful, but much harder set of use cases where we actually want to do some work and get meaningful results.
A law firm, Quarles and Brady, illustrates what this harder work looks like. The firm built a legal research system using graph database architecture, pulling in millions of historical documents across formats ranging from modern PDFs to legacy PageMaker files. The system needed to track not just case content but timeframes – when laws took effect, when they expired, which jurisdictions they applied to. The project required data cleanup to remove duplicate nodes and spurious relationships, schema guidance during construction, and production infrastructure including monitoring and Kubernetes deployment. Chin notes that knowledge graphs built with LLM assistance tend to produce messier results than those constructed by data scientists – useful acceleration, but not a substitute for rigor.
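One small illustration of what that cleanup step can involve: a hedged sketch of collapsing duplicate entity nodes that LLM-assisted extraction produced under slightly different names. It assumes the APOC plugin is installed and that duplicates can be grouped by a normalized name; the Entity label and name property are invented for the example.

```python
# Group nodes whose names normalize to the same key, then merge each group
# into a single node, combining properties and re-pointing relationships.
DEDUPLICATE = """
MATCH (e:Entity)
WITH toLower(trim(e.name)) AS key, collect(e) AS dupes
WHERE size(dupes) > 1
CALL apoc.refactor.mergeNodes(dupes, {properties: 'combine', mergeRels: true})
YIELD node
RETURN key, node.name AS canonical
"""
```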
Where this is heading
Chin predicts that graph algorithms and databases will become core AI infrastructure. Graph databases are flexible structures that don’t require a fixed data model upfront, making them easy to get started with. LLMs are effective at constructing graph queries, making the data more accessible to end users. And research is increasingly focused on graph neural networks, agent memory storage, and graph-based agent orchestration.
There’s also a practical pressure driving this convergence. Improvements between LLM generations are slowing – Chin notes that the gap between versions four and five represents only 10-15% improvement on practical benchmarks, and future gaps may be smaller still. If the models themselves are plateauing, augmenting them with structured data becomes more important. As he puts it:
We need to find other ways we can expand the usage and meet the promise of what we’ve all been signed up to deliver from AI technology.
My take
There’s a pleasing irony in using a TV show about parallel dimensions to demonstrate a point about data architecture. The Upside Down, in Stranger Things, is a dark mirror of the real world – recognizable but distorted, hostile to navigation. That’s not a bad description of what happens when organizations try to run complex queries against unstructured data without the right infrastructure underneath.
The thread-versus-user insight from HopperGraph is the kind of modeling decision that separates useful AI implementations from impressive demos. It’s easy to anchor on the most visible data points. It’s harder – and more valuable – to identify which relationships actually predict outcomes. That’s true whether you’re forecasting which fan community has the best track record on plot predictions or trying to match RFPs against a catering company’s capabilities.
Chin’s observation that the easy use cases are running out deserves attention. The first wave of generative AI adoption rewarded speed over architecture. The next wave will reward organizations that did the data preparation work – even when it wasn’t obvious yet why they needed to.
There’s a fitting detail in HopperGraph itself. The Demogorgon agent has access to the same underlying graph data as Eleven or Max – but when you chat with it, it just outputs incomprehensible noise. Same data, different interface, untranslated results. As Chin observed: "You just don’t know how to interpret it." The Demogorgon, as any fan knows, is widely misunderstood. So, apparently, is what it takes to make AI actually work.