Introduction
How do we identify latent groups of patients in a large cohort? How can we find similarities among patients that go beyond the well-known comorbidity clusters associated with specific diseases? And more importantly, how can we extract quantitative signals that can be analyzed, compared, and reused across different clinical scenarios?
The information associated with cohorts of patients consists of large corpora that come in various formats. The data is usually difficult to process due to its quality and complexity, with overlapping symptoms, ambiguous diagnoses, and numerous abbreviations.
These datasets are usually highly interconnected and provide perfect examples where the use of knowledge graphs is quite beneficial. A graph has the advantage of making the relationships between patients and the related entities (diseases in our case) explicit, preserving all the connections between these features.
In a graph setting, we replace the standard clustering methods (e.g. k-means) with community detection algorithms, which identify how groups of patients organize themselves via common syndromes.
With these observations in mind, we arrive at our exploratory question:
How can we layer graph algorithms with spectral methods to reveal clinically meaningful structure in patient populations that traditional approaches miss?
To address this question, I built an end-to-end clinical graph pipeline that generates synthetic notes, extracts Disease entities, constructs a Neo4j patient-disease knowledge graph, detects communities with the Leiden algorithm, and analyzes their structure using algebraic connectivity and the Fiedler vector.
The Leiden algorithm partitions the graph into clusters, but it does not give insight into the internal structure of these communities.
This is where spectral graph theory becomes relevant. Associated to any graph, we can construct matrices such as the adjacency matrix and the graph Laplacian whose eigenvalues and eigenvectors encode structural information about the graph. In particular, the second smallest eigenvalue of the Laplacian (the algebraic connectivity) and its associated eigenvector (the Fiedler vector) are going to play an essential role in the upcoming analysis.
In this blog, the readers will see how:
- the synthetic clinical notes are generated,
- the disease entities are extracted and parsed,
- the Leiden communities are leveraged to extract information about the cohort,
- the algebraic connectivity measures the strength of a community,
- the Fiedler vector is leveraged to further partition communities.
Even in a small synthetic dataset, some communities form coherent syndromes, while others reflect coincidental conditions overlap. Spectral methods give us a precise way to measure these differences and reveal structure that would otherwise go unnoticed. Although this project operates on synthetic data, the approach generalizes to real-world clinical datasets, and shows how the spectral insights complement the community detection methods.
💡Data, Code & Images:
Data Disclaimer: All examples in this article use a fully synthetic dataset of clinical notes generated specifically for this project.
Code Source: All code, synthetic data, notebooks and configuration files are available in the companion GitHub repository. The knowledge graph is built using the Neo4j Desktop with the GDS plugin. You can reproduce the full pipeline, from synthetic note generation to Neo4j graph analysis and spectral computations, in Google Colab and/or a local Python environment.
Images: All figures and visualizations in this article were created by the author.
Methodology Overview
In this section we outline the steps of the project, from synthetic clinical text generation to community detection and spectral analysis.

The workflow proceeds as follows:
- Synthetic Data Generation. Produce a corpus of about 740 synthetic history of present illness (HPI) style clinical notes with controlled disease content and clear note-formatting instructions.
- Entity Extraction and Deduplication. Extract Disease entities using an OpenMed NER model and apply a fuzzy-matching deduplication layer.
- Knowledge Graph Construction. Create a bipartite graph with the schema Patient - HAS_DISEASE -> Disease.
- Community Detection. Apply the Leiden community detection algorithm to identify clusters of patients that share related conditions.
- Spectral Analysis. Compute the algebraic connectivity to measure the internal homogeneity of each community, and use the Fiedler vector to partition the communities in meaningful sub-clusters.
This brief overview establishes the full analytical flow. The next section details how the synthetic clinical notes were generated.
Synthetic Data Generation
For this project, I generated a corpus of synthetic clinical notes using the OpenAI API, working in Google Colab for convenience. The full prompt and implementation details are available in the repository.
After several iterations, I implemented a dynamic prompt that randomly selects a patient’s age and gender to ensure variability across samples. Below is a summary of the main constraints from the prompt:
- Clinical narrative: coherent narratives focused on 1-2 dominant organ systems, with natural causal progression.
- Controlled entity density: each note contains 6-10 meaningful conditions or symptoms, with guardrails to prevent entity overload.
- Diversity controls: diseases are sampled across the common to rare spectrum in specified proportions and the primary organ systems are selected uniformly from 12 categories.
- Safety constraints: no identifying information is included.
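The dynamic prompt can be sketched in a few lines of Python. The age range, wording, and organ-system names below are illustrative stand-ins, not the exact prompt or category list from the repository:

```python
import random

# Illustrative subset of the 12 organ-system categories (names assumed).
ORGAN_SYSTEMS = [
    "cardiovascular", "respiratory", "neurological", "gastrointestinal",
]

def build_prompt(seed=None):
    """Assemble a dynamic prompt with randomized age, gender, and organ system."""
    rng = random.Random(seed)
    age = rng.randint(18, 90)
    gender = rng.choice(["man", "woman"])
    system = rng.choice(ORGAN_SYSTEMS)
    return (
        f"Write a synthetic HPI-style clinical note for a {age}-year-old {gender}. "
        f"Focus on 1-2 dominant organ systems (primary: {system}). "
        "Include 6-10 meaningful conditions or symptoms. "
        "Do not include any identifying information."
    )

prompt = build_prompt(seed=7)
```

Each generation call then sends a freshly built prompt to the API, so the sampled demographics and organ systems vary across notes.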
A key challenge in constructing such a synthetic dataset is avoiding an over-connected graph in which many patients share the same handful of conditions. A simpler prompt may produce reasonably good individual patient notes but a poor overall distribution of diseases. To counteract this, I explicitly asked the model to think through its choices and to periodically reset its selection pattern to prevent repetition. These instructions increase the model's decision complexity and slow generation, but they yield a more diverse and realistic dataset. Generating 1,000 samples with gpt-5-mini took about 4 hours.
Each generated sample includes two features: a clinical_note (the generated text) and a patient_id (unique identifier assigned during generation). About 260 entries were blank and were removed during preprocessing, leaving 740 notes, which is sufficient for this mini-project.
For context, here is a sample synthetic clinical note from the dataset:
“A 50-year-old man presents with six weeks of progressive exertional dyspnea and a persistent nonproductive cough that began after a self-limited bronchitis. … He reports daytime fatigue and loud snoring with witnessed pauses consistent with obstructive sleep apnea; he has well-controlled hypertension and a 25 pack-year smoking history but quit five years ago. He denies fever or orthopnea.”
✨Insights: Synthetic data is convenient to obtain, especially when medical datasets require special permissions. Despite its usefulness for concept demonstration, synthetic data can be unreliable for drawing clinical conclusions and it should not be used for clinical inference.
With the dataset prepared, the next step is to extract clinically meaningful entities from each note.
Entity Extraction & Deduplication
The goal of this stage is to transform unstructured clinical notes into structured data. Using a biomedical NER model, we extract the relevant entities, which are then normalized and deduplicated before building the relationship pairs.
Why only disease NER?
For this mini-project, I focused solely on disease entities, since they are prevalent in the generated clinical notes. This keeps the analysis coherent and allows us to highlight the relevance of algebraic connectivity without introducing the additional complexity of multiple entity types.
Model Selection
I selected a specialized NER model from OpenMed (see reference [1] for details), an excellent open-source collection of biomedical NLP models: OpenMed/OpenMed-NER-PathologyDetect-PubMed-109M, a small yet performant model that extracts Disease entities. This model balances speed and quality, making it well suited for quick experimentation. With GPU acceleration (A100, 40GB), extracting entities from all 740 notes takes under a minute; on CPU it might take 3-5 minutes.
✨Insights: Using aggregation_strategy = "average" prevents word-piece artifacts (e.g., “echin” and “##ococcosis”), ensuring clean entity spans.
Entity Deduplication
Raw NER output is messy by nature: spelling variations, morphological variants, and near-duplicates all occur frequently (e.g. fever, low grade fever, fevers).
To address this issue, I applied a global fuzzy-matching algorithm that deduplicates the extracted entities by clustering similar strings using RapidFuzz's normalized Indel similarity (fuzz.ratio). Within each cluster, the algorithm selects a canonical name, aggregates confidence scores, and counts merged mentions and unique patients. This produces a clean set of unique disease entities suitable for knowledge graph construction.
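The project uses RapidFuzz; the same greedy clustering idea can be sketched with the stdlib difflib.SequenceMatcher as a stand-in (the threshold and the crude normalization below are illustrative, not the repository's exact settings):

```python
from collections import Counter
from difflib import SequenceMatcher

def dedupe_entities(mentions, threshold=0.85):
    """Greedy fuzzy clustering: each mention joins the first cluster whose
    representative is similar enough; otherwise it starts a new cluster."""
    clusters = []
    for m in mentions:
        key = m.lower().strip().rstrip("s")  # crude plural/case normalization
        for cluster in clusters:
            rep = cluster[0].lower().strip().rstrip("s")
            if SequenceMatcher(None, key, rep).ratio() >= threshold:
                cluster.append(m)
                break
        else:
            clusters.append([m])
    # Canonical name = most frequent surface form within each cluster.
    return {Counter(c).most_common(1)[0][0]: len(c) for c in clusters}

canon = dedupe_entities(["fever", "fevers", "low grade fever", "asthma", "Fever"])
# "fever"/"fevers"/"Fever" merge into one cluster; "low grade fever"
# stays separate at this threshold, which may or may not be desirable.
```

Tuning the threshold trades false merges (distinct diseases collapsed together) against leftover duplicates, so it is worth spot-checking the clusters by hand.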
NLP Pipeline Summary
The pipeline consists of the following steps:
- Data Loading: upload the dataset and drop records with empty notes.
- Entity Extraction: apply the NER model to each note and collect disease mentions.
- Deduplication: cluster similar entities using fuzzy matching and select canonical forms.
- Canonical Mapping: assign to each extracted entity (text) the most frequent form as canonical_text.
- Entity ID Assignment: generate unique identifiers for each deduplicated entity.
- Relationships Builder: build the relationships connecting each patient_id to the canonical diseases extracted from its clinical_note.
- CSV Export: export three clean files for Neo4j import.
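The last steps of the pipeline reduce to a couple of dictionary lookups and a CSV dump. A minimal sketch with made-up IDs (the real column names and file layout may differ from the repository's):

```python
import csv
import io

# Hypothetical deduplicated inputs; all IDs and names are made up.
entities = [
    {"entity_id": "DIS_01", "canonical_text": "hypertension"},
    {"entity_id": "DIS_02", "canonical_text": "asthma"},
]
mentions = [  # (patient_id, canonical disease) pairs from the extraction step
    ("PAT_001", "hypertension"),
    ("PAT_001", "asthma"),
    ("PAT_002", "asthma"),
]

# Relationships builder: map canonical names to entity IDs.
name_to_id = {e["canonical_text"]: e["entity_id"] for e in entities}
relationships = [
    {"patient_id": p, "entity_id": name_to_id[d]} for p, d in mentions
]

def to_csv(rows):
    """Serialize a list of dicts to CSV text for Neo4j import."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rel_csv = to_csv(relationships)
```

The same helper serializes the patient and entity tables, giving the three files Neo4j's importer expects.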
With these structured inputs produced, we can now construct the Neo4j knowledge graph, detect patient communities and apply spectral graph theory.
The Knowledge Graph
Graph Construction in Neo4j
I built a bipartite knowledge graph with two node types, Patient and Disease, connected by HAS_DISEASE relationships. This simple schema is sufficient to explore patient similarities and to extract community information.

Figure 1. Patient–disease graph schema (author created).
I used Neo4j Desktop (version 2025.10.1), which offers full access to all Neo4j features and is ideal for small to medium-sized graphs. We also need to install the Graph Data Science (GDS) plugin, which provides the algorithms used later in this analysis.
To keep this section focused, I've moved the graph-building outline to the project's GitHub repository. The process takes less than 5 minutes using Neo4j Desktop's visual importer.
Querying the Knowledge Graph
All graph queries used in this project can be executed directly in Neo4j Desktop or from a Jupyter notebook. For convenience, the repository includes a ready to run KG_Analysis.ipynb notebook with a Neo4jConnection helper class that simplifies sending Cypher queries to Neo4j and retrieving results as DataFrames.
Graph Analytics and Insights
The knowledge graph includes 739 patient nodes and 1,119 disease nodes, connected through 6,400 relationships. The snapshot below, showing a subset of 5 patients and some of their conditions, illustrates the graph structure:

Figure 2. Example subgraph showing five patients and their diseases (author created).
Examining the degree distribution (the number of disease relationships per patient), we find an average of almost 9 diseases per patient, ranging from 2 to as many as 15. The left panel shows the morbidity, i.e. the distribution of diseases per patient. To understand the clinical landscape, the right panel highlights the ten most common diseases. Cardiopulmonary conditions are prevalent, which indicates the presence of large clusters centered on heart and lung complications.
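The per-patient counts behind these statistics can be reproduced with a Counter over the relationship rows. A stdlib sketch on hypothetical (patient, disease) pairs, not the actual exported data:

```python
from collections import Counter
from statistics import mean

# Hypothetical (patient_id, disease_id) relationship rows.
rels = [
    ("P1", "D1"), ("P1", "D2"),
    ("P2", "D2"), ("P2", "D3"), ("P2", "D4"),
]

per_patient = Counter(p for p, _ in rels)   # diseases per patient (morbidity)
disease_freq = Counter(d for _, d in rels)  # prevalence of each disease
avg_morbidity = mean(per_patient.values())
top = disease_freq.most_common(1)           # most common disease
```

On the real graph, the same two Counters yield the morbidity histogram and the top-ten disease ranking shown in Figure 3.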

Figure 3. Basic graph analytics (author created).
These basic analytics offer a glimpse into the graph’s structure. Next, we dive deeper into its topology, by identifying its connected components and analyzing communities of patients and diseases.
Community Detection
Connected Components
We begin by analyzing the overall connectivity of our graph using the Weakly Connected Components (WCC) algorithm in Neo4j. The WCC detects whether two nodes are connected via a path, regardless of the direction of the edges that compose the path.
We first create a graph projection with undirected relationships and then apply the algorithm in stats mode to summarize the structure of the components.
project_graph = '''
CALL gds.graph.project(
    'patient-disease-graph',
    ['Patient', 'Disease'],
    {HAS_DISEASE: {orientation: 'UNDIRECTED'}}
)
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount
'''
conn.query(project_graph)
wcc_stats = '''
CALL gds.wcc.stats('patient-disease-graph')
YIELD componentCount, componentDistribution
RETURN componentCount, componentDistribution
'''
conn.query_to_df(wcc_stats)
The synthetic dataset used here produces a connected graph. Even though our graph contains a single component, we still assign each node a componentId for completeness and compatibility with the general case.
✨Insights: Using the allShortestPaths algorithm, we find that the diameter of our connected graph is 10. Since this is a bipartite graph (patients connected through shared diseases), the maximum separation between any two patients is 4 additional patients.
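For a graph of this size, the diameter can also be verified with plain breadth-first search from every node. A stdlib sketch on a toy bipartite chain (node names hypothetical), where patients and diseases alternate along every path exactly as in our graph:

```python
from collections import deque

# Toy undirected patient-disease adjacency forming the chain
# D1 - P1 - D2 - P2 - D3 - P3 - D4.
graph = {
    "P1": ["D1", "D2"], "P2": ["D2", "D3"], "P3": ["D3", "D4"],
    "D1": ["P1"], "D2": ["P1", "P2"], "D3": ["P2", "P3"], "D4": ["P3"],
}

def eccentricity(graph, src):
    """Distance from src to the farthest reachable node (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

diameter = max(eccentricity(graph, s) for s in graph)  # 6 for this chain
```

Because the graph is bipartite, any patient-to-patient path has even length, which is why a diameter of 10 translates into at most 4 intermediate patients.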
Community Detection Algorithms
Among the community detection algorithms available in Neo4j that do not require prior information about the communities, we narrow the choice down to Louvain, Leiden, and Label Propagation. Leiden (see reference [3]), a hierarchical detection algorithm, addresses the disconnected communities sometimes produced by Louvain and is therefore the superior choice. Label Propagation, a diffusion-based algorithm, could also be reasonable; however, it tends to produce communities with lower modularity than Leiden and is less robust across runs (see reference [2]). For these reasons, we use Leiden.
We then evaluate the quality of the detected communities using:
- Modularity is a metric for assessing the quality of communities formed by community detection algorithms, typically based on heuristics. Its value ranges from −0.5 to 1, with higher values indicating stronger community structures (see reference [2]).
- Conductance is the ratio between relationships that point outside a community and the total number of relationships of the community. The lower the conductance, the more separated a community is.
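Conductance, as defined above, is straightforward to compute from an edge list. A minimal sketch on a toy undirected graph (not GDS output; edge names are made up):

```python
def conductance(edges, community):
    """Edges leaving the community divided by all edges incident to it."""
    external = sum(1 for u, v in edges if (u in community) != (v in community))
    incident = sum(1 for u, v in edges if u in community or v in community)
    return external / incident if incident else 0.0

# Toy edge list: {a, b, c} is a tightly knit triangle with one outside link.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "x"), ("x", "y")]
phi = conductance(edges, {"a", "b", "c"})  # 1 external / 4 incident = 0.25
```

In practice we let GDS compute this per community, but the toy value shows the intuition: the triangle's single outward edge keeps its conductance low.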
Detect Communities with Leiden Algorithm
Before applying the community detection algorithm, we create a graph projection with undirected relationships denoted largeComponentGraph.
To identify clusters of patients who share similar disease patterns, we run Leiden in write mode, assigning each node a communityId. This allows us to persist community labels directly in the Neo4j database for later exploration. To ensure reproducibility, we set a fixed random seed and collect a few key statistics (more statistics are calculated in the associated notebook). However, even with a fixed seed, the algorithm’s stochastic nature can lead to slight variations in results across runs.
leiden_write = '''
CALL gds.leiden.write('largeComponentGraph', {
writeProperty: 'communityId',
randomSeed: 16
})
YIELD communityCount, modularity, modularities
RETURN communityCount, modularity, modularities
'''
conn.query_to_df(leiden_write)
Leiden Results
The Leiden algorithm identified 13 communities with a modularity of 0.53. Inspecting the modularities list from the algorithm’s logs, we see that Leiden performed four optimization iterations, starting from an initial modularity of 0.48 and gradually improving with each step (the full list of values can be found in the notebook).
✨Insights: A modularity of 0.53 indicates that the communities are moderately well formed, which is expected in this scenario, where patients often share the same conditions.
A visual summary of the Leiden communities is provided in the following combined visualization:

Figure 4. Overview of the Leiden communities (author created).
Conductance Evaluation
To assess how internally cohesive the Leiden communities are, we compute the conductance, which is implemented in Neo4j GDS. Lower conductance indicates communities with fewer external connections.
Conductance values in the Leiden communities range from 0.12 to 0.44:
- Very cohesive groups: 0.12-0.20
- Moderately cohesive groups: 0.24-0.29
- Loosely defined communities: 0.35-0.44
This spread suggests structural variability across the detected communities: some have very few external connections, while others have almost half of their connections pointing outward.
Interpreting the Community Landscape
Overall, the Leiden results indicate a heterogeneous and interesting community topology, with a few large communities of patients sharing common clinical patterns, several medium-sized communities and a set of smaller communities representing more specific combinations of conditions.

Figure 5. Leiden community 19: a speech and neurology focused cluster (author created).
For example, communityId = 19 contains only 9 nodes (2 patient nodes and 7 diseases) and is built around speech difficulties and episodic neurological conditions. The community’s conductance score of 0.41 places it among the most externally connected communities.
✨Insights: The two metrics we just analyzed, modularity and conductance, provide two different perspectives: modularity is an indicator for the presence of a community while conductance evaluates how well a community is separated from the others.
Spectral Analysis
In graph theory, the algebraic connectivity tells us more than just whether a graph is connected; it reveals how hard it is to break it apart. Before diving into results, let’s recall a few key mathematical concepts that help quantify how well a graph holds together. The algebraic connectivity and its properties were analyzed in detail in references [4] and [5].
Algebraic Connectivity and the Fiedler Vector
Background & Math Primer
Let G = (V, E) be a finite undirected graph without loops or multiple edges. Given an ordering of the vertices w1, …, wn, the graph Laplacian is the n×n matrix L(G) = [Lij] defined by
[\displaystyle L_{ij} = \begin{cases} -1 & \text{if } (w_i, w_j) \in E \text{ and } i \ne j \\ 0 & \text{if } (w_i, w_j) \notin E \text{ and } i \ne j \\ \deg(w_i) & \text{if } i = j \end{cases}]
where deg(wi) represents the degree of the vertex wi.
The graph Laplacian can also be expressed as the difference L = D – A of two simpler matrices:
- Degree Matrix D – a diagonal matrix with Dii = deg(wi).
- Adjacency Matrix A – with Aij = 1 if wi and wj are connected, and 0 otherwise.
💡Note: The two definitions above are equivalent.
Eigenvalues and Algebraic Connectivity
For a graph with n vertices (where n is at least 2), let the eigenvalues of its Laplacian L(G) be ordered as
[0 = \lambda_1 \le \lambda_2 = a(G) \le \lambda_3 \le \ldots \le \lambda_n]
The algebraic connectivity a(G) is defined as the second smallest Laplacian eigenvalue.
The Laplacian spectrum reveals key structural properties of the graph:
- Zero Eigenvalues: the number of zero eigenvalues equals the number of connected components of the graph.
- Connectivity Test: a(G) > 0 means the graph is connected; a(G) = 0 if and only if the graph is disconnected.
- Robustness: larger values of a(G) correspond to graphs that are more tightly connected; more edge removals are required to disconnect them.
- Complete Graph: for the complete graph Kn, the algebraic connectivity is maximal: a(Kn) = n.
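These properties are easy to check numerically on small graphs. A sketch with NumPy (assumed available, as in the notebook) comparing the complete graph K4 with the path graph P4:

```python
import numpy as np

def laplacian(adj):
    """Graph Laplacian L = D - A from a symmetric 0/1 adjacency matrix."""
    A = np.asarray(adj, dtype=float)
    return np.diag(A.sum(axis=1)) - A

# Complete graph K4: every pair of the 4 vertices is connected.
K4 = np.ones((4, 4)) - np.eye(4)
eig_K4 = np.sort(np.linalg.eigvalsh(laplacian(K4)))  # [0, 4, 4, 4]

# Path graph P4: 0 - 1 - 2 - 3.
P4 = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    P4[i, j] = P4[j, i] = 1
eig_P4 = np.sort(np.linalg.eigvalsh(laplacian(P4)))
```

Both graphs are connected, so the smallest eigenvalue is 0 in each case; but a(K4) = 4 is maximal, while a(P4) = 2 - sqrt(2) ≈ 0.586 reflects how easily a path breaks apart.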
The Fiedler Vector
The eigenvector associated with the algebraic connectivity a(G) is known as the Fiedler vector. It has one component for each vertex in the graph. The signs of these components, positive or negative, naturally divide the vertices into two groups, creating a division that approximately minimizes the number of edges connecting them. In essence, the Fiedler vector reveals how the graph would split if we separated it into two connected components by removing the smallest number of edges (see reference [8], Chp. 22). Let's call this separation the Fiedler bipartition for short.
💡 Note: Some components of the Fiedler vector can be zero, in which case they represent vertices that sit on the boundary between the two partitions. In practice, such nodes are assigned to one side arbitrarily.
Next, we compute both the algebraic connectivity and the Fiedler vector directly from our graph data in Neo4j using Python.
Computation of Algebraic Connectivity
Neo4j does not currently provide a built-in functionality for computing algebraic connectivity, so we use Python and SciPy’s sparse linear algebra utilities to compute algebraic connectivity and the Fiedler vector. This is done via the FiedlerComputer class, which is described below:
FiedlerComputer class
1. Extract edges from Neo4j
2. Map node IDs to integer indices
- Build node-to-index and index-to-node mappings
3. Construct sparse graph Laplacian
- Build symmetric adjacency matrix
- Compute degree matrix from row sums of A
- Form Laplacian L = D – A
4. Compute spectral quantities
- Global mode: use all patient–disease edges
- Community mode: edges within one Leiden community
- Use `eigsh()` to compute the k smallest eigenvalues of L
- Algebraic connectivity = the second smallest eigenvalue
- Fiedler vector = the eigenvector corresponding to algebraic connectivity
5. Optional: write results back to Neo4j
- Store `node.fiedlerValue`
- Add labels FiedlerPositive / FiedlerNegative
The full implementation is included in the notebook KG_Analysis.ipynb in GitHub.
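The steps above can be sketched end to end on a hypothetical edge list (node IDs are made up; the real class pulls the edges from Neo4j and is structured differently):

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigsh

# Hypothetical patient-disease edges forming a small connected subgraph.
edges = [
    ("PAT_1", "DIS_a"), ("PAT_1", "DIS_b"),
    ("PAT_2", "DIS_b"), ("PAT_2", "DIS_c"),
    ("PAT_3", "DIS_c"), ("PAT_3", "DIS_d"),
]

# Step 2: map node IDs to integer indices (and back).
nodes = sorted({n for edge in edges for n in edge})
node_to_idx = {n: i for i, n in enumerate(nodes)}
idx_to_node = {i: n for n, i in node_to_idx.items()}
n = len(nodes)

# Step 3: symmetric sparse adjacency (two entries per undirected edge),
# then L = D - A with D taken from the row sums of A.
rows, cols = [], []
for u, v in edges:
    rows += [node_to_idx[u], node_to_idx[v]]
    cols += [node_to_idx[v], node_to_idx[u]]
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
degrees = np.asarray(A.sum(axis=1)).ravel()
L = diags(degrees) - A

# Step 4: the two smallest eigenpairs; lambda_2 is the algebraic
# connectivity and its eigenvector is the Fiedler vector.
vals, vecs = eigsh(L, k=2, which="SM")
order = np.argsort(vals)
vals, vecs = vals[order], vecs[:, order]
lambda_2 = vals[1]
fiedler = vecs[:, 1]
signs = {idx_to_node[i]: ("+" if x >= 0 else "-") for i, x in enumerate(fiedler)}
```

The sign dictionary at the end is exactly what step 5 writes back to Neo4j as FiedlerPositive / FiedlerNegative labels.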
Computing the Algebraic Connectivity for a Sample Leiden Community
We illustrate the process using Leiden community 14, consisting of 34 nodes and 38 edges.
Extract and validate edges. The constructor receives a Neo4j connection object conn that executes Cypher and returns Pandas DataFrames.
fc = FiedlerComputer(conn)
comm_id = 14
edges_data = fc.extract_edges(fc.query_extract_edges, parameters={'comm_id': comm_id})
Create node <-> index mappings. We enumerate all unique node IDs and create two dictionaries: node_to_idx (for building matrices) and idx_to_node (for writing results back).
direct, inverse, n_nodes = fc.create_mappings(edges_data)
>>node_to_idx sample: [('DIS_0276045d', 0), ('DIS_038a3ace', 1)]
>>idx_to_node sample: [(0, 'DIS_0276045d'), (1, 'DIS_038a3ace')]
>>number of nodes: 34
Build the graph Laplacian matrix. We build the Laplacian matrix from the graph data. For each undirected edge, we insert two entries, one for each direction, so that the adjacency matrix A is symmetric. We then create a sparse matrix representation (csr_matrix), which is memory-efficient for large, sparse graphs. The degree matrix D is diagonal, and it is computed via row sums of the adjacency matrix.
laplacian_matrix = fc.build_matrices(edges_data, direct, n_nodes)
>>Laplacian matrix shape: (34, 34)
Compute algebraic connectivity and the Fiedler vector. We use scipy.sparse.linalg.eigsh to compute the few smallest eigenvalue/eigenvector pairs of the Laplacian (up to k=4 for efficiency).
lambda_global, vector_global = fc.compute(mode="global")
>>Global λ₂ = 0.1102
>>Fiedler vector range: [-0.4431, 0.0081]
To compute the algebraic connectivity and the associated Fiedler vector for all Leiden communities:
results = fc.compute_all_communities().sort_values('lambda_2', ascending=False)
Since the number of communities is small, we can reproduce all the results in the following table. For completeness, the conductance computed in the previous section is also included:

Figure 6. Algebraic connectivity and conductance values for all Leiden communities (author created).
Algebraic connectivity values vary between 0.03 and 1.00 across the Leiden communities. The few communities with a(G) = 1 correspond to small, tightly connected structures, typically a single patient linked to several diseases.
At the other end of the spectrum, communities with very low a(G) (0.03 – 0.07) are loosely connected, often mixing multi-morbidity patterns or heterogeneous conditions.
✨Insights: Algebraic connectivity is a measure of internal coherence.
Labeling the spectral bipartition in Neo4j
Finally, we can write back the results to Neo4j, labeling each node according to the sign of its Fiedler vector component.
fc.label_bipartition(vector_comm, inverse)
>>Added Fiedler labels to 34 nodes
>>Positive nodes: 22
>>Negative nodes: 12
We can visualize this bipartition directly in Neo4j Explorer/Bloom.

Figure 7. Fiedler bipartition of Community 14 (author created).
In the visualization, the 12 nodes with negative Fiedler components appear in lighter colors, while the remaining nodes, with positive Fiedler components, are shown in darker tones.
Interpreting community 14 using the Fiedler vector
Community 14 contains 34 nodes (6 patients, 28 diseases) connected by 38 edges. Its conductance of 0.27 suggests a reasonably well-formed group, but the algebraic connectivity of a(G) = 0.05 indicates that the community can be easily divided.
By computing the Fiedler vector (a 34-dimensional vector with one component per node) and examining the Fiedler bipartition, we observe two connected subgroups (as depicted in the previous image), containing 2 patients with negative Fiedler values and 4 patients with positive Fiedler values.
In addition, it is interesting to note that the positive-side diseases are predominantly ear-nose-throat (ENT) disorders, while the negative side contains neurological and infectious conditions.
Ending Comments
Discussion & Implications
The results of this analysis show that community detection algorithms alone rarely capture the internal structure of patient groups. Two communities may share similar themes yet differ entirely in how their conditions relate to one another. The spectral analysis makes this distinction explicit.
For example, communities with very high algebraic connectivity (a(G) close to 1) often reduce to simple star structures, one patient connected to several conditions. These are structurally simple but clinically coherent. Mid-range connectivity communities tend to behave like stable, well-formed groups with shared symptoms. Finally, the lowest-connectivity communities reveal heterogeneous groups that consist of multi-morbidity clusters or patients whose conditions only partially overlap.
Most importantly, this work affirmatively answers the guiding research question: Can we layer graph algorithms with spectral methods to reveal clinically meaningful structure that traditional clustering cannot?
The goal is not to replace the community detection algorithms, but to complement them with mathematical insights from spectral graph theory, allowing us to refine our understanding of the clinical groupings.
Future Directions & Scalability
The natural questions that arise concern the extent to which these techniques can be applied in real-world or production settings. Although these methods can, in principle, be used in production, I see them primarily as refined tools for feature discovery, data enrichment, exploratory analytics, and uncovering patterns that may otherwise remain hidden.
Key challenges at scale include:
- Handling sparsity and size: Efficient Laplacian computations or approximation methods (e.g. randomized eigensolvers) would be required for real-scale analysis.
- Complexity considerations: Eigenvalue calculations are more expensive than community detection algorithms. Applying multiple layers of community detection to reduce the sizes of the graphs for which we compute the Laplacian is one practical approach that could help.
Promising directions for expansion include:
- Extending the entity layer: Adding medications, labs, procedures would create a richer graph and more clinically realistic communities. Including metadata would increase the level of information, but also increase complexity and make interpretation harder.
- Incremental and streaming graphs: Real patient datasets are not static. Future work could incorporate streaming Laplacian updates or dynamic spectral methods to track how communities evolve over time.
Conclusion
This project shows that combining community detection with spectral analysis offers a practical and interpretable way to study patient populations.
If you want to experiment with this workflow:
- try different NER models,
- change the entity type (e.g. use symptoms instead of diseases),
- experiment with Leiden resolution parameter,
- explore other community detection algorithms; a good alternative is Label Propagation,
- apply the pipeline to open clinical corpora,
- or apply it to a completely different domain or industry.
Understanding how patient communities form, and how stable they are, can support downstream applications such as clinical summarization, cohort discovery, and GraphRAG systems. Spectral methods provide a transparent, mathematically grounded toolset to explore those questions, and this blog demonstrates one way to begin doing that.
References
- M. Panahi, OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets (2025), https://arxiv.org/abs/2508.01630.
- S. Sahu, Memory-Efficient Community Detection on Large Graphs Using Weighted Sketches (2025), https://arxiv.org/abs/2411.02268.
- V.A. Traag, L. Waltman, N.J. van Eck, From Louvain to Leiden: guaranteeing well-connected communities (2019), https://arxiv.org/pdf/1810.08473.
- M. Fiedler, Algebraic Connectivity of Graphs (1973), Czechoslovak Math. J. (23) 298–305. https://snap.stanford.edu/class/cs224w-readings/fiedler73connectivity.pdf
- M. Fiedler, A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory (1975), Czechoslovak Math. J. (25) 607–618. https://eudml.org/doc/12900
- N.M.M. de Abreu, Old and new results on algebraic connectivity of graphs (2007), Linear Algebra Appl. (423) 53–73. https://www.math.ucdavis.edu/~saito/data/graphlap/deabreu-algconn.pdf
- J.C. Urschel, L.T. Zikatanov, Spectral bisection of graphs and connectedness (2014), Linear Algebra Appl. (449) 1–16. https://math.mit.edu/~urschel/publications/p2014.pdf
- S.R. Bennett, Linear Algebra for Data Science (2021), book website.