Enhancing link prediction in biomedical knowledge graphs with BioPathNet

Main

Biological entities interact in complex ways, crucial for sustaining life1. Understanding these interactions is central to systems biology, with network analysis playing a key role2. Biological networks are represented as graphs, where nodes can represent genes, proteins, diseases and more, and edges denote associations between them. Edges in a biological graph can signify co-regulation between genes or causal relationship (regulatory network)3,4, physical interactions (in protein–protein interaction networks (PPI)5,6), as well as disease–gene associations (such as in disease–gene networks7,8), among many.

Despite increasing high-throughput experiments, our grasp of biological networks is incomplete, leaving many interactions undiscovered. Due to the expense and time involved in wet lab experiments, computational methods such as link prediction (LP) are very important for inferring potential associations within these networks based on the underlying topology9. LP is applied across network biology for diverse tasks ranging from predicting protein interactions to inferring gene regulatory networks10. By revealing hidden connections, LP facilitates the discovery of biomarkers, drug targets and insights into biological interactions11,12. To predict potential relationships between unconnected nodes, one prevalent class of methods uses similarity metrics from traditional graph analysis, such as Personalized PageRank, Jaccard or Katz index13,14. These metrics have been used for predicting disease–gene associations15, including non-coding RNA–disease relationships and drug–disease associations16.

While traditional graph metrics have been successful in biological LP, representation-based learning offers greater expressiveness for capturing the nuances and complexity of nodes in a graph. Nodes are mapped to low-dimensional vector representations called embeddings using shallow and deep nonlinear transformations. Optimized embeddings position nodes with similar network neighbourhoods closely in the embedding space so that links between nodes can be predicted on the basis of their similarity in this space17. Methods include matrix factorization (for example, Mashup18) and random walk approaches (for example, DeepWalk19, node2vec20, struc2vec21).

Network embedding techniques have been successful in drug repurposing, adverse drug reaction prediction, gene function prediction and protein–protein interaction network completion22,23,24,25. For instance, ref. 26 introduced the multiscale interactome, integrating disease-associated proteins, drug targets and biological functions using biased random walks for node embeddings. GeneWalk predicts gene functions via network representation learning with random walks27. Reference 28 constructed a multimodal network of genes and polygenic risk scores for diseases to uncover associations between COVID-19 genes, co-morbidities and genetic predispositions28,29,30.

Recent efforts use heterogeneous multirelational networks, more specifically knowledge graphs (KGs), to better represent biological complexities by modelling facts as subject–predicate–object (SPO) triplets. KG research is increasingly applied to tasks such as question answering and information retrieval, with a key challenge being LP by estimating missing triplet components. Knowledge Graph Embedding (KGE) effectively learns low-rank representations of entities and relations, preserving graph structure and encoding relation semantics by optimizing a training loss that maximizes scores for positive triplets while minimizing those for corrupted triplets31. KGE methods, such as TransE31, DistMult32, ComplEx33 and RotatE34, have shown notable performance in several benchmark tests, but they are often limited to one-hop relations. In large biomedical KGs, relationships between entities are intricate and may involve multihop paths. Methods such as SEAL35 and Grail36 address this by embedding subgraph structures around links, but their scalability is limited due to the need to generate subgraphs for each link, creating bottlenecks in large-scale predictions.

Unlike KGEs, deep learning methods—particularly graph convolutional networks (GCNs)37, a class of message passing neural networks (MPNNs)—have substantially advanced KGEs38,39. These models have greatly improved performance in biological tasks such as drug–target interaction prediction40,41, polypharmacy side effect modelling25, tissue-specific protein function prediction23,24 and drug–disease indication discovery42. In addition, they have been applied to multirelational and heterogeneous graphs to capture diverse entity association semantics.

For example, the relational graph convolutional network (R-GCN) for multirelational KGs38 learns node embeddings by aggregating transformed feature vectors of neighbouring nodes via a normalized sum and uses the DistMult factorization model for LP. While R-GCN incorporates relation-specific transformations based on edge type and direction, making it well suited for multirelational data in KGs, it does not explicitly learn a representation for relations. More expressive methods include relation-aware graph attention network (RAGAT), a GAT-based model that utilizes relation-specific network parameters to learn embeddings for both entities and relations, combining them into joint embeddings for each relation type43. Similarly, the heterogeneous graph transformer (HGT) models graph heterogeneity by employing node- and edge-type-dependent parameters, including a node- and edge-specific attention mechanism. This approach preserves distinct representations for nodes and edges while leveraging its architecture to incorporate information from higher-order neighbourhoods to be used in downstream tasks44.

With the rapid accumulation of biomedical data, understanding disease biology and molecular factors’ roles in phenotypic outcomes is crucial for personalized diagnostics and treatments. KGs have also become the dominant knowledge representation in biomedicine, leveraging databases such as UniProt45, Gene Ontology46,47 and DrugBank48. LP tasks in biomedical KGs, such as the COVID-19 drug candidate exploration49 with RotatE and DistMult, OntoProtein’s Gene Ontology-based KG for protein language model pretraining50, and the node-embedding algorithms for multimodal biomedical KGs51, enhance drug discovery and predict disease co-morbidities. Further, task-specific KGs and frameworks such as BioCypher52 further support KG construction, aiding predictive modelling for drug adverse reactions, repurposing and biological concept associations. One of the latest state-of-the-art biomedical KG, PrimeKG, integrates 20 primary resources including DisGeNet, MONDO Disease Ontology and DrugBank53, encompassing 17,080 diseases and 4 million relationships. It includes diverse disease-associated data such as protein perturbations, biological processes, pathways, phenotypes and drug therapeutic actions, including ‘indications’ and ‘contraindications’ drug–disease edges. The method TxGNN54 leverages PrimeKG via the R-GCN framework, to perform zero-shot inference of drug–disease indications and contraindications on diseases with no known treatment and minimal molecular understanding, efficiently handling therapeutic tasks through adaptive aggregation and iterative optimization of embeddings.

On the basis of the assumption that representing head entities in a graph while considering their path to specific tail nodes might enhance representation learning and lead to more accurate LPs, researchers started developing representation learning frameworks for LP based on the paths between two nodes. The first application of this concept is the study from ref. 55, which introduce KG4SL, a graph neural network (GNN) model that integrates KG message-passing for synthetic lethality (SL) prediction, leveraging a KG with 11 entity types and 24 relevant relationships associated with SL. Further, the neural Bellman–Ford network (NBFNet) introduces a framework for LP inspired by traditional path-based methods56. It represents node pairs as the sum of all path representations (of a given length), each derived from edge representations, and it employs a graph neural network with learned operators for efficient path formulation solutions, scalable to large graphs with low time complexity. NBFNet works with both homogeneous and multirelational graphs, supporting LP across different graph types. Combining traditional path-based methods with GNNs, NBFNet demonstrates superior performance compared with node-embedding methods. In addition, path-embedding methods offer better interpretability by visualizing important paths used for prediction, facilitating verification of biological plausibility.

To tackle LP in biological KGs, we introduce BioPathNet, a message-passing neural network for path representation learning, built on NBFNet. As opposed to node-embedding approaches, BioPathNet uses path-based reasoning to learn representations between source and target nodes on the basis of relations along the path. We introduce key design choices in BioPathNet to adapt and extend the NBFNet algorithm for LP tasks on biomedical KGs, which are inherently ‘noisy’ due to false positives from experiments, data sparsity, redundant edges and relations, and knowledge bias (for example, well-studied disease genes being overrepresented and highly connected). A crucial enhancement is the incorporation of a background regulatory graph (BRG) for message passing, which not only improves BioPathNet’s performance but also enhances its scalability. This makes biomedical LP feasible for tasks such as drug repurposing, where running NBFNet out of the box is computationally prohibitive. In addition, we implement a biological entity-type-aware negative sampling scheme, which can be combined with adaptive sampling based on local graph density. This improves decision boundary accuracy across tasks and further boosts BioPathNet’s performance. We evaluated BioPathNet on diverse biomedical KGs with varying network properties, relation types and node types, demonstrating its effectiveness in LP across multiple tasks, including gene function prediction, zero-shot disease–drug target interaction prediction, SL gene pair identification and long non-coding RNA (lncRNA)–gene regulatory inference. We show that the BRG not only improves BioPathNet’s scalability over NBFNet but also enhances performance across all tasks. In addition, the node-type-aware (NTA) negative sampling scheme improves decision boundary accuracy and further boosts performance in most LP tasks.

BioPathNet was benchmarked against the original NBFNet, three KGE baselines and seven state-of-the-art methods, including task-agnostic models for heterogeneous graphs and long-range relationship learning, as well as LP task-specific approaches. We show that BioPathNet consistently outperforms these methods across tasks or, in a few cases, achieves comparable performance.

Finally, we showcase BioPathNet’s ability to naturally interpret predictions through biological pathways and uncover biological knowledge. We illustrate this with two examples in disease–drug indication prediction, focusing on acute lymphoblastic leukaemia (ALL) and Alzheimer’s disease (AD).

Results

BioPathNet: path embedding as alternative to node embedding for graph completion on biomedical KGs

A KG is a heterogeneous directed graph comprising various types of entity (nodes) connected by relationships (edges). For instance, nodes might represent diseases, genes and drugs, and relationships such as ‘indication for’ or ‘involved in’. KGs are typically structured as triplets (head node, relationship, tail node), such as (drug A, indication, disease B) or (gene C, involved in, disease D). The task of KG completion involves estimating the missing components of these triplets. For example, one might predict the tail entities given a head entity and a specific relationship, such as predicting drugs as indications for diseases, on the basis of existing triplets (that is, existing knowledge). KG completion methods can be broadly categorized into embedding-based and path-based approaches (Fig. 1a). Embedding-based approaches encode entities in a KG into lower-dimensional spaces, preserving graph structure by minimizing distances or maximizing similarities between head, relation and tail embeddings.

Fig. 1: Link prediction in biological KGs.

a, Node embedding vs path-representation learning. b, Illustration of how the generalized Bellman–Ford algorithm operates in NBFNet and BioPathNet to solve the shortest path problem between a head (h) and tail (t) entity through specific relationships (r). The method leverages message-passing GNNs to learn path representations, while a subsequent multilayer perceptron differentiates between positive and negative relationships (more details in Methods). c, Illustration of BioPathNet’s first design choice: integrating a BRG to enrich gene connections, improving message passing and information flow beyond supervised training edges. d,e, Illustration of prediction paths between head nodes (blue) and tail nodes (orange) in two scenarios: without BRG (d) and with BRG connections (e). f, Illustration of the NTA negative sampling procedure, where negative triplets are sampled only if their node types match the positive triplets, compared to a non-type-aware approach, which samples negatives uniformly without considering tail node type. g, Illustration of the SANS procedure, where negative triplets are selected on the basis of both the node type of positive triplets and the local graph density surrounding each positive triplet.

A path embedding-based LP approach, instead, captures the structural information of KGs by learning representations for pairs of nodes (instead of single nodes) through paths. Intuitively, path embeddings capture logical relationships between head and tail nodes in a graph. Instead of explicitly enumerating rules, as in traditional logic-based LP, they represent paths as high-dimensional vectors and leverage neural networks for a nonlinear learning process. NBFNet56 learns node pair representations by parameterizing them as the generalized sum of path representations, with each path representation as the generalized product of edge representations along the path (Figs. 1b and 7). This path formulation can be efficiently solved using the generalized Bellman–Ford algorithm based on dynamic programming. Moreover, the efficiency is further enhanced by learning the operators of the generalized Bellman–Ford algorithm with a message-passing graph neural network (see Methods).

BioPathNet extends the NBFNet framework to address the unique challenges of biomedical KGs, which are large, incomplete, prone to noisy relations due to experimental errors or false positive assignments, and enriched with diverse yet often redundant information. First, on top of a primary KG (G1), for which we want to learn and generate predictions, it introduces a secondary KG, the BRG or G2. The BRG serves the goal of improving entity connectivity during message passing (Fig. 1c–e) and thereby enriching the representation of triplets through the incorporation of additional paths and relationships, without affecting loss computation, guaranteeing a scalable solution also with very large KGs. In fact, unlike NBFNet, which targets general graph completion, BioPathNet optimizes the model’s loss only on the links of interest, that is, training edges in G1 (and corresponding negative triplets). Predictions can be made without or with leveraging a BRG during the message-passing step (see Methods) BRG (Fig. 1c). For example, as illustrated in Fig. 1d, when predicting the missing link between a head node and a tail node, messages can be passed between type 1 and type 2 nodes, resulting in a certain prediction path (Fig. 1d). Alternatively, as illustrated in Fig. 1e, a BRG can be integrated, and predictions are made going through the relation edges involving node types 2 and 3. The integration of a BRG in BioPathNet enhances both prediction accuracy and interpretability, demonstrated in the following sections, by enriching the representation of the primary task. This enables the exploration of subnetworks, such as interaction partners around specific node pairs, facilitating the derivation of mechanistic hypotheses from the broader biological context. BioPathNet also introduces a stricter NTA negative sampling strategy (Fig. 1f), where negative triplets are drawn with preserved tail node types, allowing not only to improve the model’s decision boundary but also ensuring biologically meaningful sampling. In addition, to improve decision boundary, we integrate the SANS strategy from ref. 57 into our NTA sampling (Fig. 1g), accounting for local graph structure and density instead of uniformly sampling negative triplets.

BioPathNet accurately predicts links across different biomedical tasks

Overview of the different LP tasks

We evaluate BioPathNet on biomedical LP across four distinct tasks, framing the problem as learning the distribution of logical rules from a KG, and use the learned rules to perform inference, that is, predict new links on the same graph (transductive setting) or on a new graph (inductive setting). We evaluate BioPathNet across diverse biomedical tasks, testing its adaptability to KGs of varying size, complexity, sparsity, noise and bias.

The first LP task we evaluate with BioPathNet is ‘gene function prediction’, which consists of assigning biological terms, such as molecular pathways, to specific genes (Fig. 2a). We use as G1, a KG linking genes to pathway terms from KEGG via the ‘function of’ relation, sourced from ConsensusPathDB58 and consisting of 32,000 edges (Supplementary Table 1). G1 is then split into training, validation and testing edges. During training, training triplets are used for message passing (that is, referred to as training graph from G1), while removing the specific triplets of the batch for learning path representations. This graph is also used during validation and testing, in addition to a potential G2/BRG (see below). In this context, if gene x has functions y and z, and gene w is assigned to function z, then gene w is also likely to have function y, and we can think of BioPathNet as learning logical rules such as: ‘function_of(y, x) ∧ function_of(z, x) ∧ function_of(z, w) → function_of(y, w)’ (Fig. 2a). Introducing a BRG for message passing (G2) (connected to the original G1, extracted from Pathway Commons59,60,61 encompassing ~1.8 million edges and 13 different relation types between genes and chemicals, including PPIs (Supplementary Tables 1 and 2)) allows BioPathNet to infer gene–function links through alternative paths, enabling the learning of more expressive rules. For example, if gene x has function k, interacts with gene k in a PPI network, and gene z phosphorylates gene w, then gene w is also likely to be assigned to function y. This can be expressed as the following logical rule: ‘function_of(y, x) ∧ interact(x, k) ∧ phosphorylation(k, w) → function(y, w)’ (Fig. 2a).

Fig. 2: Overview of the four LP tasks across different KGs and impact of BioPathNet’s design choices on performance.

a, Schematic of the gene function prediction KG, linking genes, chemicals and cellular pathways. Training triplets (G1: gene–pathway links extracted from KEGG) are in black; BRG edges (extracted from Pathway Commons for message passing) are in red. b, Simplified schematic of the PrimeKG graph for disease–drug indication prediction, connecting, for example, genes, diseases, drugs and phenotypes. G1 edges (disease–drug links) are in black; BRG edges (all other relations) are in red. c, Schematic of synthetic lethality KG for gene pair prediction. G1 edges (SL links) are in black; BRG edges are in red and include additional SL interactions alongside associated nodes (cellular components, molecular functions, biological processes and pathways). d, LncRNA-mediated regulation KG from LncTarD 2.0, with BRG from Pathway Commons. The graph features six node types (lncRNAs, microRNAs, mRNAs, pseudogenes, transcription factors and proteins), with regulatory relationships in black and BRG protein interactions in red. e,f, Ablation study results on the NTA and SANS (e) and on the BRG (f) showing performance changes vs original BioPathNet, that is, increases in green and decreases expressed as percentages in red across five seeds for each experimental task.

The next LP task predicts disease–drug indications by repurposing existing drugs for new conditions using PrimeKG (Supplementary Fig. 1a)53 (see Methods). This task is more challenging than the previous one, as it is conducted in a zero-shot setting—where the target disease has minimal molecular characterization and no existing treatments—following the approach from TxGNN54, a state-of-the-art node- and relation-based embedding method for this task (see Methods). We applied BioPathNet to five disease areas: ‘adrenal gland,’ ‘anaemia,’ ‘cardiovascular,’ ‘cell proliferation,’ and ‘mental health’, splitting the task into corresponding subtasks. BioPathNet was trained on drug–disease edges (for example, ‘indication’, ‘contraindication’) of all other diseases except for the disease area of interest, while the remaining PrimeKG graph served as the BRG (G2) for message passing. A BRG with ~5.7 million edges and 30 relation types for message passing, including drug–drug, protein–protein and disease–disease interactions from PrimeKG, was solely used for message passing (Supplementary Table 3). In this zero-shot prediction context (Fig. 2b), if drug x is indicated for disease y, which shares the same disease protein k with disease w, then drug x is probably an indication for w: ‘indication(x, y) ∧ disease_protein(y, k) ∧ disease_protein(w, k) → indication(x, w)’.

The third LP task focuses on predicting links between synthetic lethality gene pairs (Fig. 2c). SL occurs when the simultaneous mutation of two genes leads to cell death, while the mutation of either gene alone is non-lethal62, and this is of high interest in anti-cancer therapies. Data from SynLethDB-v.2.0, pre-processed by KR4SL55, includes SL pairs curated from biochemical assays, databases, predictions and text mining, with confidence scores prioritizing experimental evidence. The G1 graph contains ~20,000 gene–gene SL edges, and learning logical rules in this setting (without a BRG) corresponds to, for example: ‘synth_leth(x, z) ∧ synth_leth(z, y) → synth_leth(x, y)’, that is, if x and z are an SL pair and z and y are an SL pair, then x and y are likely to be an SL pair. When including a BRG (G2), this adds ~380,000 connections across 48 relation types, including additional SL links and gene–Gene Ontology (GO) term association links under ‘molecular function’, ‘cellular process’ and ‘cellular compartment’ categories on top of the training graph from G1. In this case, BioPathNet can learn more sophisticated rules of the type: ‘synth_leth(x, z) ∧ contributes_to(z, w) ∧ contributes_to(y, w) → synth_leth(x, y)’, where a path through genes z and y and their shared molecular function k under the relation ‘contributes_to’ is chosen to infer an SL relationship between x and y.

The final LP task applies BioPathNet to predicting regulatory interactions between long non-coding RNAs (lncRNAs) and genes (Fig. 2d). LncRNAs—transcripts over 200 nucleotides long that do not encode proteins—play essential roles in gene regulatory networks and are implicated in numerous diseases. However, most remain functionally non-characterized despite genomic consortia63 estimating over 200,000 potential transcripts. Identifying lncRNA targets is a crucial step in understanding their functions, and predictive methods are invaluable in this process to guide experiments. The G1 graph for this task was built from regulatory lncRNA–target (gene–mRNA) interactions sourced from LncTarD (2.0)64, which contains ~6,000 experimentally validated lncRNA–target interactions in human diseases, categorized into seven regulatory mechanisms (including ‘epigenetic’, ‘transcriptional’ and ‘sponge regulation’) (Supplementary Tables 4 and 5). The BRG (G2), sourced from Pathway Commons, is the same G2 used for gene function prediction. Since small chemical molecules lack direct links to lncRNAs, we included only the PPI component, resulting in ~1.1 million edges and 7 relation types. In this context, if lncRNA x regulates gene z, and gene z interacts with gene y in a PPI network (Fig. 2d), it is likely that lncRNA x also regulates gene y, and the corresponding logical rule can be formulated as: ‘lncRNA_regulation(x, z) ∧ interact(z, y) → lncRNA_regulation(x, y)’.

For each of the four LP tasks, BioPathNet was trained, validated and tested on parts of the data of G1 (see Methods). Performance was evaluated using the NTA negative sampling scheme, with and without G2 for message passing. In more detail, for gene function prediction, BioPathNet was trained on gene–function triplets, with negatives generated only from non-occurring gene–function pairs (same node type as positives), excluding gene–gene, function–function or gene–compound interactions to preserve positive node types. For drug repurposing, BioPathNet was trained on drug–disease indication triplets, on each disease split separately as in TxGNN54, to simulate zero-shot conditions, a scenario where no approved drugs for the disease of interest were seen in training. These splits provide challenging yet realistic evaluation scenarios, mimicking zero-shot drug repurposing (see Methods). Negatives were sampled exclusively from non-occurring drug–disease pairs, excluding gene–gene, disease–disease, gene–GO term and other interactions to preserve positive node types. For the SL task, BioPathNet was trained in both transductive and inductive settings on: (1) gene–gene pairs with an SL confidence score above 0.3 (thresholded data, filtered by confidence score, see Methods) and (2) gene–gene SL pairs regardless of confidence score (unthresholded data). This allowed us to evaluate the impact of noisy labels on model performance. Negatives were sampled exclusively from non-occurring gene–gene lethality pairs, avoiding gene–GO term or other cross-type triplets. For this task, we also trained BioPathNet in an inductive setting, predicting SL links between gene pairs on an inference graph distinct from the training graph, following a split based on nodes of Knowledge Representation for Synthetic Lethality (KR4SL). This setup allowed us to assess BioPathNet’s ability to generalize to unseen data. For lncRNA target prediction, BioPathNet was trained on triplets where the head node is an lncRNA and the tail node a target of various types (for example, microRNA, protein-coding gene, another lncRNA or other; see Fig. 2d), linked by a regulatory mechanism (for example, ‘epigenetic regulation’ or ‘sponge interaction’). Negatives were generated as above by corrupting positive lncRNA–target triplets while preserving node types; for instance, if a positive triplet involved an lncRNA and a protein-coding gene, only non-occurring lncRNA–protein interactions were sampled. We report for each task the mean reciprocal rank (MRR), which measures how well the model ranked correct pairs among negatives, and Hits@k (k = 1, 3, 10), which quantifies the proportion of positive triplets among the top-k predictions—both standard metrics for KG–LP problems (Table 1).

BioPathNet’s performance and evaluation of design choices

Table 1 summarizes the results of KG completion across the four tasks, demonstrating that BioPathNet achieves an MRR consistently better than random for all LP tasks (see Methods), indicating strong predictive capability. The model performs particularly well in drug–disease indication prediction for the disease categories of adrenal gland, anaemia and cell proliferation, achieving an MRR above 0.4 in all three cases and Hits@10 scores ranging from 58% to 100%. Performance is slightly lower for other disease categories, such as ‘mental health’, with ‘cardiovascular’ diseases posing the greatest challenge, yielding an MRR of 0.2 and a Hits@10 score of 30%. While the drug repurposing tasks primarily focus on predicting disease–drug indications, BioPathNet also performs well in identifying disease–drug contraindications, with an average MRR of 0.53 (Supplementary Table 6). BioPathNet performs well on the gene function prediction task, achieving an MRR of 0.6 and a Hits@10 score of 72%. For the SL pair prediction task, performance improved substantially when low-confidence SL edges were filtered out from the graph, with MRR increasing from 0.31 to 0.36 and Hits@10 rising from 44% to 50%. Finally, for the lncRNA–target prediction task—one of the most challenging due to graph sparsity and relational uncertainty—BioPathNet achieved an MRR of 0.19, outperforming random predictions. These results underscore BioPathNet’s versatility in handling diverse LP tasks across different biomedical KGs, regardless of task-specific complexities.

To assess the impact of key design choices, such as BioPathNet’s NTA negative sampling or SANS, as well as the use of a BRG, we conducted ablation studies. We compared performance with and without NTA and SANS and evaluated BioPathNet with and without a BRG for message passing alone on all four benchmarked tasks. BioPathNet benefits significantly from incorporating NTA sampling and BRG (Supplementary Tables 6–9). In the gene prediction task, sampling negatives from the same node type as the tail node improves MRR by 4% and Hits@k by 2–6.2% compared to random sampling (Supplementary Table 7). Similar gains are observed for SL and lncRNA–target KGs (Supplementary Tables 8 and

Main

Main

Results

BioPathNet: path embedding as alternative to node embedding for graph completion on biomedical KGs

BioPathNet accurately predicts links across different biomedical tasks

Overview of the different LP tasks

BioPathNet’s performance and evaluation of design choices

Similar Posts