Not All Nodes Are Equal: Rethinking Knowledge Distillation for Graphs

Here’s a question that seems obvious in hindsight: when you’re transferring knowledge from a powerful Graph Neural Network (GNN) to a simpler Multi-Layer Perceptron (MLP), should you treat every node in the graph equally?

Most existing methods say yes. We said no — and it turns out this makes a big difference.

In this post, I’ll walk you through InfGraND, our influence-guided approach to GNN-to-MLP knowledge distillation that was recently published in Transactions on Machine Learning Research (TMLR). We’ll cover the intuition, the key ideas, and why asking “how important is this node to the graph structure?” leads to better results than asking “how confident is the teacher about this node?”

The Deployment Problem: GNNs Are Great, But…

Graph Neural Networks have become the go-to tool for learning on graph-structured data. Social networks, recommendation systems, citation networks, molecular structures — GNNs handle all of these remarkably well. More recently, they’ve become a crucial component in Retrieval-Augmented Generation (RAG) systems, providing external knowledge to Large Language Models.

The secret sauce? Message passing. At each layer, every node aggregates information from its neighbors:

\[h_i^{(l)} = \text{UPDATE}^{(l)}\left(h_i^{(l-1)}, \text{AGGREGATE}^{(l)}\left(\{h_j^{(l-1)} : v_j \in \mathcal{N}(v_i)\}\right)\right)\]

This is powerful — nodes learn representations that capture both their own features and their structural context. But here’s the catch: this recursive neighborhood aggregation is computationally expensive. For every prediction, you need to fetch neighbors, aggregate their features, and propagate information through multiple layers.
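To make the cost concrete, here is a minimal sketch of a single mean-aggregation message-passing layer in PyTorch. It is illustrative only: the tensor names (`x`, `adj`, `weight`) and the dense adjacency are my simplifications, not code from the paper.

```python
import torch

def message_passing_layer(x, adj, weight):
    """One simplified message-passing step with mean aggregation.

    x:      [N, d_in]     node features from the previous layer
    adj:    [N, N]        adjacency matrix (1 where an edge exists)
    weight: [d_in, d_out] learnable transformation (part of UPDATE)
    """
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # node degrees
    agg = (adj @ x) / deg                            # AGGREGATE: mean over neighbors
    return torch.relu((x + agg) @ weight)            # UPDATE: combine self and neighbors

# Every layer (and every query at inference time) pays for the adj @ x product,
# which is exactly the overhead the rest of this post is trying to avoid.
```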

In production environments where latency matters — think real-time fraud detection or live recommendation systems — this overhead becomes a serious bottleneck.

Figure 1: GNN inference requires recursive neighborhood aggregation, while MLP inference operates directly on node features without graph operations.

The Simple Alternative: MLPs

Multi-Layer Perceptrons don’t have this problem. They take node features as input and produce predictions directly — no neighbor lookups, no message passing, just straightforward matrix multiplications. They’re fast and easy to deploy.

The downside? MLPs completely ignore graph structure. They treat each node as if it exists in isolation, which means they miss all the rich relational information that makes graphs useful in the first place.

Here’s the performance gap in practice: on the Cora citation network, a vanilla MLP achieves around 58% accuracy, while a GCN reaches about 82%. That’s a massive difference.

Bridging the Gap: Knowledge Distillation

Knowledge distillation offers an elegant solution. The idea is simple:

  1. Train a powerful GNN teacher on the graph
  2. Use the teacher’s predictions (soft labels) to train a lightweight MLP student
  3. Deploy only the MLP at inference time

The student MLP learns to mimic the teacher’s behavior, effectively inheriting some of the structural knowledge without needing to perform message passing at inference time. This approach, pioneered by GLNN (Graph-Less Neural Networks), shows that distilled MLPs can significantly close the gap with their GNN teachers.
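In code, the uniform version of this recipe boils down to a weighted sum of a supervised term and a soft-label term. The sketch below is a rough, GLNN-style illustration with placeholder names (`teacher_logits` is assumed to be precomputed offline), not the authors' implementation.

```python
import torch.nn.functional as F

def uniform_distillation_loss(student_logits, teacher_logits, labels,
                              labeled_mask, lam=0.5, tau=1.0):
    """Uniform GNN-to-MLP distillation: every node contributes equally."""
    # Supervised cross-entropy on the labeled nodes
    ce = F.cross_entropy(student_logits[labeled_mask], labels[labeled_mask])
    # Soft-label KL divergence on all nodes, using the teacher's precomputed logits
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    return lam * ce + (1 - lam) * kl
```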

But here’s where it gets interesting: how exactly should we transfer this knowledge?

The Question Everyone Was Asking (And Why It’s Wrong)

Most knowledge distillation methods treat all nodes uniformly — every node contributes equally to the training loss. Some recent work recognized this might be suboptimal and introduced non-uniform approaches. Methods like KRD and HGMD use prediction entropy to discriminate between nodes: nodes where the teacher is less confident get more attention during training.

The reasoning seems sound: uncertain predictions are “harder” samples, so we should focus on them.

But think about this for a moment. Entropy measures how confident the teacher GNN is about a node’s label. It doesn’t tell you anything about the node’s role in the graph structure. A node could have high entropy simply because its features are ambiguous, not because it’s structurally important.
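For concreteness, entropy-based weighting amounts to something like the snippet below (a rough sketch of the general idea, not the exact KRD or HGMD formulation):

```python
import torch

def entropy_weights(teacher_logits):
    """Per-node prediction entropy of the teacher: high means 'uncertain node'."""
    p = torch.softmax(teacher_logits, dim=-1)
    entropy = -(p * torch.log(p.clamp(min=1e-12))).sum(dim=-1)
    return entropy / entropy.max()  # in [0, 1]; note: purely a function of predictions
```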

This led us to ask a different question: “How influential is this node within the structure of the graph?”

The Pebble in the Pond

Here’s an intuition that helped us think about node influence:

Imagine dropping pebbles into a pond. Some pebbles create ripples that spread far across the water; others barely disturb the surface. The size and reach of the ripples depend on where you drop the pebble and how the water flows.

Graphs work similarly. When you perturb a node’s features, that perturbation propagates through message passing to affect other nodes’ representations. Some nodes, by virtue of their position in the graph, have perturbations that ripple far and wide. Others have more localized effects.

We call this “influence” — and it’s fundamentally a structural property, not a prediction confidence property.

Figure 2: The pebble-in-pond analogy. High-influence nodes create “ripples” that spread through the graph, affecting many other nodes’ representations. Low-influence nodes have more localized effects.

Measuring Node Influence

How do we actually quantify this? Formally, we define the influence of a source node $v_i$ on a target node $v_j$ after $k$ message-passing iterations as:

\[\hat{I}_{(j \leftarrow i)}(v_j, v_i, k) = \left\| \mathbb{E}\left[\frac{\partial \mathbf{x}_j^{(k)}}{\partial \mathbf{x}_i^{(0)}}\right] \right\|_1\]

In plain terms: we’re measuring how much the target node’s representation changes when we perturb the source node’s initial features. The Jacobian captures this sensitivity, and the L1-norm gives us a scalar measure.

Computing this exactly is expensive, but we can approximate it efficiently. Following the insight from Simplified Graph Convolutional Networks (SGC), we remove non-linearities and weight matrices to focus on pure topological propagation:

\[\mathbf{X}^{(k)} = \tilde{\mathbf{A}}\mathbf{X}^{(k-1)}\]

where $\tilde{\mathbf{A}}$ is the normalized adjacency matrix. After $k$ propagation steps, we use cosine similarity between the original features $\mathbf{x}_i^{(0)}$ and the propagated features $\mathbf{x}_j^{(k)}$ as our influence indicator.

The beauty of this approach? It’s parameter-free and computed only once as a preprocessing step — no overhead during training or inference.

To get a single importance score per node, we aggregate pairwise influences into a Global Influence Score:

\[I_g(v_i) = \frac{\sum_{j \in V} I_{(j \leftarrow i)}(v_j, v_i, k)}{\max_{l \in V} \sum_{j \in V} I_{(j \leftarrow l)}(v_j, v_l, k)}\]

This tells us how much each node influences the entire graph, normalized to lie between 0 and 1.
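Here is a sketch of how these scores can be computed as a preprocessing pass, based on my reading of the description above. The names (`adj_norm`, `global_influence_scores`) and the default $k=2$ are assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def global_influence_scores(x, adj_norm, k=2):
    """Parameter-free influence scores, computed once before training.

    x:        [N, d] raw node features X^(0)
    adj_norm: [N, N] normalized adjacency matrix (A~), dense or sparse
    """
    # k steps of pure topological propagation: X^(k) = A~ X^(k-1)  (SGC-style)
    xk = x
    for _ in range(k):
        xk = adj_norm @ xk

    # Pairwise influence proxy: cosine similarity between the source node's
    # original features x_i^(0) and the target node's propagated features x_j^(k).
    src = F.normalize(x, dim=-1)
    tgt = F.normalize(xk, dim=-1)
    pairwise = tgt @ src.t()          # pairwise[j, i] ~ I_(j <- i)

    # Global Influence Score: total influence of each source node, normalized to [0, 1].
    raw = pairwise.sum(dim=0)         # sum over targets j for every source i
    return raw / raw.max()
```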

Does Influence Actually Matter?

Before building a whole framework around influence, we wanted to verify our hypothesis: does training on high-influence nodes actually lead to better models?

We ran a simple experiment. For each dataset, we split the training nodes into two groups: the top 25% by influence score (high-influence) and the bottom 25% (low-influence). We then trained separate GNNs on each subset and evaluated both on the same held-out test set.

Figure 3: Test accuracy of GNNs trained on high-influence vs. low-influence nodes. Models trained on high-influence nodes consistently outperform those trained on low-influence nodes across all datasets.

The results were consistent across all datasets and GNN architectures: models trained on high-influence nodes significantly outperformed those trained on low-influence nodes. This validated our core intuition — influence captures something meaningful about which nodes matter most for learning.

InfGraND: Putting It All Together

With influence validated as a useful signal, we built InfGraND (Influence-Guided Graph Knowledge Distillation) around two main components:

1. Influence-Guided Distillation Loss

Instead of treating all nodes equally, we weight the distillation loss by influence scores. For each node $v_i$, we encourage the student’s prediction to match the teacher’s predictions for its neighbors $v_j$, weighted by how influential those neighbors are:

\[\mathcal{L}_d = \sum_{i \in V} \sum_{j \in \mathcal{N}(v_i)} (\gamma_1 + \gamma_2 \cdot I_g(v_j)) \cdot \frac{1}{|\mathcal{N}(v_i)|} \cdot D_{KL}(\sigma(\mathbf{h}_i^s / \tau) \| \sigma(\mathbf{h}_j^t / \tau))\]

Let’s break this down:

  • $\gamma_1$ provides a baseline gradient from all neighbors (we don’t completely ignore low-influence nodes)
  • $\gamma_2 \cdot I_g(v_j)$ amplifies the signal from high-influence neighbors
  • $D_{KL}$ is the KL divergence between student and teacher predictions
  • $\tau$ is the distillation temperature

The key insight: high-influence neighbors provide stronger supervision signals. The student learns to prioritize getting these nodes right.
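A sketch of what this loss could look like over an edge list, assuming the student and teacher logits plus the precomputed `influence` scores are already available. The variable names are mine, and this is an approximation of the equation above rather than the released code.

```python
import torch
import torch.nn.functional as F

def influence_weighted_kd(student_logits, teacher_logits, edge_src, edge_dst,
                          influence, gamma1=1.0, gamma2=1.0, tau=1.0):
    """Distillation term where high-influence neighbors supervise more strongly.

    edge_src[e] = i and edge_dst[e] = j encode 'v_j is a neighbor of v_i'.
    influence: [N] global influence scores I_g in [0, 1].
    """
    p_s = F.softmax(student_logits[edge_src] / tau, dim=-1)          # student at v_i
    log_p_s = torch.log(p_s.clamp(min=1e-12))
    log_p_t = F.log_softmax(teacher_logits[edge_dst] / tau, dim=-1)  # teacher at v_j

    kl = (p_s * (log_p_s - log_p_t)).sum(dim=-1)   # D_KL(student_i || teacher_j), per edge

    w = gamma1 + gamma2 * influence[edge_dst]      # baseline + influence amplification

    # Normalize each edge by the source node's degree, i.e. the 1/|N(v_i)| factor
    deg = torch.bincount(edge_src, minlength=student_logits.size(0)).clamp(min=1)
    return (w * kl / deg[edge_src].float()).sum()
```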

Figure 4: Influence-weighted knowledge distillation. The student node receives stronger supervision signals from high-influence neighbors (thick edges) compared to low-influence neighbors (thin edges).

2. One-Time Feature Propagation

To give the MLP some structural awareness without adding inference overhead, we pre-compute multi-hop neighborhood features:

\[\tilde{\mathbf{X}} = \text{POOL}\left(\{\mathbf{X}^{(p)}\}_{p=0}^{P}\right)\]

We propagate features through the graph structure for $P$ hops, then average-pool across hops. This enriched feature matrix $\tilde{\mathbf{X}}$ becomes the input to our MLP.

The critical point: this is computed once before training and stored. At inference time, the MLP just uses these pre-computed features — no graph operations needed.
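A minimal sketch of that preprocessing step, assuming mean pooling over hops and a normalized adjacency `adj_norm` (both the naming and the defaults are mine):

```python
import torch

@torch.no_grad()
def precompute_hop_features(x, adj_norm, num_hops=2):
    """One-time preprocessing: propagate features P hops, then average-pool.

    The returned matrix is the only thing the MLP sees; the graph itself is
    never needed again at training or inference time.
    """
    hops = [x]                                    # X^(0)
    xp = x
    for _ in range(num_hops):
        xp = adj_norm @ xp                        # X^(p) = A~ X^(p-1)
        hops.append(xp)
    return torch.stack(hops, dim=0).mean(dim=0)   # POOL = mean over the P+1 hop matrices
```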

The Complete Objective

The student MLP is trained with a combination of supervised loss (on labeled nodes) and distillation loss (on all nodes):

\[\mathcal{L}_t = \lambda \mathcal{L}_s + (1 - \lambda) \mathcal{L}_d\]

Both losses incorporate influence weighting, ensuring that structurally important nodes guide both the ground-truth learning and the knowledge transfer from the teacher.
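Putting the pieces together, one training step of the student might look like the sketch below. It reuses `influence_weighted_kd` from the earlier snippet; the simple reweighted cross-entropy for the supervised term is my guess at “both losses incorporate influence weighting,” not necessarily the paper’s exact form.

```python
import torch.nn.functional as F

def training_step(mlp, x_tilde, teacher_logits, labels, labeled_mask,
                  edge_src, edge_dst, influence, optimizer,
                  lam=0.5, gamma1=1.0, gamma2=1.0, tau=1.0):
    """One optimization step of the student MLP on the pre-propagated features."""
    optimizer.zero_grad()
    student_logits = mlp(x_tilde)

    # Supervised term on labeled nodes, reweighted by influence (assumed form)
    w = gamma1 + gamma2 * influence[labeled_mask]
    ce = F.cross_entropy(student_logits[labeled_mask], labels[labeled_mask],
                         reduction="none")
    loss_s = (w * ce).mean()

    # Distillation term on all nodes (see the earlier sketch)
    loss_d = influence_weighted_kd(student_logits, teacher_logits,
                                   edge_src, edge_dst, influence,
                                   gamma1, gamma2, tau)

    loss = lam * loss_s + (1 - lam) * loss_d
    loss.backward()
    optimizer.step()
    return loss.item()
```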

Results: Does It Work?

We evaluated InfGraND across seven benchmark datasets in both transductive (train and test on the same graph) and inductive (test nodes are held out and unseen during training) settings.

Key Findings

InfGraND consistently outperforms baselines. Across different teacher architectures (GCN, GAT, GraphSAGE) and datasets, InfGraND achieves the highest accuracy in most configurations.

The distilled MLP often beats its teacher. This might seem counterintuitive, but it happens regularly. On Amazon-Photo with a GCN teacher, the teacher achieves 90.7% accuracy while the InfGraND student reaches 94.2%. The distillation process, guided by influence, helps the student generalize better than the teacher in many cases.

Massive improvements over vanilla MLPs. Compared to MLPs trained without distillation, InfGraND improves accuracy by an average of 12.6% in transductive settings and 9.3% in inductive settings.

Speed without sacrifice. Here’s where it gets practical:

Figure 5: Trade-off between accuracy and inference time. InfGraND achieves higher accuracy than GNN variants while being 6-14x faster.

InfGraND achieves 4.3% higher accuracy than GraphSAGE while being 8.56x faster. Compared to GAT, it’s 4.8% more accurate and 13.89x faster. You get better results with dramatically lower latency.

Performance in Label-Scarce Settings

Real-world graphs often have very few labeled nodes. We tested InfGraND with only 10%, 20%, and 40% of the original training labels.

Figure 6: Performance comparison under label-scarce settings. InfGraND consistently outperforms GLNN across different label rates.

InfGraND outperforms GLNN by an average of 4.17% across these label-scarce scenarios. The influence-guided objective helps the model focus on the most informative nodes when supervision is limited.

Ablation: What Contributes What?

We also isolated the contributions of each component:

| Component | Effect |
| --- | --- |
| Influence-guided loss only | Improves over GLNN and vanilla MLP consistently |
| Feature propagation only | Large gains in inductive settings (+10-21% over vanilla MLP) |
| Full InfGraND (both) | Best overall performance |

Both components contribute, and they’re complementary. The influence weighting helps the model learn from the right nodes, while feature propagation gives the MLP access to structural information.

What We Learned

A few takeaways from this work:

The right question matters. Shifting from “how confident is the teacher?” to “how influential is this node?” led to consistent improvements. Sometimes reframing the problem is more valuable than building more complex solutions.

Structure can be baked in without runtime cost. The one-time feature propagation gives MLPs structural awareness at zero inference cost. This is inspired by industrial practices like embedding lookup tables — precompute what you can.

Simpler students can outperform complex teachers. With the right training signal, MLPs can exceed GNN performance while being much faster. This has real implications for deploying graph-based models in production.

Limitations and Future Directions

InfGraND currently focuses on homophilic graphs — graphs where connected nodes tend to have similar labels. Extending to heterophilic graphs, where neighbors often have different labels, is an open direction.

We’re also interested in combining influence-based discrimination with entropy-based approaches. They capture different aspects of node importance, and a hybrid method might get the best of both worlds.

Finally, applying these ideas to dynamic graphs, where structure evolves over time, presents interesting challenges for maintaining and updating influence scores efficiently.

Try It Yourself

The code is available on GitHub: https://github.com/AmEskandari/InfGraND

The paper is published in TMLR and available on OpenReview: https://openreview.net/forum?id=lfzHR3YwlD

If you’re working on graph-based applications where inference speed matters, give InfGraND a try. And if you have questions or ideas for extensions, feel free to reach out.

References

  1. Eskandari, A., Anand, A., Rashno, E., & Zulkernine, F. (2026). InfGraND: An Influence-Guided GNN-to-MLP Knowledge Distillation. Transactions on Machine Learning Research.

  2. Zhang, S., Liu, Y., Sun, Y., & Shah, N. (2022). Graph-less Neural Networks: Teaching Old MLPs New Tricks via Distillation. ICLR.

  3. Wu, L., Lin, H., Huang, Y., & Li, S. Z. (2023). Quantifying the Knowledge in GNNs for Reliable Distillation into MLPs. ICML.

  4. Kipf, T. N., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR.

  5. Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive Representation Learning on Large Graphs. NeurIPS.