
Graph Neural Networks in System Design

Explore how to design machine learning systems using graph neural networks (GNNs) to leverage multi-hop relational data. Learn to select appropriate GNN architectures like GraphSAGE and GAT, manage scalability challenges such as neighborhood explosion and inference latency, and apply practical solutions from production systems. This lesson equips you to make informed architecture and serving pattern decisions in ML system design interviews involving graph data.

The previous lesson explored architectures for unstructured and multimodal inputs, such as text and images. Many production ML systems also operate on data where relationships between entities are part of the signal. In these systems, entities are nodes, and edges capture how those entities are connected. Common examples include social networks, transaction graphs, and knowledge bases. When an interviewer asks you to design a connection recommendation system for a professional network, users, profiles, companies, connections, and interactions can be modeled as a heterogeneous graph with multiple node and edge types. The key design question is whether graph-based modeling improves recommendation quality enough to justify the extra training, serving, and operational complexity.

This lesson answers that question. Graph neural networks become the right architectural choice when the prediction target depends on multi-hop relational context that tabular features alone cannot capture. Multi-hop relational context is information gathered by traversing two or more edges in a graph, capturing indirect relationships between entities that are not directly connected. A user’s direct friends tell you something, but the pattern of connections two or three hops away can reveal far more about affinity, risk, or missing knowledge.
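To make "multi-hop" concrete, here is a minimal sketch of collecting every node within k hops of a starting node using breadth-first search. The graph, the node names, and the helper `k_hop_neighbors` are hypothetical and exist only for illustration; a production system would run this kind of traversal against a graph store or a sampled subgraph rather than an in-memory dictionary.

```python
from collections import deque


def k_hop_neighbors(adj, start, k):
    """Return {node: hop_distance} for every node within k hops of start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand past the k-hop boundary
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    del dist[start]  # the start node is not its own neighbor
    return dist


# Hypothetical toy graph: an undirected chain stored as adjacency lists.
adj = {
    "ana": ["ben"],
    "ben": ["ana", "cat"],
    "cat": ["ben", "dev"],
    "dev": ["cat"],
}

one_hop = k_hop_neighbors(adj, "ana", 1)  # direct connections only
two_hop = k_hop_neighbors(adj, "ana", 2)  # adds indirect connections
```

In this toy graph, the one-hop view of "ana" contains only "ben", while the two-hop view also reaches "cat". That second-hop information is the kind of context a per-entity feature table does not carry on its own.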

Three canonical production use cases consistently justify GNNs in system design discussions:

  • Social network recommendation: The system predicts new links or ranks candidates by leveraging neighborhood context, surfacing people you are likely to know based on shared second-degree connections and community structure.

  • Fraud detection: Anomalous subgraph patterns in transaction networks, such as rings of accounts rapidly passing funds, expose fraudulent behavior that per-transaction features miss entirely.

  • Knowledge graph completion: The system predicts missing relations between entities, enabling downstream applications like search enrichment and question answering.
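As a rough illustration of the social recommendation case, the sketch below ranks second-degree candidates by how many connections they share with a user. This is a simple common-neighbors heuristic, not a GNN. It is the kind of non-learned baseline a graph model would need to beat before its extra cost is justified. The function `rank_second_degree` and the toy graph are hypothetical.

```python
from collections import Counter


def rank_second_degree(adj, user):
    """Rank non-connected users by the number of shared direct connections."""
    direct = set(adj.get(user, ()))
    counts = Counter()
    for friend in direct:
        for candidate in adj.get(friend, ()):
            # Skip the user and anyone they are already connected to.
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    # Most shared connections first; ties broken alphabetically for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy graph stored as undirected adjacency lists.
social = {
    "u": ["a", "b"],
    "a": ["u", "x", "y"],
    "b": ["u", "x"],
    "x": ["a", "b"],
    "y": ["a"],
}

candidates = rank_second_degree(social, "u")
```

Here "x" ranks above "y" because it shares two connections with "u" rather than one. A GNN would instead learn how to weight such neighborhood signals together with node features.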

Using a GNN is a system design trade-off. Graph structure can help capture relationships, neighborhoods, and multi-hop dependencies, but it also adds scalability costs around graph construction, neighbor sampling, training, and serving. The rest of this lesson explains those costs.

The following question tests whether you can distinguish when a GNN is genuinely warranted from when it adds unnecessary overhead:

Lesson Quiz

1. You are designing a product recommendation system. User purchase history is available as tabular features, and there is also a social graph of user connections. The interviewer asks whether you need a GNN. What is the correct approach?

A. Always use a GNN when a graph exists

B. Use a GNN only if multi-hop relational context demonstrably improves prediction quality beyond tabular baselines

C. Never use GNNs because they are too expensive

D. Use a GNN only for cold-start users



With the “when” established, the next step is choosing a concrete GNN architecture that fits the system’s requirements.

GraphSAGE and GAT as architecture choices

Every GNN operates through a message-passing paradigm: a computation pattern where each node in a graph iteratively aggregates feature information from its neighbors to update its own representation vector. At each layer, a node collects features from its neighbors, aggregates them, and updates its own embedding. The architecture choice determines how that aggregation happens and what trade-offs the system inherits.
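The collect, aggregate, and update steps can be sketched in plain Python. This is a deliberately simplified illustration that assumes mean aggregation and a fixed, unweighted average for the update step. A real GNN layer would apply learned weight matrices and a nonlinearity, and the function name `mean_aggregate_layer` is hypothetical.

```python
def mean_aggregate_layer(adj, features):
    """One round of message passing with mean aggregation and no learned weights."""
    updated = {}
    for node, own in features.items():
        nbrs = adj.get(node, [])
        dim = len(own)
        if nbrs:
            # Collect and aggregate: element-wise mean of neighbor features.
            agg = [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]
        else:
            agg = [0.0] * dim
        # Update: plain average of the node's own features and the aggregate.
        # A trained layer would use learned weights and a nonlinearity instead.
        updated[node] = [(s + a) / 2 for s, a in zip(own, agg)]
    return updated


# Hypothetical toy graph: a path a - b - c, with a signal only on node "a".
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h0 = {"a": [1.0], "b": [0.0], "c": [0.0]}

h1 = mean_aggregate_layer(path, h0)  # one layer: one-hop context
h2 = mean_aggregate_layer(path, h1)  # two layers: two-hop context
```

After one layer, node "c" still holds `[0.0]` because "a" is two hops away. After the second layer it holds `[0.125]`, showing that stacking k layers is what exposes k-hop context to each node.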

GraphSAGE as the production default

GraphSAGE performs inductive learning by sampling a fixed-size neighborhood and aggregating it with a simple function like mean pooling or an LSTM. Because it learns an ...