This year, DeepSeek released a paper that doesn’t introduce a new loss function, a bigger dataset, or another training trick. Instead, it revisits something far more basic and far more fragile:
“How information flows through very deep models.”
At the center of the paper is a quiet but powerful claim: many recent architectural ideas improve performance, but they do so by weakening the very principle that made deep learning work in the first place. DeepSeek’s contribution, Manifold-Constrained Hyper-Connections (mHC), revisits that principle while retaining the benefits of modern architectures.
To understand why this matters, we need to start much earlier: before Transformers, before LLMs, even before scale.
In the sections ahead, we’ll unpack Manifold-Constrained Hyper-Connections (mHC) word by word and build up the full idea from first principles.
Long before large language models and billion-parameter training runs, deep learning faced a simpler challenge: depth itself was hard.
As networks grew deeper, information had to pass through more and more transformations. Each transformation slightly distorted the signal. Over dozens of layers, those small distortions accumulated. Gradients became unstable. Training slowed, then failed.
The central challenge of depth, in other words, was preserving reliable information flow across layers.
Residual connections emerged as a structural solution to this problem.
Instead of forcing every layer to fully rewrite its input, residual networks allowed layers to add to what was already there. Information could pass forward unchanged if needed, with each layer contributing only minor corrections.
This introduced what’s known as an identity path: a guaranteed route through the network along which signals and gradients could flow without interference.
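The difference an identity path makes can be seen in a toy numpy experiment. This is purely illustrative (the layer scales and dimensions are arbitrary, not drawn from any real architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 50, 64

# Toy "layers": small random linear maps standing in for learned transformations.
layers = [0.02 * rng.standard_normal((dim, dim)) for _ in range(depth)]

x = rng.standard_normal(dim)

# Plain stack: each layer fully rewrites its input, x <- W x.
plain = x.copy()
for W in layers:
    plain = W @ plain

# Residual stack: each layer only adds a correction, x <- x + W x.
resid = x.copy()
for W in layers:
    resid = resid + W @ resid

print(np.linalg.norm(plain))  # collapses toward zero after 50 layers
print(np.linalg.norm(resid))  # stays within the input's order of magnitude
```

The plain stack forgets its input entirely, while the identity path keeps the signal alive even though each layer's contribution is tiny.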
Once this idea appeared, the effect was immediate and far-reaching:
Depth stopped being a liability:
Adding layers no longer degraded performance by default; deeper networks could now represent more complex functions without collapsing.
Training stabilized:
Gradients flowed reliably through many layers, reducing sensitivity to initialization, learning rates, and other fragile tuning choices.
Scaling became practical:
Model capacity could grow by stacking layers, rather than relying solely on wider networks or heavier regularization.
Due to these properties, residual connections quickly evolved from a research insight to a default design choice: first in deep vision models and later as a foundational building block of Transformer architectures.
Transformers are often described in terms of attention, but structurally, they are built around residual connections.
A Transformer block contains two major components:
Multi-head attention
Feedforward networks
Every major component is wrapped in a residual connection.
Attention introduces shortcuts that enable information to move flexibly across the model. Residual connections maintain the main highway’s stability, ensuring a steady flow even when the shortcuts are noisy or congested.
Large language models followed the same pattern, repeating this structure hundreds of times.
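The structure can be sketched in a few lines of numpy. This is a minimal single-head, pre-norm variant with toy dimensions; the function names are ours, not from any particular implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Single-head self-attention over a (seq, dim) input.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2  # two-layer MLP with ReLU

def transformer_block(x, p):
    # Both sublayers are wrapped in residual connections:
    # the input always has an unobstructed path to the output.
    x = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    x = x + ffn(layer_norm(x), p["W1"], p["W2"])
    return x

rng = np.random.default_rng(0)
dim, hidden, seq = 8, 16, 5
p = {k: 0.1 * rng.standard_normal(s) for k, s in {
    "Wq": (dim, dim), "Wk": (dim, dim), "Wv": (dim, dim),
    "W1": (dim, hidden), "W2": (hidden, dim)}.items()}

out = transformer_block(rng.standard_normal((seq, dim)), p)
print(out.shape)  # (5, 8)
```

Note that every path from input to output passes through the two `x = x + ...` lines: whatever attention and the FFN do, the identity route is never blocked.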
As models grew, researchers began to notice a tension.
Residual connections are stable, but also rigid:
There is only one residual stream.
Every layer reads from it.
Every layer writes back into it.
All information, regardless of its nature, is channeled through the same system.
This raised a natural question:
What if residual connections themselves could be more expressive?
Hyper-Connections attempted to answer that question by expanding the residual stream.
Instead of a single highway running through the model, Hyper-Connections introduced multiple parallel residual streams. Layers could:
Read from several streams.
Mix the information between them.
Write back in more flexible ways.
Importantly, this increased architectural expressivity did not significantly increase FLOPs. The heavy computation still happened once per layer; the extra flexibility came from lightweight mixing operations.
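One hyper-connected update can be sketched schematically as follows. This is a simplified illustration of the idea, not the paper's exact parameterization; all names and shapes here are ours:

```python
import numpy as np

def hyper_connection_step(streams, layer_fn, read_w, write_w, mix_M):
    """One simplified Hyper-Connections update over n residual streams.

    streams: (n, dim) array of parallel residual streams.
    read_w:  (n,) weights combining streams into the layer input.
    write_w: (n,) weights distributing the layer output back.
    mix_M:   (n, n) matrix mixing the streams with each other.
    """
    x = layer_fn(read_w @ streams)            # heavy computation happens once
    streams = mix_M @ streams                 # lightweight stream-to-stream mixing
    return streams + np.outer(write_w, x)     # write the output back

rng = np.random.default_rng(0)
n, dim = 3, 8
streams = rng.standard_normal((n, dim))

new_streams = hyper_connection_step(
    streams,
    layer_fn=np.tanh,            # stand-in for attention or an FFN
    read_w=np.full(n, 1 / n),    # read an average of the streams
    write_w=np.array([1.0, 0.0, 0.0]),  # write back into stream 0 only
    mix_M=np.eye(n),             # identity mixing, the "safe" default
)
print(new_streams.shape)  # (3, 8)
```

The expensive `layer_fn` runs once per layer regardless of `n`; the extra expressivity lives entirely in the cheap `read_w`, `write_w`, and `mix_M` operations.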
At small and medium scales, this worked well. Models trained. Loss improved. Performance went up.
But something subtle changed.
Hyper-Connections replaced the single, stable highway with a network of interchanges.
Each layer learned to freely mix residual streams. Over a few layers, this flexibility was harmless. Over hundreds of layers, small imbalances began to compound.
Some streams accumulated more and more signal. Others faded. The implicit guarantee of residual learning, that information can always pass through unchanged, no longer held.
The model failed quietly, and only at scale:
Gradients became erratic
Training destabilized mid-run
Loss suddenly spiked
The issue stemmed from the architecture itself rather than the optimization process.
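The compounding effect is easy to see with numbers. Even a 2% per-layer imbalance in how two streams are scaled becomes catastrophic over 200 layers:

```python
import numpy as np

# A mixing matrix that is only slightly imbalanced: it amplifies one
# stream by 2% and dampens the other by 2% at every layer.
M = np.diag([1.02, 0.98])

# Composed over 200 layers, the imbalance compounds exponentially.
M_deep = np.linalg.matrix_power(M, 200)
print(M_deep[0, 0])  # ~52: this stream now dominates
print(M_deep[1, 1])  # ~0.018: this stream has effectively vanished
```

No single layer looks broken; only the composition does. That is why the failure appears quietly, and only at depth.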
This is where DeepSeek’s contribution begins.
The challenge arose from unconstrained mixing between multiple residual streams. Manifold-Constrained Hyper-Connections start by restoring a simple principle:
Residual connections must conserve information:
Mixing is allowed.
Redistribution is allowed.
Amplification is not allowed.
The manifold defines the space in which residual mixing matrices are constrained, ensuring stable signal propagation across layers.
To see why this matters, recall what Hyper-Connections introduced: learnable matrices that mix information across multiple residual streams. These matrices are applied to every layer, and their effects compound as depth grows. When left unconstrained, their repeated multiplication can unpredictably amplify or suppress signals, breaking the identity-mapping behavior that residual networks rely on.
Manifold-Constrained Hyper-Connections (mHC) address this by restricting where those matrices can live.
Residual mixing matrices are constrained to be doubly stochastic: their entries are non-negative, and each row and column sums to one. This restriction defines a specific geometric space that governs how residual streams can mix.
This space forms a geometric object known as the Birkhoff polytope, which is the manifold referenced in DeepSeek’s paper.
The Birkhoff polytope is the space of all doubly stochastic matrices: mixing patterns where all entries are non-negative, and every row and column sums to one. It represents all balanced ways to redistribute information without amplifying or erasing it.
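By Birkhoff's theorem, this polytope is exactly the convex hull of the permutation matrices: every doubly stochastic matrix is a weighted blend of pure stream permutations. A quick numpy check of one direction of that statement:

```python
import numpy as np

# Three 3x3 permutation matrices (vertices of the Birkhoff polytope).
P_id = np.eye(3)                        # keep streams as-is
P_cyc = np.roll(np.eye(3), 1, axis=1)   # cyclically shift the streams
P_swap = np.eye(3)[[1, 0, 2]]           # swap the first two streams

# Any convex combination of permutations is doubly stochastic:
# entries stay non-negative, and every row and column still sums to 1.
M = sum(w * P for w, P in zip([0.5, 0.3, 0.2], [P_id, P_cyc, P_swap]))

print(M.sum(axis=1))  # row sums: all ones
print(M.sum(axis=0))  # column sums: all ones
```

Intuitively, a permutation just reroutes streams without touching their content; a doubly stochastic matrix is a soft blend of such reroutings.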
Consider a system with 3 residual streams. A valid mixing matrix could look like this:

    [ 0.5  0.3  0.2 ]
    [ 0.2  0.5  0.3 ]
    [ 0.3  0.2  0.5 ]

All entries are non-negative.

Each row sums to 1:

Row 1: 0.5 + 0.3 + 0.2 = 1
Row 2: 0.2 + 0.5 + 0.3 = 1
Row 3: 0.3 + 0.2 + 0.5 = 1

Each column also sums to 1:

Column 1: 0.5 + 0.2 + 0.3 = 1
Column 2: 0.3 + 0.5 + 0.2 = 1
Column 3: 0.2 + 0.3 + 0.5 = 1
This matrix neither amplifies nor erases information.
Each output stream is a weighted average of all input streams.
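Both properties, and the resulting conservation of total signal, can be verified directly for this example matrix:

```python
import numpy as np

M = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])

assert (M >= 0).all()                    # no negative weights
assert np.allclose(M.sum(axis=1), 1.0)   # each row sums to 1
assert np.allclose(M.sum(axis=0), 1.0)   # each column sums to 1

# Total "traffic" across the streams is conserved under mixing:
# column sums of 1 mean nothing is created or destroyed.
streams = np.array([4.0, 1.0, 2.0])
mixed = M @ streams
print(streams.sum(), mixed.sum())  # 7.0 7.0
```

The individual streams change (here `mixed` is `[2.7, 1.9, 2.4]`), but their total is exactly preserved.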
In road terms:
Traffic from each highway is redistributed.
No new cars appear.
No cars disappear.
No single road overwhelms the system.
This is exactly the kind of constraint that keeps residual signal flow stable as depth increases.
Each residual update becomes a convex combination of existing streams, rather than an arbitrary linear transformation.
Signal magnitude is preserved rather than amplified or dampened.
Gradient flow remains well-conditioned across layers.
Crucially, these properties are preserved even when residual matrices are multiplied across depth.
Because the set of doubly stochastic matrices is closed under multiplication, the stability guarantee holds not just locally, but across the entire network.
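Both claims, closure under multiplication and a spectral norm pinned at 1, can be checked numerically. The Sinkhorn-style row/column normalization below is just a convenient way to generate random doubly stochastic matrices for the check; it is not necessarily how the paper parameterizes them:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 200

def random_doubly_stochastic(rng, n, iters=50):
    # Sinkhorn-style normalization: alternately rescale rows and
    # columns of a positive matrix until both sum to ~1.
    M = rng.random((n, n)) + 0.1
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)
        M /= M.sum(axis=0, keepdims=True)
    return M

# Compose 200 layers' worth of doubly stochastic mixing matrices.
M_deep = np.eye(n)
for _ in range(depth):
    M_deep = random_doubly_stochastic(rng, n) @ M_deep

print(np.allclose(M_deep.sum(axis=1), 1.0))  # still doubly stochastic
print(np.allclose(M_deep.sum(axis=0), 1.0))
print(round(np.linalg.norm(M_deep, 2), 3))   # spectral norm stays at 1.0
```

Contrast this with the unconstrained case, where the composed mixing can amplify some directions exponentially: here the effective gain of 200 stacked mixing steps is still exactly 1.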
In practical terms, projecting residual connections onto this manifold means the model is free to redistribute information, but never allowed to distort it in ways that accumulate into instability. Expressivity is retained, but the fundamental conservation behavior of residual learning is restored.
In other words, the manifold acts as a structural guardrail, limiting the interaction of residual streams while still allowing them to mix richly and adaptively.
DeepSeek evaluates Manifold-Constrained Hyper-Connections through three complementary experiments, each designed to answer a different question:
Is training actually more stable?
Does this stability translate into better performance?
Do the benefits persist as model size and compute increase?
Together, these experiments test not just whether mHC works, but how and why.
The primary results center on a 27B-parameter model, trained on a dataset scaled proportionally to model size. This configuration serves as the main system-level evaluation, where training stability, convergence behavior, and downstream performance can be meaningfully assessed at scale.
The first experiment focuses on the most immediate concern: training stability.
Using a 27B-parameter model, DeepSeek compares:
A standard Transformer baseline
Hyper-Connections (HC)
Manifold-Constrained Hyper-Connections (mHC)
At 27B scale, unconstrained Hyper-Connections show noisy loss curves and repeated gradient spikes, indicating unstable training. Manifold-Constrained Hyper-Connections maintain smooth loss improvement and controlled gradient norms, closely matching baseline stability while achieving better convergence.
Stability alone is not enough. The second experiment asks whether mHC’s cleaner training dynamics translate into better models.
The paper's results table evaluates the 27B models across eight diverse downstream benchmarks, including reasoning, commonsense, and knowledge-heavy tasks, in both zero-shot and few-shot settings.
Across the board:
mHC consistently outperforms the baseline
mHC surpasses HC on the majority of benchmarks
The final set of experiments examines whether the benefits of mHC persist as scale increases, through two scaling analyses:
Compute scaling across 3B, 9B, and 27B parameter models
Token scaling within a single training run
In both cases, the performance advantage of mHC over the baseline is maintained as scale increases, with only marginal attenuation at higher compute budgets. This is a critical result: many architectural changes show gains at small scale but degrade as models grow; mHC does not.
Manifold-Constrained Hyper-Connections revisit a lesson deep learning has learned before: scale only works when information can move reliably through depth.
Residual connections made deep networks trainable by preserving a stable identity path. Hyper-Connections demonstrated that widening this path could enhance expressivity, but also revealed how easily stability can be compromised when residual mixing is left unconstrained. At small scales, this failure is easily overlooked. At large scales, it becomes unavoidable.
By constraining residual mixing to conserve information, mHC restores the core guarantee that residual learning depends on, while retaining the benefits of richer connectivity. The experimental results show that this is not just a theoretical fix: training becomes stable, performance improves across downstream tasks, and the gains persist as models scale.
More broadly, this work suggests a direction for future architectures. Progress does not always come from adding freedom. Sometimes it comes from adding the right constraints, constraints that encode what we already know about how deep networks survive scale.
As models continue to grow, architecture will matter as much as optimization. Manifold-Constrained Hyper-Connections offer a reminder that stability is not a byproduct of scale; it is a prerequisite for it.