
Making sense of Kolmogorov-Arnold Networks (KANs)

12 min read
Jan 06, 2025
Contents
What is a Kolmogorov-Arnold network?
KAN as a function approximation
In what sense are the edge activation functions learnable?
Tweaks to activation functions to fix the vanishing gradient
Initializations
Interacting with the network
Experimental evidence: KAN vs. MLP
Regularization of KANs
In what way are KANs more interpretable than MLPs?
Increased accuracy through continual learning
Final thoughts

From the humble perceptron to today’s deep, multilayered architectures, neural networks have evolved considerably, driven by key ideas like backpropagation and specialized layers. On this landscape, the recent development of Kolmogorov-Arnold Networks (KANs) introduces an alternative design to traditional neural networks. While it’s too soon to say if this formulation will lead to something bigger, the initial results are quite inspiring. Let’s look into what constitutes a KAN, how to make sense of it, and what it brings to the table.

What is a Kolmogorov-Arnold network?#

Kolmogorov-Arnold networks (KANs) are structurally like multilayer perceptrons (MLPs), i.e., they are fully connected neural networks, but with these differences:

  • Instead of learnable weights on the edges, KANs have non-linear, learnable activation functions on the edges.

  • Instead of fixed biases and activation functions on the nodes, nodes in a KAN only perform the sum of the incoming values.

    In other words, if $\phi_{i,j}$ is the activation function on the edge from a node $i$ to a node $j$, we can view the node $j$ in isolation as computing the sum (a short code sketch of this appears after the figure below):

    $$\sum_{i=1}^n \phi_{i,j}(x_i)$$

Activation functions on edges and summation on nodes
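As a concrete (if simplified) illustration of this summation-only node, here is a minimal Python sketch; the edge activations used are arbitrary placeholder functions, not the learned splines discussed later:

```python
import numpy as np

# Minimal sketch of one KAN node j: it simply sums the outputs of the
# univariate activations phi_{i,j} on its incoming edges. There is no
# weight, bias, or nodal nonlinearity. The activations below are
# arbitrary placeholders, not learned splines.
edge_activations = [np.tanh, np.sin, lambda x: x ** 2]

def kan_node(inputs, activations):
    """Return sum_i phi_{i,j}(x_i) for one node j."""
    return sum(phi(x) for phi, x in zip(activations, inputs))

x = [0.3, -1.2, 0.5]                      # values arriving from three parent nodes
print(kan_node(x, edge_activations))      # a single scalar leaves the node
```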

Now that we understand the structure of KANs, let’s explore their primary purpose: approximating continuous functions.

KAN as a function approximation#

The long and short of it is that a Kolmogorov-Arnold network can be seen as a special type of MLP that approximates a continuous function on $n$ unknowns.

This notion is inspired by the Kolmogorov-Arnold representation theorem (KART), which states that any continuous function $f:[0,1]^n\rightarrow \mathbb{R}$ on $n$ variables can be expressed in the following way:

$$f(x_1, \cdots, x_n) = \sum_{j=1}^{2n+1}\psi_j\left(\sum_{i=1}^n \phi_{i,j}(x_i)\right)$$

Here, the outer functions $\psi_j: \mathbb{R}\rightarrow \mathbb{R}$ and the inner functions $\phi_{i,j}:[0,1]\rightarrow \mathbb{R}$ are all univariate functions, i.e., defined on a single variable. Each function $\psi_j$ is applied to the value $\sum_{i=1}^n \phi_{i,j}(x_i)$, and the outputs of these functions are then summed together. Note that even though the function $f$ is continuous, the univariate functions guaranteed by the theorem can be badly behaved: they need not be smooth and can even be fractal-like.

This double summation in the theorem (with the inner one running from $1$ to $n$, and the outer one running from $1$ to $2n+1$) can be visualized as a network with $n$ input nodes, $2n+1$ hidden nodes, and 1 output node. Here's an example of such a network when $n=2$ and $2n+1=5$:

A KAN with two inputs corresponding to the Kolmogorov-Arnold representation theorem

Unfortunately, the above theorem is not a rock-solid mathematical justification for such a 3-layered KAN because the univariate functions it guarantees are not necessarily smooth or otherwise well behaved. The theorem also does not apply to networks of a more general shape. It does, however, provide an intuition that the authors draw upon: they generalize a KAN by adding more layers, so that structurally it looks just like any other multilayer perceptron (MLP) with an arbitrary number of nodes in each layer.

The interesting part is that the function computed by the KAN can be described as nested sums of function compositions. So, for example, if we think of $\phi_{l,i,j}$ as the activation function on an edge going from node $i$ in layer $l$ to a node $j$ in layer $l+1$, then for a network on four layers, the function computed by the KAN at a particular output node $t$ looks like:

$$\underbrace{\sum_{k=1}^{n_3} \phi_{3,k,t} \Bigg( \underbrace{\sum_{j=1}^{n_2} \phi_{2, j, k} \underbrace{\Big(\sum_{i=1}^{n_1} \phi_{1, i, j}(x_i)\Big)}_{l = 1}}_{l=2}\Bigg)}_{l=3}$$

where $n_l$ is the number of nodes in the $l^{th}$ layer.

An arbitrary KAN on four layers
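To make the nested-sum picture above more tangible, here is a toy Python sketch of a KAN-shaped forward pass. The layer widths are arbitrary, and the edge functions are random placeholder univariate functions rather than trained splines:

```python
import numpy as np

# Toy sketch of the nested-sum forward pass: each layer l stores a grid of
# univariate edge functions phi[l][i][j]; a node j in layer l+1 sums the
# transformed values coming from all nodes i in layer l.

def kan_forward(x, layers):
    """layers[l][i][j] is the univariate function on the edge i -> j."""
    for phis in layers:
        n_out = len(phis[0])
        x = [sum(phis[i][j](x[i]) for i in range(len(x))) for j in range(n_out)]
    return x

rng = np.random.default_rng(0)

def random_phi():
    a, b = rng.normal(size=2)
    return lambda t, a=a, b=b: np.sin(a * t + b)   # stand-in univariate function

# A four-layer shape: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
layers = [
    [[random_phi() for _ in range(3)] for _ in range(2)],
    [[random_phi() for _ in range(2)] for _ in range(3)],
    [[random_phi() for _ in range(1)] for _ in range(2)],
]
print(kan_forward([0.1, 0.7], layers))
```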

In what sense are the edge activation functions learnable?#

The value taken by a univariate function on an edge is the activation value, which changes as the network is trained. However, this change can be thought of as learning the univariate activation function itself. Here are all the parts of the puzzle that can be fit together to see how an activation function is learnable:

  • A univariate function that's continuous and smooth can be well approximated by a spline function (a B-spline curve). A B-spline curve is a function made of polynomial curves of the same degree that are pieced together smoothly at their endpoints. This is illustrated in the figure below.

    Splines are often encountered in graphics and other engineering applications, such as charting a robot’s path or approximating a surface.

The plot of a degree-2 B-spline function (left) and how its segments are smoothly pieced together (right)

  • For those of us familiar with vector spaces, B-spline functions of a fixed degree on a fixed grid form a vector space. So, a B-spline activation function $\phi$ on one variable can be written as a linear combination of B-spline basis functions $B_i$ with coefficients $c_i$. In other words, an activation value $\phi(x)$ can be thought of as:

    $$\phi(x) = \sum_i c_i B_{i}(x)$$

    The following image shows some B-spline basis functions (in different colors) on the left. These combine to form a spline function of degree 2 (using the coefficients 0.5, 0.8, 0.4, 0.9, 0.2, 0.8, respectively).

The spline function of degree 2 (right) is formed using a linear combination of B-spline basis functions (left)
  • Now it turns out that the B-spline basis functions $B_i$ are uniquely determined by two things: the degree $k$ of the polynomial segments and the grid size $G$, where $G$ is the number of uniformly sized intervals along the x-axis (the domain). In the example illustrated above (on the right), $G$ is $4$.

    Observe that each of the four intervals along the domain axis is mapped to a polynomial segment of the spline function.

  • In the KAN implemented in the original paper, both $k$ and $G$ are hyperparameters. This means that the basis functions for the model are predetermined when training starts. Any change to an activation during training is therefore really a change to the coefficients $c_i$ of the linear combination; it is these coefficients that are learned. The sketch below ties these pieces together.
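Here is a small SciPy-based sketch (BSpline.design_matrix requires SciPy 1.8 or newer) of a degree-2 spline on $G = 4$ uniform intervals, written as a linear combination of fixed basis functions with learnable coefficients; the coefficient values are the ones quoted for the figure above:

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch of phi(x) = sum_i c_i B_i(x): degree k = 2, grid of G = 4 intervals
# on [0, 1]. A clamped knot vector gives G + k = 6 basis functions, matching
# the six coefficients used in the figure.
k = 2                                              # degree of each polynomial segment
grid = np.linspace(0.0, 1.0, 5)                    # G = 4 uniform intervals
knots = np.r_[[0.0] * k, grid, [1.0] * k]          # clamped knot vector
c = np.array([0.5, 0.8, 0.4, 0.9, 0.2, 0.8])       # the learnable coefficients

phi = BSpline(knots, c, k)                         # the spline activation function
x = np.array([0.0, 0.3, 0.6, 0.9])
print(phi(x))

# Equivalent view: evaluate the fixed basis functions B_i(x) and take the
# linear combination explicitly. Only c changes during training; the basis
# (determined by k and G) stays fixed.
B = BSpline.design_matrix(x, knots, k).toarray()   # shape: (4 points, 6 bases)
print(B @ c)                                       # same values as phi(x)
```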

Tweaks to activation functions to fix the vanishing gradient#

Motivated by empirical outcomes, more changes are made to the activation functions to fix the vanishing gradient problem.

A basis function is added to the univariate spline function (inspired by the spirit of residual connections seen in ResNets). The basis function used is the well-known SiLU (Sigmoid Linear Unit) function. So, the actual activation function on each edge has this form:

$$\begin{align*} \phi_{l,i,j}(x) &= w_1\, \text{SiLU}(x) + w_2\, \text{Spline}(x)\\ &= w_1\, \text{SiLU}(x) + w_2 \sum_i c_i B_{i}(x) \end{align*}$$

where the $B_i(x)$ are the B-spline basis functions.

In other words, it’s not just the coefficients in the univariate spline functions that are learned; the weights w1w_1 and w2w_2 are learned as well.

Initializations#

The network is initialized as follows (a code sketch of this scheme follows the list):

  • The coefficient $w_1$ is initialized using Xavier initialization.

  • The coefficient $w_2$ is initialized to 1.

  • The coefficients $c_i$ appearing in the linear combination of B-spline basis functions are initialized to small random numbers close to zero (picked from a normal distribution with mean 0 and, typically, standard deviation 0.1).
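Here is a hedged sketch of that scheme for a single edge; the uniform (rather than normal) Xavier variant and the helper's name are assumptions made for illustration:

```python
import numpy as np

# Illustrative initialization for a single edge with G + k spline
# coefficients; fan_in/fan_out describe the layer the edge belongs to.
# This mirrors the bullet list above, not pykan's actual code.
rng = np.random.default_rng(0)

def init_edge(fan_in, fan_out, n_coeffs, sigma=0.1):
    limit = np.sqrt(6.0 / (fan_in + fan_out))       # Xavier (uniform) range
    w1 = rng.uniform(-limit, limit)                 # weight on SiLU(x)
    w2 = 1.0                                        # weight on the spline term
    c = rng.normal(0.0, sigma, size=n_coeffs)       # spline coefficients: mean 0, std 0.1
    return w1, w2, c

print(init_edge(fan_in=2, fan_out=5, n_coeffs=6))
```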

Interacting with the network#

KANs are implemented by the authors in the Python package pykan. Using pykan, it's also possible to interact with the model manually during training (a code sketch follows the list below). For example:

  • The network can be visualized with the edge activation functions visualized as curves.
  • Insignificant nodes with incoming and outgoing edge activation values lower than a certain threshold can be pruned away.
  • Spline functions are defined on bounded regions, but during training the values can fall outside the default bounded region (the grid). In such a case, the grid can be expanded as needed.
  • Based on an activation value along an edge, a corresponding activation function can be automatically proposed and viewed as a symbolic formula.
  • To test out a working hypothesis about an activation function, it can be explicitly specified to replace the automatically assigned activation function.
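The sketch below strings these interactions together, following the usage shown in pykan's own examples. The exact method names have shifted between pykan releases (for instance, fit was previously train, and prune may return a new model), so treat the calls as indicative rather than exact:

```python
import torch
from kan import KAN
from kan.utils import create_dataset

# A small 2-input, 5-hidden, 1-output KAN with degree-3 splines on a grid of 5.
model = KAN(width=[2, 5, 1], grid=5, k=3, seed=0)

# Toy target in the style of the pykan examples: f(x, y) = exp(sin(pi*x) + y^2).
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2)

model.fit(dataset, opt="LBFGS", steps=20, lamb=0.01)  # lamb controls sparsity regularization
model.plot()                    # visualize the network with edge splines drawn as curves
model = model.prune()           # drop insignificant nodes and edges
model.auto_symbolic()           # propose a symbolic formula for each activation
print(model.symbolic_formula())

# To test a working hypothesis, an activation can be pinned explicitly, e.g.,
# fix the edge from input 0 to hidden node 0 in layer 0 to a sine:
# model.fix_symbolic(0, 0, 0, 'sin')
```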

Experimental evidence: KAN vs. MLP#

The experiments conducted in the paper are niche, specific to the disciplines of math and science, but the results are spectacular. The authors fit KANs to 15 well-known special functions from math and physics. They also compare results for math problems (pertaining to knot theory) and a physics problem (Anderson localization).

Examples of knots that are studied as mathematical objects in knot theory

Here are some takeaways:

  • Number of parameters: KANs require more parameters than an MLP of the same size.

  • Parameter efficiency (with a pinch of salt): Smaller KANs may have an accuracy comparable to that of much larger MLPs:

    • For a knot theory problem, 17 knot invariants (features) were passed as input to predict an 18th one. (A knot invariant is a mathematical property of a knot that helps prove when it is inequivalent to another knot, i.e., when it cannot be transformed into the other using twists and stretches.) The authors achieved 81.6% accuracy with a 2-layer KAN with about 200 parameters. Compare this to Google DeepMind's 4-layer MLP for the same problem, which has $3\times10^5$ parameters and 78% accuracy! However, in the aftermath of the first preprint of their paper, the authors were informed that Prof. Shi's lab at Georgia Tech was able to achieve 80% accuracy with an MLP with only 60 parameters.

    • For a problem requiring a solution to a partial differential equation, the error (MSE) for a 2-layer width-10 KAN (with about 100 parameters) was a hundred times lower than that of a 4-layer width-100 MLP (with about 10,000 parameters).

  • Slower training: KANs take more time to train. For some of the experiments, training a KAN is 10 times slower than training an MLP with the same number of parameters.

Regularization of KANs#

When L1 regularization is applied to MLPs, some of the weights become zero. The zero-weighted edges can be thought of as insignificant relationships that are effectively removed from the MLP. This sparsification can help mitigate overfitting and can also lead to a more interpretable neural network.

The authors observed that L1 regularization alone did not achieve sufficient sparsification in the KANs where it was applied; it also led to the duplication of activation functions across many edges. To address these issues, entropy regularization was used alongside L1 regularization.

It's interesting to note that the L1 and entropy regularization terms are defined differently for KANs than they are for MLPs, since the edges carry activation functions instead of weights. A sketch of these definitions follows.

In what way are KANs more interpretable than MLPs?#

The following image, produced using the pykan package after pruning, is taken from the original paper. At first, it's hard to assign meaning to the squiggly lines representing the activation functions until their roles are pointed out.

KANs for multiplication and division. Source: "KAN: Kolmogorov-Arnold Networks" by Liu et al. (https://arxiv.org/pdf/2404.19756)

The interpretation of the KAN for multiplication is that the formula $xy$ is broken down into smaller parts that combine to form an approximation of $xy$. This is illustrated with the help of annotations below:

Interpreting a KAN that multiplies

Two of the edges coming from the input layer appear to have linear curves drawn on them; two look like quadratic parabolas. The nodes are just summing the inputs. In short, by examining the curves along each edge, we can interpret how the different parts of the network function.

Here are the annotations to indicate how the KAN for division is interpreted:

Interpreting a KAN that divides

The claim that a KAN is more interpretable than an MLP is plausible if we are working with a small and sparse network that models a mathematical relationship. Indeed, the authors rediscover a result in knot theory by looking through such an interpretive lens. However, for larger networks designed for more complex tasks, like natural language processing, this claim is less convincing.

Increased accuracy through continual learning#

The authors trained a KAN model in a sequence of steps (on a contrived problem), feeding it new data at each step. They observed that the KAN remembers what was learned in previous steps, whereas an MLP forgets it. The image below (from the paper) shows this preliminary finding for a toy regression problem where the data is generated using Gaussian peaks (see the first row); it shows how well the KAN and MLP models learn over time as they are trained with new data at each successive step (see the second and third rows):

How KANs exhibit continual learning and MLPs exhibit forgetting. Source: "KAN: Kolmogorov-Arnold Networks" by Liu et al. (https://arxiv.org/pdf/2404.19756)

The authors attribute this to the fact that B-spline basis functions are only locally nonzero: the data fed in each successive phase changes only the coefficients of the basis functions that shape local parts of the spline curves, while coefficients learned during previous phases are likely to remain unchanged.
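A quick way to see this locality (a SciPy sketch, not pykan's code): evaluate the B-spline basis at a sample taken from one corner of the domain, and only a handful of adjacent basis functions are nonzero, so a gradient step there cannot disturb the other coefficients:

```python
import numpy as np
from scipy.interpolate import BSpline

# Local support of B-spline bases: a degree-2 spline on G = 8 intervals
# over [-1, 1] has G + k = 10 basis functions, but any single input only
# activates at most k + 1 = 3 of them.
k = 2
grid = np.linspace(-1.0, 1.0, 9)                   # G = 8 uniform intervals
knots = np.r_[[-1.0] * k, grid, [1.0] * k]         # clamped knot vector

x = np.array([0.8])                                # a sample far to the right
B = BSpline.design_matrix(x, knots, k).toarray()   # shape: (1, 10)
print(np.nonzero(B[0])[0])                         # indices of the few nonzero bases
```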

A subsequently published arXiv article notes, however, that this kind of continual learning was not observed in a different setup that used the higher-dimensional MNIST dataset, with the digits 0, 1, and 2 fed in the first phase, 3, 4, and 5 in the second, and 6, 7, and 8 in the third.

Final thoughts#

Based on first impressions, Kolmogorov-Arnold networks are an intriguing variation of neural networks with potential for applications in mathematics and science. Their distinctive way of modeling functions as sums of compositions deserves more attention, creating avenues for further exploration in theoretical areas.

KANs demonstrate one way deep learning continues to evolve, introducing variations on traditional neural network designs. For an in-depth, hands-on introduction to neural networks, explore this comprehensive course by Educative.

Introduction to Deep Learning & Neural Networks

This course is an accumulation of well-grounded knowledge and experience in deep learning. It provides you with the basic concepts you need in order to start working with and training various machine learning models. You will cover both basic and intermediate concepts including but not limited to: convolutional neural networks, recurrent neural networks, generative adversarial networks as well as transformers. After completing this course, you will have a comprehensive understanding of the fundamental architectural components of deep learning. Whether you’re a data and computer scientist, computer and big data engineer, solution architect, or software engineer, you will benefit from this course.

4hrs 30mins
Intermediate
11 Challenges
8 Quizzes

Written By:
Mehvish Poshni