
Making sense of Kolmogorov-Arnold Networks (KANs)

12 min read
Jan 06, 2025
Contents
What is a Kolmogorov-Arnold network?
KAN as a function approximation
In what sense are the edge activation functions learnable?
Tweaks to activation functions to fix the vanishing gradient
Initializations
Interacting with the network
Experimental evidence: KAN vs. MLP
Regularization of KANs
In what way are KANs more interpretable than MLPs?
Increased accuracy through continual learning
Final thoughts

From the humble perceptron to today’s deep, multilayered architectures, neural networks have evolved considerably, driven by key ideas like backpropagation and specialized layers. On this landscape, the recent development of Kolmogorov-Arnold Networks (KANs) introduces an alternative design to traditional neural networks. While it’s too soon to say if this formulation will lead to something bigger, the initial results are quite inspiring. Let’s look into what constitutes a KAN, how to make sense of it, and what it brings to the table.

What is a Kolmogorov-Arnold network?#

Kolmogorov-Arnold networks (KANs) are structurally like multilayer perceptrons (MLPs), i.e., they are fully connected neural networks, but with these differences:

  • Instead of learnable weights on the edges, KANs have non-linear, learnable activation functions on the edges.

  • Instead of fixed biases and activation functions on the nodes, nodes in a KAN only perform the sum of the incoming values.

    In other words, if $\phi_{i,j}$ is the activation function on the edge from a node $i$ to a node $j$, we can view the node $j$ in isolation as computing the sum (a short code sketch of this appears after the figure below):

    $$\sum_{i=1}^n \phi_{i,j}(x_i)$$

Activation functions on edges and summation on nodes
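As a concrete (if simplified) illustration of this summation-only node, here is a minimal Python sketch; the edge activations used are arbitrary placeholder functions, not the learned splines discussed later:

```python
import numpy as np

# Minimal sketch of one KAN node j: it simply sums the outputs of the
# univariate activations phi_{i,j} on its incoming edges. There is no
# weight, bias, or nodal nonlinearity. The activations below are
# arbitrary placeholders, not learned splines.
edge_activations = [np.tanh, np.sin, lambda x: x ** 2]

def kan_node(inputs, activations):
    """Return sum_i phi_{i,j}(x_i) for one node j."""
    return sum(phi(x) for phi, x in zip(activations, inputs))

x = [0.3, -1.2, 0.5]                      # values arriving from three parent nodes
print(kan_node(x, edge_activations))      # a single scalar leaves the node
```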

Now that we understand the structure of KANs, let’s explore their primary purpose: approximating continuous functions.

KAN as a function approximation#

The long and short of it is that a Kolmogorov-Arnold network can be seen as a special type of MLP that approximates a continuous function on $n$ unknowns.

This notion is inspired by the Kolmogorov-Arnold representation theorem (KART), which states that any continuous function $f:[0,1]^n\rightarrow \mathbb{R}$ on $n$ variables can be expressed in the following way:

$$f(x_1, \cdots, x_n) = \sum_{j=1}^{2n+1}\psi_j\left(\sum_{i=1}^n \phi_{i,j}(x_i)\right)$$

Here, the outer functions $\psi_j: \mathbb{R}\rightarrow \mathbb{R}$ and the inner functions $\phi_{i,j}:[0,1]\rightarrow \mathbb{R}$ are all univariate functions, i.e., defined on a single variable. Each function $\psi_j$ is applied to the value $\sum_{i=1}^n \phi_{i,j}(x_i)$, and the outputs of these functions are then summed together. Note that even though the function $f$ is continuous, the univariate functions guaranteed by the theorem can be badly behaved: they need not be smooth and can even be fractal-like.

This double summation in the theorem (with the inner one running from $1$ to $n$, and the outer one running from $1$ to $2n+1$) can be visualized as a network with $n$ input nodes, $2n+1$ hidden nodes, and 1 output node. Here's an example of such a network when $n=2$ and $2n+1=5$:

A KAN with two inputs corresponding to the Kolmogorov-Arnold representation theorem

Unfortunately, the above theorem is not a rock-solid mathematical justification for such a 3-layered KAN because the univariate functions it guarantees are not necessarily smooth or otherwise well behaved. The theorem also does not apply to networks of a more general shape. It does, however, provide an intuition that the authors draw upon: they generalize a KAN by adding more layers, so that structurally it looks just like any other multilayer perceptron (MLP) with an arbitrary number of nodes in each layer.

The interesting part is that the function computed by the KAN can be described as nested sums of function compositions. So, for example, if we think of $\phi_{l,i,j}$ as the activation function on an edge going from node $i$ in layer $l$ to a node $j$ in layer $l+1$, then for a network on four layers, the function computed by the KAN at a particular output node $t$ looks like:

$$\underbrace{\sum_{k=1}^{n_3} \phi_{3,k,t} \Bigg( \underbrace{\sum_{j=1}^{n_2} \phi_{2, j, k} \underbrace{\Big(\sum_{i=1}^{n_1} \phi_{1, i, j}(x_i)\Big)}_{l = 1}}_{l=2}\Bigg)}_{l=3}$$

where $n_l$ is the number of nodes in the $l^{th}$ layer.

An arbitrary KAN on four layers
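To make the nested-sum picture above more tangible, here is a toy Python sketch of a KAN-shaped forward pass. The layer widths are arbitrary, and the edge functions are random placeholder univariate functions rather than trained splines:

```python
import numpy as np

# Toy sketch of the nested-sum forward pass: each layer l stores a grid of
# univariate edge functions phi[l][i][j]; a node j in layer l+1 sums the
# transformed values coming from all nodes i in layer l.

def kan_forward(x, layers):
    """layers[l][i][j] is the univariate function on the edge i -> j."""
    for phis in layers:
        n_out = len(phis[0])
        x = [sum(phis[i][j](x[i]) for i in range(len(x))) for j in range(n_out)]
    return x

rng = np.random.default_rng(0)

def random_phi():
    a, b = rng.normal(size=2)
    return lambda t, a=a, b=b: np.sin(a * t + b)   # stand-in univariate function

# A four-layer shape: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
layers = [
    [[random_phi() for _ in range(3)] for _ in range(2)],
    [[random_phi() for _ in range(2)] for _ in range(3)],
    [[random_phi() for _ in range(1)] for _ in range(2)],
]
print(kan_forward([0.1, 0.7], layers))
```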

In what sense are the edge activation functions learnable?#

The value taken by a univariate function on an edge is the activation value, which changes as the network is trained. However, this change can be thought of as learning the univariate activation function itself. Here are all the parts of the puzzle that can be fit together to see how an activation function is learnable:

  • A univariate function that's continuous and smooth can be well approximated by a spline function (a B-spline curve). A B-spline curve is a function made of polynomial curves of the same degree that are pieced together smoothly at their endpoints. This is illustrated in the figure below.

    Splines are often encountered in graphics and other engineering applications, such as charting a robot’s path or approximating a surface.

The plot of a degree-2 B-spline function (left) and how its segments are smoothly pieced together (right)

  • For those of us familiar with vector spaces, B-spline functions of a fixed degree on a fixed grid form a vector space. So, a B-spline activation function $\phi$ on one variable can be written as a linear combination of B-spline basis functions $B_i$ with coefficients $c_i$. In other words, an activation value $\phi(x)$ can be thought of as:

    $$\phi(x) = \sum_i c_i B_{i}(x)$$

    The following image shows some B-spline basis functions (in different colors) on the left. These combine to form a spline function of degree 2 (using the coefficients 0.5, 0.8, 0.4, 0.9, 0.2, 0.8, respectively).

The spline function of degree 2 (right) is formed using a linear combination of B-spline basis functions (left)
  • Now it turns out that the B-spline basis functions $B_i$ are uniquely determined by two things: the degree $k$ of the polynomial segments and the grid size $G$, where $G$ is the number of uniformly sized intervals along the x-axis (the domain). In the example illustrated above (on the right), $G$ is $4$.

    Observe that each of the four intervals along the domain axis is mapped to a polynomial segment of the spline function.

  • In the KAN implemented in the original paper, both $k$ and $G$ are hyperparameters. This means that the basis functions for the model are predetermined when training starts. Any change to an activation during training is therefore really a change to the coefficients $c_i$ of the linear combination; it is these coefficients that are learned. The sketch below ties these pieces together.
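Here is a small SciPy-based sketch (BSpline.design_matrix requires SciPy 1.8 or newer) of a degree-2 spline on $G = 4$ uniform intervals, written as a linear combination of fixed basis functions with learnable coefficients; the coefficient values are the ones quoted for the figure above:

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch of phi(x) = sum_i c_i B_i(x): degree k = 2, grid of G = 4 intervals
# on [0, 1]. A clamped knot vector gives G + k = 6 basis functions, matching
# the six coefficients used in the figure.
k = 2                                              # degree of each polynomial segment
grid = np.linspace(0.0, 1.0, 5)                    # G = 4 uniform intervals
knots = np.r_[[0.0] * k, grid, [1.0] * k]          # clamped knot vector
c = np.array([0.5, 0.8, 0.4, 0.9, 0.2, 0.8])       # the learnable coefficients

phi = BSpline(knots, c, k)                         # the spline activation function
x = np.array([0.0, 0.3, 0.6, 0.9])
print(phi(x))

# Equivalent view: evaluate the fixed basis functions B_i(x) and take the
# linear combination explicitly. Only c changes during training; the basis
# (determined by k and G) stays fixed.
B = BSpline.design_matrix(x, knots, k).toarray()   # shape: (4 points, 6 bases)
print(B @ c)                                       # same values as phi(x)
```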

Tweaks to activation functions to fix the vanishing gradient#

Motivated by empirical outcomes, more changes are made to the activation functions to fix the vanishing gradient problem.

A basis function is added to the univariate spline function (inspired by the spirit of residual connections seen in ResNets). The basis function used is the well-known SiLU (Sigmoid Linear Unit) function. So, the actual activation function on each edge has this form:

$$\begin{align*} \phi_{l,i,j}(x) &= w_1\, \text{SiLU}(x) + w_2\, \text{Spline}(x)\\ &= w_1\, \text{SiLU}(x) + w_2 \sum_i c_i B_{i}(x) \end{align*}$$

where the $B_i(x)$ are the B-spline basis functions.

In other words, it’s not just the coefficients in the univariate spline functions that are learned; the weights w1w_1 and w2w_2 are learned as well.

Initializations#

The network is initialized as follows (a code sketch of this scheme follows the list):

  • The coefficient $w_1$ is initialized using Xavier initialization.

  • The coefficient $w_2$ is initialized to 1.

  • The coefficients $c_i$ appearing in the linear combination of B-spline basis functions are initialized to small random numbers close to zero (picked from a normal distribution with mean 0 and, typically, standard deviation 0.1).
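Here is a hedged sketch of that scheme for a single edge; the uniform (rather than normal) Xavier variant and the helper's name are assumptions made for illustration:

```python
import numpy as np

# Illustrative initialization for a single edge with G + k spline
# coefficients; fan_in/fan_out describe the layer the edge belongs to.
# This mirrors the bullet list above, not pykan's actual code.
rng = np.random.default_rng(0)

def init_edge(fan_in, fan_out, n_coeffs, sigma=0.1):
    limit = np.sqrt(6.0 / (fan_in + fan_out))       # Xavier (uniform) range
    w1 = rng.uniform(-limit, limit)                 # weight on SiLU(x)
    w2 = 1.0                                        # weight on the spline term
    c = rng.normal(0.0, sigma, size=n_coeffs)       # spline coefficients: mean 0, std 0.1
    return w1, w2, c

print(init_edge(fan_in=2, fan_out=5, n_coeffs=6))
```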

Interacting with the network#

KANs are implemented by the authors in the Python package pykan. Using pykan, it's also possible to interact with the model manually during training (a code sketch follows the list below). For example:

  • The network can be visualized with the edge activation functions visualized as curves.
  • Insignificant nodes with incoming and outgoing edge activation values lower than a certain threshold can be pruned away.
  • Spline functions are defined on bounded regions, but during training the values can fall outside the default bounded region (the grid). In such a case, the grid can be expanded as needed.
  • Based on an activation value along an edge, a corresponding activation function can be automatically proposed and viewed as a symbolic formula.
  • To test out a working hypothesis about an activation function, it can be explicitly specified to replace the automatically assigned activation function.
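The sketch below strings these interactions together, following the usage shown in pykan's own examples. The exact method names have shifted between pykan releases (for instance, fit was previously train, and prune may return a new model), so treat the calls as indicative rather than exact:

```python
import torch
from kan import KAN
from kan.utils import create_dataset

# A small 2-input, 5-hidden, 1-output KAN with degree-3 splines on a grid of 5.
model = KAN(width=[2, 5, 1], grid=5, k=3, seed=0)

# Toy target in the style of the pykan examples: f(x, y) = exp(sin(pi*x) + y^2).
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2)

model.fit(dataset, opt="LBFGS", steps=20, lamb=0.01)  # lamb controls sparsity regularization
model.plot()                    # visualize the network with edge splines drawn as curves
model = model.prune()           # drop insignificant nodes and edges
model.auto_symbolic()           # propose a symbolic formula for each activation
print(model.symbolic_formula())

# To test a working hypothesis, an activation can be pinned explicitly, e.g.,
# fix the edge from input 0 to hidden node 0 in layer 0 to a sine:
# model.fix_symbolic(0, 0, 0, 'sin')
```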

Experimental evidence: KAN vs. MLP#

The experiments conducted in the paper are niche, specific to the disciplines of math and science, but the results are spectacular. The authors fit KANs to 15 well-known special functions from math and physics. They also compare results for math problems (pertaining to knot theory) and a physics problem (Anderson localization).

Examples of knots that are studied as mathematical objects in knot theory

Here are some takeaways:

  • Number of parameters: KANs require more parameters than an MLP of the same size.

  • Parameter efficiency (with a pinch of salt): Smaller KANs may have an accuracy comparable to that of much larger MLPs:

    • For a knot theory problem, 17 knot invariants (features) were passed as input to predict an 18th one. (A knot invariant is a mathematical property of a knot that helps prove when it is inequivalent to another knot, i.e., when it cannot be transformed into the other using twists and stretches.) The authors achieved 81.6% accuracy with a 2-layer KAN with about 200 parameters. Compare this to Google DeepMind's 4-layer MLP for the same problem, which has $3\times10^5$ parameters and 78% accuracy! However, in the aftermath of the first preprint of their paper, the authors were informed that Prof. Shi's lab at Georgia Tech was able to achieve 80% accuracy with an MLP with only 60 parameters.

    • For a problem requiring a solution to a partial differential equation, the error (MSE) for a 2-layer width-10 KAN (with about 100 parameters) was a hundred times lower than that of a 4-layer width-100 MLP (with about 10,000 parameters).

  • Slower training: KANs take more time to train. For some of the experiments, training a KAN is 10 times slower than training an MLP with the same number of parameters.

Regularization of KANs#

When L1 regularization is applied to MLPs, some of the weights become zero. The zero-weighted edges can be thought of as insignificant relationships that are effectively removed from the MLP. This sparsification can help mitigate overfitting and can also lead to a more interpretable neural network.

The authors observed that L1 regularization alone did not achieve sufficient sparsification in the KANs where it was applied; it also led to the duplication of activation functions across many edges. To address these issues, entropy regularization was used alongside L1 regularization.

It's interesting to note that the L1 and entropy regularization terms are defined differently for KANs than they are for MLPs, since the edges carry activation functions instead of weights. A sketch of these definitions follows.

In what way are KANs more interpretable than MLPs?#

The following image, produced using the pykan package after pruning, is taken from the original paper. At first, it's hard to assign meaning to the squiggly lines representing the activation functions until their roles are pointed out.

KANs for multiplication and division. Source: "KAN: Kolmogorov-Arnold Networks" by Liu et al. (https://arxiv.org/pdf/2404.19756)

The interpretation of the KAN for multiplication is that the formula $xy$ is broken down into smaller parts that combine to form an approximation of $xy$. This is illustrated with the help of annotations below:

Interpreting a KAN that multiplies

Two of the edges coming from the input layer appear to have linear curves drawn on them; two look like quadratic parabolas. The nodes are just summing the inputs. In short, by examining the curves along each edge, we can interpret how the different parts of the network function.

Here are the annotations to indicate how the KAN for division is interpreted:

Interpreting a KAN that divides

The claim that a KAN is more interpretable than an MLP is plausible if we are working with a small and sparse network that models a mathematical relationship. Indeed, the authors rediscover a result in knot theory by looking through such an interpretive lens. However, for larger networks designed for more complex tasks, like natural language processing, this claim is less convincing.

Increased accuracy through continual learning#

The authors trained a KAN model in a sequence of steps (on a contrived problem), feeding it new data at each step. They observed that the KAN remembers what was learned in previous steps, whereas an MLP forgets it. The image below (from the paper) shows this preliminary finding for a toy regression problem where the data is generated using Gaussian peaks (see the first row); it shows how well the KAN and MLP models learn over time as they are trained with new data at each successive step (see the second and third rows):

How KANs exhibit continual learning and MLPs exhibit forgetting. Source: "KAN: Kolmogorov-Arnold Networks" by Liu et al. (https://arxiv.org/pdf/2404.19756)

The authors attribute this to the fact that B-spline basis functions are only locally nonzero: the data fed in each successive phase changes only the coefficients of the basis functions that shape local parts of the spline curves, while coefficients learned during previous phases are likely to remain unchanged.
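A quick way to see this locality (a SciPy sketch, not pykan's code): evaluate the B-spline basis at a sample taken from one corner of the domain, and only a handful of adjacent basis functions are nonzero, so a gradient step there cannot disturb the other coefficients:

```python
import numpy as np
from scipy.interpolate import BSpline

# Local support of B-spline bases: a degree-2 spline on G = 8 intervals
# over [-1, 1] has G + k = 10 basis functions, but any single input only
# activates at most k + 1 = 3 of them.
k = 2
grid = np.linspace(-1.0, 1.0, 9)                   # G = 8 uniform intervals
knots = np.r_[[-1.0] * k, grid, [1.0] * k]         # clamped knot vector

x = np.array([0.8])                                # a sample far to the right
B = BSpline.design_matrix(x, knots, k).toarray()   # shape: (1, 10)
print(np.nonzero(B[0])[0])                         # indices of the few nonzero bases
```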

A subsequently published arXiv article notes, however, that this kind of continual learning was not observed in a different setup that used the higher-dimensional MNIST dataset, with the digits 0, 1, and 2 fed in the first phase, 3, 4, and 5 in the second, and 6, 7, and 8 in the third.

Final thoughts#

Based on first impressions, Kolmogorov-Arnold networks are an intriguing variation of neural networks with potential for applications in mathematics and science. Their distinctive way of modeling functions as sums of compositions deserves more attention, creating avenues for further exploration in theoretical areas.

KANs demonstrate one way deep learning continues to evolve, introducing variations on traditional neural network designs. For an in-depth, hands-on introduction to neural networks, explore this comprehensive course by Educative.

Introduction to Deep Learning & Neural Networks

This course is an accumulation of well-grounded knowledge and experience in deep learning. It provides you with the basic concepts you need in order to start working with and training various machine learning models. You will cover both basic and intermediate concepts including but not limited to: convolutional neural networks, recurrent neural networks, generative adversarial networks as well as transformers. After completing this course, you will have a comprehensive understanding of the fundamental architectural components of deep learning. Whether you’re a data and computer scientist, computer and big data engineer, solution architect, or software engineer, you will benefit from this course.

4hrs 30mins
Intermediate
11 Challenges
8 Quizzes

Written By:
Mehvish Poshni