From the humble perceptron to today’s deep, multilayered architectures, neural networks have evolved considerably, driven by key ideas like backpropagation and specialized layers. Against this backdrop, the recent development of Kolmogorov-Arnold Networks (KANs) introduces an alternative design to traditional neural networks. While it’s too soon to say whether this formulation will lead to something bigger, the initial results are quite inspiring. Let’s look into what constitutes a KAN, how to make sense of it, and what it brings to the table.
Kolmogorov-Arnold networks (KANs) are structurally like multilayer perceptrons (MLP), i.e., they are fully-connected neural networks, but with these differences:
Instead of learnable weights on the edges, KANs have non-linear, learnable activation functions on the edges.
Instead of fixed biases and activation functions on the nodes, nodes in a KAN only perform the sum of the incoming values.
In other words, if $\phi_{j,i}$ is the activation function on the edge from a node $i$ to a node $j$, we can view node $j$ in isolation as computing the sum:

$$x_j = \sum_{i} \phi_{j,i}(x_i)$$
Now that we understand the structure of KANs, let’s explore their primary purpose: approximating continuous functions.
The long and short of it is that a Kolmogorov-Arnold network can be seen as a special type of MLP that approximates a continuous function on $n$ unknowns.
This notion is inspired by the Kolmogorov-Arnold Representation theorem (KART), which states that any continuous function $f$ on $n$ variables can be expressed in the following way:

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$$
Here, the outer functions $\Phi_q$ and the inner functions $\phi_{q,p}$ are all univariate functions, i.e., defined on a single variable. Each inner function $\phi_{q,p}$ is applied to the value $x_p$, and the outputs of these functions are then summed together as shown. Note that even though the theorem guarantees that these univariate functions are continuous, they need not be smooth; they can, in fact, be highly irregular.
This double summation in the theorem (with the inner sum running from $p = 1$ to $n$, and the outer sum running from $q = 1$ to $2n + 1$) can be visualized as a network with $n$ input nodes, $2n + 1$ hidden nodes, and 1 output node. Here’s an example of such a network when $n = 2$ and $2n + 1 = 5$:
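For concreteness, with $n = 2$ the representation above reads

$$f(x_1, x_2) = \sum_{q=1}^{5} \Phi_q\bigl(\phi_{q,1}(x_1) + \phi_{q,2}(x_2)\bigr)$$

Each inner sum $\phi_{q,1}(x_1) + \phi_{q,2}(x_2)$ is computed at one of the five hidden nodes, and the output node adds up the five outer terms.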
Unfortunately, the above theorem is not a rock-solid mathematical justification for such a 3-layered KAN, because the univariate functions in the theorem, while continuous, are not necessarily smooth and may be too irregular to learn in practice. The theorem also does not apply to networks of a more general shape. It does, however, provide the intuition that the authors draw upon to generalize a KAN by adding more layers, so that structurally it looks just like any other multilayer perceptron (MLP), with an arbitrary number of nodes in each layer.
The interesting part is that the function computed by the KAN can still be described as nested sums of function compositions. For example, if we think of $\phi_{l,j,i}$ as the activation function on an edge going from node $i$ in layer $l$ to a node $j$ in layer $l+1$, then for a network with four layers (numbered 0 through 3), the function computed by the KAN at a particular output node $m$ looks like:

$$x_{3,m} = \sum_{k=1}^{n_2} \phi_{2,m,k}\left(\sum_{j=1}^{n_1} \phi_{1,k,j}\left(\sum_{i=1}^{n_0} \phi_{0,j,i}(x_i)\right)\right)$$
where $n_l$ is the number of nodes in layer $l$ and $x_i$ denotes the $i$-th input.
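To make this structure concrete, here is a minimal Python sketch (no training, just the forward pass) in which every edge carries a univariate function and every node simply sums the values arriving on its incoming edges. The particular functions and layer widths are made up purely for illustration:

```python
import math

# edges[l][j][i] is the univariate function on the edge from node i in
# layer l to node j in layer l + 1 (arbitrary, hand-picked functions here).
edges = [
    [[math.sin, lambda x: x**2],        # layer 0 -> 1: incoming edges of hidden node 0
     [math.tanh, lambda x: 0.5 * x]],   # layer 0 -> 1: incoming edges of hidden node 1
    [[math.exp, lambda x: -x]],         # layer 1 -> 2: incoming edges of the output node
]

def kan_forward(x, edges):
    """Forward pass: each node sums the edge functions applied to its inputs."""
    values = list(x)
    for layer in edges:
        values = [sum(phi(v) for phi, v in zip(incoming, values))
                  for incoming in layer]
    return values

print(kan_forward([0.3, -1.2], edges))  # a list with one output value
```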
The value produced by a univariate function on an edge is that edge’s activation value, and these values change as the network is trained. This change can be thought of as learning the univariate activation function itself. Here are the parts of the puzzle that fit together to show how an activation function can be learned:
A univariate function that’s continuous and smooth can be well approximated by a spline function, in particular a B-spline curve. A B-spline curve is a function made of polynomial segments of the same degree that are pieced together smoothly at their endpoints. This is illustrated in the figure below.
Splines are often encountered in graphics and other engineering applications, such as charting a robot’s path or approximating a surface.
For those of us familiar with vector spaces, B-spline functions of a given degree over a given grid form a vector space spanned by B-spline basis functions. So, a B-spline activation function on one variable can be written as a linear combination of B-spline basis functions $B_i$ with coefficients $c_i$. In other words, the activation value at $x$ can be thought of as:

$$\mathrm{spline}(x) = \sum_i c_i B_i(x)$$
The following image shows some B-spline basis functions (in different colors) on the left. These combine to form a spline function of degree 2 (using the coefficients 0.5, 0.8, 0.4, 0.9, 0.2, 0.8, respectively).
Now it turns out that the B-spline basis functions are uniquely determined by two things: the degree $k$ of the polynomial segments and the grid size $G$, where $G$ is the number of uniformly sized intervals along the x-axis (the domain). In the example illustrated above (on the right), $G$ is 4.
Observe that each of the four intervals along the domain axis is mapped to a polynomial segment of the spline function.
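As a sanity check, a spline like the one on the right of the figure can be assembled with scipy as exactly this kind of linear combination of basis functions. The clamped, uniform knot vector on $[0, 4]$ is an assumption here, since the figure doesn’t state it:

```python
import numpy as np
from scipy.interpolate import BSpline

k = 2                                    # degree of each polynomial segment
G = 4                                    # grid size: number of intervals on the domain
# Clamped, uniform knot vector on [0, 4] (an assumption about the figure).
knots = np.concatenate(([0] * k, np.linspace(0, 4, G + 1), [4] * k))
coeffs = [0.5, 0.8, 0.4, 0.9, 0.2, 0.8]  # one coefficient per basis function (G + k = 6)

spline = BSpline(knots, coeffs, k)       # the linear combination sum_i c_i * B_i(x)
xs = np.linspace(0, 4, 9)
print(spline(xs))                        # values of the degree-2 piecewise polynomial
```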
In the KAN implemented in the original paper, both $k$ and $G$ are taken as hyperparameters. This means that the basis functions for the model are predetermined when we start training. Any change to an activation during training therefore means that it’s actually just the coefficients $c_i$ in the linear combination that are being learned.
Motivated by empirical results, further changes are made to the activation functions to address the vanishing gradient problem.
A basis function $b(x)$ is added to the univariate spline function (in the spirit of the residual connections seen in ResNets). The basis function used is the well-known SiLU (Sigmoid Linear Unit), $b(x) = \mathrm{silu}(x) = \frac{x}{1 + e^{-x}}$. So, the actual activation function on each edge has this form:

$$\phi(x) = w_b\, b(x) + w_s \sum_i c_i B_i(x)$$
where the $B_i$ are the B-spline basis functions.
In other words, it’s not just the coefficients $c_i$ in the univariate spline functions that are learned; the weights $w_b$ and $w_s$ are learned as well. (A small code sketch of such an edge activation follows the initialization list below.)
The network is initialized as follows:
The weight $w_b$ is initialized using Xavier initialization.
The weight $w_s$ is initialized to 1.
The coefficients $c_i$ appearing in the linear combination of B-spline basis functions are initialized to small random numbers close to zero (drawn from a normal distribution with mean 0 and, typically, standard deviation 0.1).
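Putting these pieces together, here is a minimal PyTorch sketch of a single edge activation of the form $w_b\,\mathrm{silu}(x) + w_s \sum_i c_i B_i(x)$, using the initialization just described. The B-spline basis is evaluated with a plain Cox-de Boor recursion on a clamped uniform grid; this is an illustrative stand-in, not pykan’s actual implementation, and the Xavier step is approximated for a single scalar weight:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeActivation(nn.Module):
    """One learnable KAN edge: phi(x) = w_b * silu(x) + w_s * sum_i c_i * B_i(x)."""

    def __init__(self, degree=3, grid_size=5, x_min=-1.0, x_max=1.0):
        super().__init__()
        self.degree = degree
        # Clamped, uniform knot vector with `grid_size` intervals on [x_min, x_max].
        grid = torch.linspace(x_min, x_max, grid_size + 1)
        knots = torch.cat([grid[:1].repeat(degree), grid, grid[-1:].repeat(degree)])
        self.register_buffer("knots", knots)
        n_basis = grid_size + degree
        # Initialization as described in the list above.
        self.c = nn.Parameter(0.1 * torch.randn(n_basis))             # c_i ~ N(0, 0.1)
        self.w_s = nn.Parameter(torch.ones(()))                       # w_s = 1
        self.w_b = nn.Parameter(torch.empty(()).uniform_(-1.0, 1.0))  # stand-in for Xavier init

    def basis(self, x):
        """Cox-de Boor recursion: returns the basis values B_i(x), shape (..., n_basis)."""
        t, k = self.knots, self.degree
        x = x.unsqueeze(-1)
        b = ((x >= t[:-1]) & (x < t[1:])).to(x.dtype)  # degree-0 indicator functions
        for d in range(1, k + 1):
            # Clamping avoids division by zero at the repeated boundary knots,
            # where the corresponding lower-degree basis values are zero anyway.
            left = (x - t[: -(d + 1)]) / (t[d:-1] - t[: -(d + 1)]).clamp(min=1e-8)
            right = (t[d + 1:] - x) / (t[d + 1:] - t[1:-d]).clamp(min=1e-8)
            b = left * b[..., :-1] + right * b[..., 1:]
        return b

    def forward(self, x):
        spline = self.basis(x) @ self.c                # sum_i c_i * B_i(x)
        return self.w_b * F.silu(x) + self.w_s * spline

phi = EdgeActivation()
print(phi(torch.tensor([-0.7, 0.0, 0.8])))  # three activation values on one edge
```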
KANs are implemented by the authors in the Python package pykan. Using pykan, it’s also possible to interact with the model manually during training, for example by pruning away insignificant edges and nodes, or by fixing an activation function to a symbolic form (such as sin, exp, or x²).
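To give a feel for the workflow, here is a sketch modeled on the hello-world example from the pykan repository. The package’s API has changed across versions (for instance, newer releases use model.fit rather than model.train), so treat the exact calls as indicative rather than authoritative:

```python
import torch
from kan import *  # KAN, create_dataset, etc., as in the repository's example

# A [2, 5, 1] KAN: 2 inputs, 5 hidden nodes, 1 output; degree-3 splines on a grid of 5 intervals.
model = KAN(width=[2, 5, 1], grid=5, k=3)

# Toy target f(x1, x2) = exp(sin(pi * x1) + x2^2), sampled via pykan's dataset helper.
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2)

# Train with sparsification (lamb controls the regularization strength), then interact manually.
model.train(dataset, opt="LBFGS", steps=20, lamb=0.01)
model = model.prune()   # drop insignificant edges and nodes
model.plot()            # visualize the learned activation functions on each edge
```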
The experiments conducted in the paper are quite niche, specific to the disciplines of math and science, but the results are spectacular. The authors fit KANs to 15 well-known special functions from math and physics and compare their accuracy with that of MLPs. They also compare results for math problems (pertaining to knot theory) and a physics problem (Anderson localization).
Here are some takeaways:
Number of parameters: KANs require more parameters than an MLP of the same width and depth.
Parameter efficiency (with a pinch of salt): Smaller KANs may have an accuracy comparable to that of much larger MLPs:
For a problem requiring the solution of a partial differential equation, the error (MSE) for a 2-layer, width-10 KAN (with 100 parameters) was a hundred times lower than that of a 4-layer, width-100 MLP (with 10,000 parameters).
Slower training: KANs take more time to train. For some of the experiments, training a KAN was 10 times slower than training an MLP with the same number of parameters.
When L1 regularization is applied to MLPs, some of the weights become zero. The zero-weighted edges can be thought of as insignificant relationships that are effectively removed from the MLP. This sparsification can help mitigate overfitting and can also lead to a more interpretable neural network.
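For reference, here is what plain L1 regularization looks like for an MLP in PyTorch. This is a generic sketch, not tied to the paper’s setup; the penalty simply pushes individual weights toward zero:

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
x, y = torch.randn(32, 4), torch.randn(32, 1)    # toy batch

lam = 1e-3                                       # regularization strength
mse = nn.functional.mse_loss(mlp(x), y)
l1 = sum(p.abs().sum() for p in mlp.parameters())
loss = mse + lam * l1                            # L1 term drives individual weights toward zero
loss.backward()
```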
The authors observed that L1 regularization alone did not achieve sufficient sparsification in the KANs where it was applied; it also led to the duplication of activation functions across many edges. To address these issues, entropy regularization was used alongside L1 regularization.
It’s interesting to note that the L1 and entropy regularization terms are defined differently for KANs than they are for MLPs, since the edges carry activation functions rather than weights.
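To give a flavor of the difference (paraphrasing the paper’s definitions): since an edge no longer has a single weight, the L1 norm of an activation function $\phi$ is taken to be its average magnitude over the $N_p$ training inputs, these are summed to get a layer norm, and the entropy term is computed over the normalized per-edge norms:

$$|\phi|_1 = \frac{1}{N_p} \sum_{s=1}^{N_p} \left|\phi\!\left(x^{(s)}\right)\right|, \qquad |\Phi|_1 = \sum_{i,j} |\phi_{i,j}|_1, \qquad S(\Phi) = -\sum_{i,j} \frac{|\phi_{i,j}|_1}{|\Phi|_1} \log \frac{|\phi_{i,j}|_1}{|\Phi|_1}$$

Both terms, summed over all layers and weighted by tunable coefficients, are added to the prediction loss during training.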
The following image, produced using the pykan package after pruning, is taken from the original paper. At first, it’s hard to assign meaning to the squiggly lines representing the activation functions until their roles are pointed out.
The interpretation of the KAN for multiplication is that the computation is broken down into smaller parts that combine to form an approximation of $xy$. This is illustrated with the help of the annotations below:
Two of the edges coming from the input layer appear to have linear curves drawn on them; two look like quadratic parabolas. The nodes are just summing the inputs. In short, by examining the curves along each edge, we can interpret how the different parts of the network function.
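The exact decomposition the network found is not spelled out here, but one identity consistent with this description (linear edges feeding one hidden node, quadratic edges feeding the other) is

$$xy = \tfrac{1}{2}(x + y)^2 - \tfrac{1}{2}\left(x^2 + y^2\right)$$

where one hidden node receives $x + y$ through linear edges, the other receives $x^2 + y^2$ through quadratic edges, and the edges into the output node square the first and negate (and scale) the second. Whether the trained network finds exactly this decomposition or a variant of it, the point stands: each edge’s curve exposes its role in the overall formula.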
Here are the annotations to indicate how the KAN for division is interpreted:
The claim that a KAN is more interpretable than an MLP is plausible if we are working with a small and sparse network that models a mathematical relationship. Indeed, the authors rediscover a result in knot theory by looking through such an interpretive lens. However, for larger networks designed for more complex tasks, like natural language processing, this claim is less convincing.
The authors trained a KAN in a sequence of steps (on a contrived problem), feeding it new data at each step, and observed that the KAN remembered what it had learned in previous steps, whereas an MLP forgot it. Below is the image (from the paper) that shows this preliminary finding for a toy regression problem where the data is generated using a Gaussian function (see the first row); it shows how well the KAN and MLP models learn over time as they are trained on new data at each successive step (see the second and third rows):
The authors attribute this to the fact that B-spline basis functions are nonzero only locally: the data fed in each successive phase changes the coefficients of the basis functions that shape nearby parts of the spline curves, while coefficients learned during earlier phases are likely to remain unchanged.
A subsequently published arXiv article notes, however, that this kind of continual learning was not seen for a different setup that used the higher dimensional MNIST dataset with the digits 0, 1, and 2 fed in the first phase, then 3, 4, and 5 in the second, and 6, 7, and 8 in the third phase.
Based on first impressions, Kolmogorov-Arnold networks are an intriguing variation on neural networks with potential applications in mathematics and science. Their distinctive way of modeling functions as sums of compositions of univariate functions deserves more attention and opens avenues for further exploration in theoretical areas.
KANs demonstrate one way deep learning continues to evolve, introducing variations on traditional neural network designs. For an in-depth, hands-on introduction to neural networks, explore this comprehensive course by Educative.
This course is an accumulation of well-grounded knowledge and experience in deep learning. It provides you with the basic concepts you need in order to start working with and training various machine learning models. You will cover both basic and intermediate concepts, including but not limited to convolutional neural networks, recurrent neural networks, generative adversarial networks, and transformers. After completing this course, you will have a comprehensive understanding of the fundamental architectural components of deep learning. Whether you’re a data or computer scientist, a computer or big data engineer, a solution architect, or a software engineer, you will benefit from this course.