Profile photo for Quora User

Briefly speaking, a kernel is a shortcut that helps us do certain calculation faster which otherwise would involve computations in higher dimensional space.

Mathematical definition: K(x, y) = <f(x), f(y)>. Here K is the kernel function, x and y are n-dimensional inputs, f is a map from n-dimensional to m-dimensional space, and <x, y> denotes the dot product. Usually m is much larger than n.

Intuition: normally, calculating <f(x), f(y)> requires us to compute f(x) and f(y) first and then take their dot product. These two steps can be quite expensive, as they involve manipulations in the m-dimensional space, where m can be very large. But after all the trouble of going to the high-dimensional space, the result of the dot product is just a scalar: we come back to one-dimensional space again! Now, the question we have is: do we really need to go through all the trouble to get this one number? Do we really have to go to the m-dimensional space? The answer is no, if you find a clever kernel.

Simple example: x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y) = (<x, y>)^2.

Let's plug in some numbers to make this more intuitive: suppose x = (1, 2, 3); y = (4, 5, 6). Then:
f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
<f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024

That is a lot of algebra, mainly because f is a mapping from 3-dimensional to 9-dimensional space.

Now let us use the kernel instead:
K(x, y) = (4 + 10 + 18)^2 = 32^2 = 1024
Same result, but this calculation is so much easier.
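
As a minimal sketch of this shortcut, the worked example above can be checked in a few lines of NumPy; the helper names f and K below simply mirror the notation used in this answer and are not part of the original text.

```python
import numpy as np

def f(v):
    # explicit feature map: all pairwise products v_i * v_j (3-dim input -> 9-dim output)
    return np.outer(v, v).ravel()

def K(x, y):
    # kernel shortcut: square of the ordinary dot product, computed in the original 3-dim space
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(np.dot(f(x), f(y)))  # 1024.0 -- the long way, through the 9-dimensional space
print(K(x, y))             # 1024.0 -- the shortcut, never leaving 3 dimensions
```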

Additional beauty of kernels: kernels allow us to do stuff in infinite dimensions! Sometimes going to a higher dimension is not just computationally expensive, but also impossible: f(x) can be a mapping from n dimensions to infinite dimensions, which we may have little idea of how to deal with. Then the kernel gives us a wonderful shortcut.

Relation to SVM: now, how is this related to SVM? The idea of SVM is that y = w phi(x) + b, where w is the weight vector, phi is the feature map, and b is the bias. If y > 0, we classify the datum to class 1, else to class 0. We want to find a set of weights and a bias such that the margin is maximized. Previous answers mention that kernels make data linearly separable for SVM. I think a more precise way to put this is: kernels do not make the data linearly separable; the feature map phi(x) makes the data linearly separable. The kernel is there to make the calculation faster and easier, especially when the feature vector phi(x) is of very high dimension (for example, x1, x2, ..., x_D, x1^2, x2^2, ..., x_D^2, ...).

Why it can also be understood as a measure of similarity:
if we put the definition of the kernel above, <f(x), f(y)>, in the context of SVM and feature vectors, it becomes <phi(x), phi(y)>. The inner product measures the projection of phi(x) onto phi(y), or colloquially, how much overlap x and y have in their feature space. In other words, how similar they are.

Profile photo for Bharath Hariharan

Great answers here already, but there are some additional things that I would want to say. So here goes.

What are kernels?
A kernel is a similarity function. It is a function that you, as the domain expert, provide to a machine learning algorithm. It takes two inputs and spits out how similar they are.

Suppose your task is to learn to classify images. You have (image, label) pairs as training data. Consider the typical machine learning pipeline: you take your images, you compute features, you string the features for each image into a vector, and you feed these "feature vectors" and labels into a learning algorithm.

Data --> Features --> Learning algorithm

Kernels offer an alternative. Instead of defining a slew of features, you define a single kernel function to compute similarity between images. You provide this kernel, together with the images and labels to the learning algorithm, and out comes a classifier.

Of course, the standard SVM/logistic regression/perceptron formulation doesn't work with kernels: it works with feature vectors. How on earth do we use kernels then? Two beautiful mathematical facts come to our rescue:

  1. Under some conditions, every kernel function can be expressed as a dot product in a (possibly infinite dimensional) feature space (Mercer's theorem).
  2. Many machine learning algorithms can be expressed entirely in terms of dot products.

These two facts mean that I can take my favorite machine learning algorithm, express it in terms of dot products, and then since my kernel is also a dot product in some space, replace the dot product by my favorite kernel. Voila!
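
To make fact 2 concrete, here is a hedged sketch of a kernelized perceptron, one of the simplest algorithms that can be written purely in terms of dot products; the kernel choice, function names, and training loop below are illustrative assumptions rather than anything from this answer.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # a dot product in an infinite-dimensional feature space, computed in closed form
    return np.exp(-gamma * np.linalg.norm(a - b) ** 2)

def kernel_perceptron(X, y, kernel, epochs=10):
    """Train a perceptron that touches the data only through kernel evaluations."""
    n = len(X)
    alpha = np.zeros(n)              # one coefficient per training example
    for _ in range(epochs):
        for i in range(n):
            score = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n))
            if np.sign(score) != y[i]:
                alpha[i] += 1.0      # mistake-driven update
    return alpha
```

Swapping rbf_kernel for any other valid kernel changes the feature space without touching the algorithm itself.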

Why kernels?
Why kernels, as opposed to feature vectors? One big reason is that in many cases, computing the kernel is easy, but computing the feature vector corresponding to the kernel is really, really hard. The feature vector for even simple kernels can blow up in size, and for kernels like the RBF kernel (k(x, y) = exp(-||x - y||^2), see Radial basis function kernel) the corresponding feature vector is infinite dimensional. Yet computing the kernel is almost trivial.

Many machine learning algorithms can be written to only use dot products, and then we can replace the dot products with kernels. By doing so, we don't have to use the feature vector at all. This means that we can work with highly complex, efficient-to-compute, and yet high-performing kernels without ever having to write down the huge and potentially infinite-dimensional feature vector. Thus, if not for the ability to use kernel functions directly, we would be stuck with relatively low-dimensional, low-performance feature vectors. This "trick" is called the kernel trick (Kernel trick).

Endnote
I want to clear up two confusions which seem prevalent on this page:

  1. A function that transforms one feature vector into a higher-dimensional feature vector is not a kernel function. Thus f(x) = [x, x^2] is not a kernel; it is simply a new feature vector. You do not need kernels to do this. You need kernels if you want to do this, or more complicated feature transformations, without blowing up dimensionality.
  2. A kernel is not restricted to SVMs. Any learning algorithm that only works with dot products can be written down using kernels. The idea of SVMs is beautiful, the kernel trick is beautiful, and convex optimization is beautiful, and they stand quite independent.

Profile photo for Quora User

Intuitively, a kernel is just a transformation of your input data that allows you (or an algorithm like SVMs) to treat/process it more easily. Imagine that we have the toy problem of separating the red circles from the blue crosses on a plane as shown below.

Our separating surface would be the ellipse drawn in the left figure. However, transforming our data into a 3-dimensional space through the mapping shown in the figure would make the problem much easier since, now, our points are separated by a simple plane. This embedding into a higher dimension is called the kernel trick.

In conclusion, and very informally, a kernel amounts to embedding general points into an inner product space.


PS: I have taken this graph from http://www.sussex.ac.uk/Users/christ/crs/ml/lec08a.html, which also appears in Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning.

Profile photo for Rahul Agarwal

Found this on Reddit: Please explain Support Vector Machines (SVM) like I am a 5 year old. • /r/MachineLearning [ https://www.reddit.com/r/MachineLearning/comments/15zrpp/please_explain_support_vector_machines_svm_like_i ]

Simply the best explanation of SVM I ever found.
----------------------------------------------------------------------------------------------


> We have 2 colors of balls on the table that we want to separate.

We get a stick and put it on the table; this works pretty well, right?

Some villain comes and places more balls on the table. It kind of works, but one of the balls is on the wrong side, and there is probably a better place to put the stick now.

SVMs try to put the stick in the best possible place by having as big a gap on either side of the stick as possible.

Now when the villain returns the stick is still in a pretty good spot.

There is another trick in the SVM toolbox that is even more important. Say the villain has seen how good you are with a stick so he gives you a new challenge.

There’s no stick in the world that will let you split those balls well, so what do you do? You flip the table of course! Throwing the balls into the air. Then, with your pro ninja skills, you grab a sheet of paper and slip it between the balls.

Now, looking at the balls from where the villain is standing, the balls will look split by some curvy line.

Boring adults call the balls data, the stick a classifier, the biggest-gap trick optimization, flipping the table kernelling, and the piece of paper a hyperplane.
-----------------------------------------------------------------------------------------------

Now see this:

--------------------------------------------------------------------------------------------
One other point that I like to mention about the SVM (unrelated to this question) is how it is defined by the boundary case examples. (taken from CS109 c...

Profile photo for Assistant
Assistant

In machine learning, particularly in the context of Support Vector Machines (SVM), kernels are mathematical functions used to enable algorithms to operate in high-dimensional spaces without explicitly transforming the data into those spaces. This is particularly useful for tasks like classification and regression.

What are Kernels?

  1. Definition: A kernel is a function that computes the inner product of two vectors in a high-dimensional feature space, allowing SVMs to find hyperplanes that can separate classes effectively.
  2. Types of Kernels (a short scikit-learn sketch follows this list):
    - Linear Kernel: No transformation; used when the data is linearly separable.
    - Polynomial Kernel: Computes polynomial combinations of the input features, allowing for curved decision boundaries.
    - Radial Basis Function (RBF) Kernel: A popular choice, it maps input features into an infinite-dimensional space, allowing for very flexible decision boundaries.
    - Sigmoid Kernel: Based on the sigmoid function, it can be used in neural networks.
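
For illustration only, here is how those kernel choices might look in a short scikit-learn sketch; the toy dataset and parameter values are assumptions, not part of the original answer.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: a classic dataset that is not linearly separable in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale")
    clf.fit(X, y)
    print(kernel, clf.score(X, y))  # training accuracy; RBF typically separates the circles best
```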

Why Do We Need Kernels?

  1. Non-Linearity: Many real-world datasets are not linearly separable in their original space. Kernels allow SVMs to create non-linear decision boundaries by implicitly mapping input features into higher-dimensional spaces.
  2. Computational Efficiency: Directly transforming data into high dimensions can be computationally expensive and infeasible. Kernels allow SVMs to operate in these spaces without the need for explicit transformations, using the "kernel trick."
  3. Flexibility: Different kernels can be chosen based on the nature of the data and the problem at hand, providing flexibility in model selection and complexity.
  4. Improved Performance: By using an appropriate kernel, SVMs can achieve better classification performance, especially in complex datasets with intricate relationships between features.

Summary

In summary, kernels are essential in SVM and other machine learning algorithms for handling non-linear relationships in data efficiently. They enable the creation of flexible models that can adapt to a variety of data distributions, ultimately improving the performance of machine learning tasks.


Profile photo for Prasoon Goyal

A2A.

Intuitively, a kernel function measures the similarity between two data points. The notion of similarity is task-dependent. So, for instance, if your task is object recognition, then a good kernel will assign a high score to a pair of images that contain the same objects, and a low score to a pair of images with different objects. Note that such a kernel captures a much more abstract notion of similarity than a similarity function that just compares two images pixel-by-pixel. As another example, consider a text processing task. There, a good kernel function would assign a high score to a pair of similar strings, and a low score to a pair of dissimilar strings. This is the advantage of kernel functions: they can, in principle, capture arbitrarily complex notions of similarity, which can be used in various ML algorithms.

Thus, formally, a kernel function [math]K[/math] takes two data points [math]x_{i}[/math] and [math]x_{j}[/math] [math]\in \mathbb{R}^{d}[/math], and produces a score, which is a real number, i.e., [math]K : \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}[/math].


In order to use kernel functions in most common ML algorithms, such as SVMs, there are additional requirements on the kernel functions, so that they can fit into the framework. Consider the SVM dual optimization problem:

[math]\max_{\alpha} 1^T\alpha - \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} x_{i}^Tx_{j}[/math]

subject to some constraints that are not important for the current discussion.

Note that the optimal value of [math]\alpha[/math] will depend on [math]x_{i}^T x_{j}[/math] for all pairs of data points [math](x_{i}, x_{j})[/math]. Also, [math]x_{i}^T x_{j}[/math] is a crude similarity measure between [math](x_{i}, x_{j})[/math]. More generally, you may want to map your data to a new space where [math]x_{i}[/math] is mapped to [math]\phi(x_{i})[/math] and [math]x_{j}[/math] is mapped to [math]\phi(x_{j})[/math]. Then, your similarity function should give you [math]\phi(x_{i})^T\phi(x_{j})[/math]. For [math]x_{i}^Tx_{j}[/math], the mapping [math]\phi(\cdot)[/math] is the identity function. [See here for a discussion on mapping in SVMs — Prasoon Goyal's answer to In layman's terms, how does SVM work?]

Hence, defining [math]K(x_{i}, x_{j}) = \phi(x_{i})^T\phi(x_{j})[/math], and putting back in the original optimization problem gives you

[math]\max_{\alpha} 1^T\alpha - \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} K(x_{i}, x_{j})[/math]

Now you can plug in any function [math]\mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}[/math] as the kernel function in this optimization problem as long as there is some function [math]\phi(\cdot)[/math] for which [math]K(x_{i}, x_{j}) = \phi(x_{i})^T\phi(x_{j})[/math]. Note that you do not need to know what [math]\phi(\cdot)[/math] is; just the knowledge that it exists is sufficient. This is where Mercer’s theorem comes into play — it states that a kernel function [math]K(x_{i}, x_{j})[/math] has a representation [math]\phi(x_{i})^T\phi(x_{j})[/math] for some [math]\phi(\cdot)[/math] if and only if it is positive semi-definite and symmetric. [I’ll skip the technical definitions here to avoid digressing from the main idea of the question; if you have specific queries on this, please post a separate question.]

The above idea is called the Kernel trick — whenever your optimization problem depends on [math]x_{i}, x_{j}[/math] only in the form of inner products [math]x_{i}^T x_{j}[/math], you can replace [math]x_{i}^T x_{j}[/math] by [math]K(x_{i}, x_{j})[/math], where [math]K(\cdot, \cdot)[/math] satisfies Mercer’s theorem.
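
As a small illustration of the Mercer condition on a finite sample (a sketch with made-up data, not part of the original answer), the Gram matrix of a valid kernel evaluated on any set of points should be symmetric and positive semi-definite:

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

X = np.random.RandomState(0).randn(20, 3)             # 20 illustrative points in R^3
G = np.array([[rbf(a, b) for b in X] for a in X])      # Gram matrix G[i, j] = K(x_i, x_j)

print(np.allclose(G, G.T))                             # True: symmetric
print(np.linalg.eigvalsh(G).min() >= -1e-10)           # True: eigenvalues >= 0 (up to round-off)
```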


Note that in the first segment of the answer, we look at what a kernel function is, while in the second segment, we look at the conditions on kernel functions that can be used in algorithms like SVMs. It is important to note that a function [math]K(x_{i}, x_{j})[/math] can still be a kernel function even if it is not positive semi-definite and symmetric; it just cannot be used in algorithms that require these conditions. There are algorithms that can make use of such kernels, for instance, see Learning with Non-Positive Kernels, although these methods are relatively less common.

Profile photo for Soumendra Prasad Dhanee
  1. For a dataset with n features (~n-dimensional), SVMs find an n-1 dimensional hyperplane to separate it (let us say for classification)
  2. Thus, SVMs perform very badly with datasets that are not linearly separable
  3. But, quite often, it’s possible to transform our not-linearly-separable dataset into a higher-dimensional dataset where it becomes linearly separable, so that SVMs can do a good job
  4. Unfortunately, quite often, the number of dimensions you have to add (via transformations) depends on the number of dimensions you already have (and not linearly :( )
    1. For datasets with a lot of features, it becomes next to impossible to try out all the interesting transformations
  5. Enter The Kernel Trick
    1. Thankfully, the only thing SVMs need to do in the (higher-dimensional) feature space (while training) is computing the pair-wise dot products
    2. For a given pair of vectors (in a lower-dimensional feature space) and a transformation into a higher-dimensional space, there exists a function (The Kernel Function) which can compute the dot product in the higher-dimensional space without explicitly transforming the vectors into the higher-dimensional space first
    3. We are saved!
  6. SVM can now do well with datasets that are not linearly-separable
Profile photo for Hasan Poonawala

SVMs depend on two ideas: VC dimension and optimization. Given points in the plane and the fact that they are separable (Nikhil's answer), there are infinitely many lines (half-planes in 2D) that would separate them. The SVM finds the best one by solving an optimization problem over those separating half-planes. The nice thing is that it is a convex quadratic programming problem, which is fast and easy to solve.

The VC dimension is related to a fact we took for granted: are the training points separable? Well, when it comes to points in 2D, the maximum number of points that are guaranteed to be separable is 3, which is precisely the VC dimension. Note the diagram in Nikhil's answer projects the three points on a line to 2D, and he drew a classifier that works. The VC dimension depends on the classifiers (hyperplanes in the case of SVMs) and the dimension in which our data lies (2D in this case). Now if the points are in n dimensions, the VC dimension is n+1. So instead, if we have m points, we need the points to lie in a space of dimension at least (m-1). When this happens, we are sure to find a solution to the optimization we perform. This is where the kernel comes in. We project the m points lying in n dimensions, n < (m-1), into an (m-1)-dimensional space so that we are guaranteed to find a linear classifier in that dimension.

To make it concrete, as in Nikhil's example, we had three points in 1 dimension. Not solvable. So we project it into a space of at least (3-1)=2 dimensions, and we will find a solution.

Edit:
As Yan King Yin points out, three points in a plane cannot be separated if they are collinear. VC dimension excludes these cases (measure zero sets), otherwise half planes couldn't separate anything.

Edit 2:

Since people are still looking at this answer, and the Kernels part is hidden in a comment, I’m pulling it into the answer:

We have a set of data-points that cannot be linearly separated (related to VC dimension of half-planes). Using a nonlinear classifier to separate them is painful.

The linear SVM algorithm really only needs to take the inner product of the points it's trying to classify. This is because half-planes can be determined using inner products, and SVMs are looking for half-planes.

So we want to do two things: we want to map all the data points from the space they come from into a useful inner product space, which will usually be of higher dimension (usefulness is related to separability, which is related to VC dimension). Then we want to use a linear SVM to classify the points in this space using a unique half-plane. This means we want to take inner products of the transformed data points.

A kernel function does precisely this, but in one single step. You do not bother with finding the transformed points, since you already know what their dot product will be thanks to the kernel function.

This is the kernel trick: use a kernel function to convert a nonlinear classification into a linear one (if possible).

First off, you save a HUGE amount of time because this is a convex quadratic program and not some arbitrary nonlinear optimization over weirdly distributed points.
Second, you save time and space by not mapping the points into the transformed space, but directly going from two points in the original space to the dot product in the higher dimensional space.

The linear classification performed in that space actually becomes a nonlinear classification in the original space.

Profile photo for Shashank Gupta

SVM in its general setting works by finding an "optimal" hyperplane which separates two point clouds of data. But this is limited in the sense that it only works when the two point clouds (each corresponding to a class) can be separated by a hyperplane. What if the separating boundary is non-linear?

This is an image I found on the web. It is a perfect example of the limitation of SVM in its general setting. SVM will learn the red line from the data points, which we can clearly see is not optimal. The optimal boundary is the green curve, which is a non-linear structure.

To overcome this limitation we use the concept of kernels. Given a test point, kernels will fit a curve to the data points, specifically to points that are 'close' to the test point. So, when using an RBF kernel, it will fit a Gaussian distribution over points 'nearby' the test point. Closeness is defined using a standard distance metric.

Suppose this is the function we are trying to approximate. The yellow region is the Gaussian fit over the training points that are close to the test point (assume the test point is the one at the center of the Gaussian). So it approximates the function (the separating boundary) using piece-wise Gaussians (for the RBF kernel), and it uses only a sparse subset of the training data (the support vectors) to do so.

This is the intuition for using kernels in SVM. Of course, there are nice theoretical arguments for using kernels, like the projection of data points into an infinite-dimensional space where they become linearly separable, but this is the intuition that I have for using kernels. HTH


Profile photo for Gillis Danielsen

Here is an even shorter visualization of what an SVM does. Basically, the trick is to look for a projection to a higher-dimensional space where there is a linear separation of the data (others have posted much more detailed and correct answers, so I just wanted to share the vid, not made by me).

Profile photo for Omar R

Part 1. Why do we need Kernels?

Answer: Assuming that you have heard of the Curse of Dimensionality, there is also a concept of the Blessing of Dimensionality, which says that in higher dimensions the structure of data becomes easier to analyse than in lower dimensions. In other words, we assume that in a classification problem, if different classes cannot be separated by a linear boundary, then, if we project the data into a higher-dimensional space, a linear decision boundary may be achievable. Consider the image below (from Google Images) to get a clear idea of how a nonlinear decision boundary can be converted to a linear decision boundary by projecting the data into a higher-dimensional space.
A kernel may be used to achieve this transformation (a toy sketch of this idea follows).
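
Here is a hedged toy example of that projection: points inside and outside a circle are not linearly separable in 2-D, but adding the extra feature z = x1^2 + x2^2 makes them separable by a plane. The data and threshold below are illustrative assumptions, not from the original answer.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)    # class 1 = inside the unit circle

# Lift each 2-D point to 3-D: (x1, x2, x1^2 + x2^2)
Z = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])

# In the lifted space, the horizontal plane z = 1 separates the two classes exactly.
pred = (Z[:, 2] < 1.0).astype(int)
print((pred == y).mean())                               # 1.0
```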

Part 2: What are Kernels?

Answer:

  • Consider two points/vectors, [math]X[/math] and [math]Y[/math], in the given d-dimensional space (say d = 2).
  • Consider a mapping [math]\phi[/math], which transforms [math]X[/math] and [math]Y[/math] into a higher dimension (say 3). [math]\phi[/math] is up to you to choose.
  • Consider a kernel [math]K[/math]. Note that [math]K[/math] is computed in the original (2-dimensional) space.

Then the way to calculate [math]K[/math] is:

  1. Find [math]\phi(X)[/math] and [math]\phi(Y)[/math].
  2. Find their dot product, i.e. [math]\langle \phi(X), \phi(Y) \rangle[/math].
  3. [math]\langle \phi(X), \phi(Y) \rangle[/math] gives us [math]K(X, Y)[/math].

A kernel is a general concept and can be used in many algorithms that are linear in nature when the data is 'non-linear'.

Profile photo for Abhishek Shivkumar

I know that pasting a link to an external talk would not be an appropriate answer here, but this talk is just so awesome and answers your question so intuitively that I encourage you to watch it

http://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CFgQFjAA&u...

Profile photo for Sreejith Menon

Another famous question that I have encountered which is related to the topic is the computational complexity for predictions using SVM with and without the kernel trick.

Say the original feature space has |X| features. The new feature space has |F| dimensions. A kernel function takes k time to evaluate. Further there are m training examples and |S| support vectors.

Without using the kernel trick: The time taken would be O(|F|).

Explanation: For one testing example we would have to actually compute the term sign(w0 + w*x) which would take time O(|F|).

With the kernel trick: Time taken would be O(k*|S|)

Explanation: We would simply compute the dot product of testing example with that of the support vectors. The number of support vectors times the time taken to execute the kernel function once would be the computational complexity.
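
A hedged sketch of the two prediction routes described above; the notation follows this answer, but the function names and data layout are illustrative assumptions.

```python
import numpy as np

def predict_primal(w, b, phi_x):
    # O(|F|): one dot product with the weight vector in the (possibly huge) feature space
    return np.sign(np.dot(w, phi_x) + b)

def predict_kernel(support_vectors, alphas, labels, b, x, kernel):
    # O(k * |S|): one kernel evaluation per support vector, never forming phi(x)
    s = sum(a * yi * kernel(sv, x) for sv, a, yi in zip(support_vectors, alphas, labels))
    return np.sign(s + b)
```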

Please correct my answer if I am wrong here.

Profile photo for Gopal Malakar

This is one of the easiest explanations of SVM and the kernel trick.

Profile photo for Ashkon Farhangi

To add to the other answers, an RBF kernel, perhaps the most commonly used kernel, acts essentially as a low-pass filter that prefers smoother models. For a full mathematical justification of this fact, check out Charles Martin's explanation.

Profile photo for Balaji Pitchai Kannu

It is not possible to find a hyperplane or a linear decision boundary for some classification problems. If we project the data in to a higher dimension from the original space, we may get a hyperplane in the projected dimension that helps to classify the data.

As shown in the above figure, it is impossible to find a single line to separate the two classes (green and blue) in the input space. But, after projecting the data into a higher dimension (i.e. the feature space in the figure), we are able to find a hyperplane which classifies the data. The kernel helps to find a hyperplane in the higher-dimensional space without increasing the computational cost much. Usually, the computational cost increases as the dimension of the data increases.

How come Kernel doesn’t increase the computational complexity?

We know that the dot product of two vectors of the same dimension gives a single number. The kernel utilizes this property to compute the dot product in a different space without even visiting that space.

Assume that we have two features. It means that each data point lies in [math]\mathbb R^2.[/math]

[math]x_{i} = \begin{bmatrix} x_{i1} \\ x_{i2} \end{bmatrix} \tag*{}[/math]

The subscript i in x indexes the data points; similarly, 1 and 2 in the subscript denote the features. Also, assume that we are applying some transformation function to convert the two-dimensional input space (two features) into a four-dimensional feature space, namely [math](x_{i1}^{2}, x_{i1} x_{i2}, x_{i2} x_{i1}, x_{i2}^{2}).[/math] It requires [math]\mathcal{O}(n^{2})[/math] time to compute this for n data points in the four-dimensional space. To calculate the dot product of two vectors in the four-dimensional (transformed) space, the standard way is:

1. Convert each data point from [math]\mathbb R^2 \to \mathbb R^4[/math] by applying the transformation. (I have taken two data points [math]x_{i}[/math] and [math]x_{j}[/math].)

[math]\phi(x_{i}) = \begin{bmatrix} x_{i1}^{2}\\ x_{i1} x_{i2} \\ x_{i2} x_{i1}\\ x_{i2}^{2} \end{bmatrix} \hspace{2cm} \phi(x_{j}) = \begin{bmatrix} x_{j1}^{2}\\ x_{j1} x_{j2} \\ x_{j2} x_{j1}\\ x_{j2}^{2} \end{bmatrix}[/math]

2. Take the dot product of the two vectors.

[math]\phi(x_{i}).\phi(x_{j}) \tag*{}[/math]

As I said before, Kernel function calculates the dot product in the different space without even visiting it. Kernel function for the above transformation is

[math]K(x_{i},x_{j}) = (x_{i}^{T}x_{j})^{2} \tag{1}[/math]

Example:

Let say [math]x_{i} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \hspace{0.5cm}\text{and} \hspace{0.5cm}x_{j} = \begin{bmatrix} 3 \\ 5 \end{bmatrix}.[/math]

The dot product in the four dimensional space by the standard way is

[math]= \begin{bmatrix} 1 \\ 2 \\ 2 \\ 4 \end{bmatrix} \cdot \begin{bmatrix} 9\\ 15 \\ 15 \\ 25 \end{bmatrix} = 9+30+30+100 = 169 \tag*{}[/math]

The above dot product can be calculated using the above kernel function (equation 1) without even transforming the original space.

[math] K(x_{i},x_{j}) = \left(\begin{bmatrix} 1 \\ 2 \end{bmatrix}^{T} \cdot \begin{bmatrix} 3 \\ 5 \end{bmatrix}\right)^{2} = (3 + 10)^{2} = 13^{2} = 169 \tag*{}[/math]
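
A quick NumPy check of these numbers (a sketch, not part of the original answer), comparing the explicit R^2 -> R^4 map with the kernel shortcut:

```python
import numpy as np

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 5.0])

def phi(v):
    # explicit map R^2 -> R^4: (v1^2, v1*v2, v2*v1, v2^2)
    return np.array([v[0] ** 2, v[0] * v[1], v[1] * v[0], v[1] ** 2])

print(np.dot(phi(xi), phi(xj)))   # 169.0 -- via the four-dimensional space
print(np.dot(xi, xj) ** 2)        # 169.0 -- via the kernel (x_i^T x_j)^2, staying in R^2
```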

Profile photo for Sebastian Raschka

Given an arbitrary dataset, you typically don't know which kernel may work best. I recommend starting with the simplest hypothesis space first -- given that you don't know much about your data -- and work your way up towards the more complex hypothesis spaces.

So, the linear kernel works fine if your dataset is linearly separable; however, if your dataset isn't linearly separable, a linear kernel isn't going to cut it (almost in a literal sense ;)).

For simplicity (and visualization purposes), let's assume our dataset consists of 2 dimensions only. Below, I plotted the decision regions of a linear SVM on 2 features of the iris dataset:

This works perfectly fine. And here comes the RBF kernel SVM:

Now, it looks like both linear and RBF kernel SVMs would work equally well on this dataset. So, why prefer the simpler, linear hypothesis? Think of Occam's Razor in this particular case. A linear SVM is a parametric model, an RBF kernel SVM isn't, and the complexity of the latter grows with the size of the training set. Not only is it more expensive to train an RBF kernel SVM, but you also have to keep the kernel matrix around, and the projection into this "infinite" higher-dimensional space where the data becomes linearly separable is more expensive as well during prediction. Furthermore, you have more hyperparameters to tune, so model selection is more expensive as well! And finally, it's much easier to overfit a complex model!

Okay, what I've said above sounds all very negative regarding kernel methods, but it really depends on the dataset. E.g., if your data is not linearly separable, it doesn't make sense to use a linear classifier:

In this case, an RBF kernel would make so much more sense:

In any case, I wouldn't bother too much about the polynomial kernel. In practice, it is less useful, for both computational efficiency and predictive performance reasons. So, the rule of thumb is: use linear SVMs (or logistic regression) for linear problems, and nonlinear kernels such as the Radial Basis Function kernel for non-linear problems.

Btw. the RBF kernel SVM decision region is actually also a linear decision region. What RBF kernel SVM actually does is to create non-linear combinations of your features to uplift your samples onto a higher-dimensional feature space where you can use a linear decision boundary to separate your classes:

Okay, above, I walked you through an intuitive example where we can visualize our data in 2 dimensions ... but what do we do in a real-world problem, i.e., a dataset with more than 2 dimensions? Here, we want to keep an eye on our objective function: minimizing the hinge-loss. We would setup a hyperparameter search (grid search, for example) and compare different kernels to each other. Based on the loss function (or a performance metric such as accuracy, F1, MCC, ROC auc, etc.) we could determine which kernel is "appropriate" for the given task. I've some more posts here if it helps: How do I evaluate a model? | A Basic Pipeline and Grid Search Setup via scikit-learn: Jupyter Notebook Viewer
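
For what it's worth, a grid search over kernels of the kind described above might look like the following hedged scikit-learn sketch; the dataset and parameter grid are illustrative assumptions, not the author's actual setup.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.15, random_state=0)

# Compare a linear kernel against an RBF kernel, tuning C (and gamma for RBF).
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```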

Profile photo for Prasoon Goyal

A2A.

As others have pointed out, there’s no way to figure out which kernel would do the best for a particular problem. The only way to choose the best kernel is to actually try out all possible kernels, and choose the one that does the best empirically. However, we can still look at some differences between various kernel functions, to have some rules of thumb.

Let’s start by listing the kernel functions:

  • Linear: [math]K(x, y) = x^Ty[/math]
  • Polynomial: [math]K(x, y) = (x^Ty + 1)^d[/math]
  • Sigmoid: [math]K(x, y) = tanh(a x^Ty + b)[/math]
  • RBF: [math]K(x, y) = \exp(-\gamma \| x - y\|^2)[/math]

Now, let’s look at some differences:

  • Translation invariance: RBF kernel is the only kernel out of the above that is translation invariant, that is, [math]K(x, y) = K(x + t, y + t)[/math], where t is any arbitrary vector. Intuitively, this property is useful — if you imagine all your data lying in some space, then the similarity between the points should not change if you shift the entire data, without changing the relative positions of the points.
  • Inner product vs Euclidean distance: Related to the above point, RBF kernel is a function of the Euclidean distance between the points, whereas all other kernels are functions of inner product of the points. Again, it makes more intuitive sense to have Euclidean distance — points that are closer should be more similar. If two points are close to the origin, but on opposite sides, then the inner product based kernels assign the pair a low value, but Euclidean distance based kernels assign the pair a high value. It is, however, important to note that for some applications, inner product is sometimes the more preferred similarity metric, like in bag-of-words vectors, because you care more about the direction of the vectors (which words appear in both the document vectors) rather than the actual counts.
  • Normalized: A kernel is said to be normalized if [math]K(x, x) = 1[/math] for all [math]x[/math]. This is true for only RBF kernel in the above list. Again, intuitively, you want this property to hold — if [math]x[/math] and [math]x[/math] have a similarity of [math]\lambda[/math], then [math]2x[/math] and [math]2x[/math] should also have a similarity of [math]\lambda[/math]. You can convert an arbitrary kernel [math]K(x, y)[/math] to a normalized kernel [math]\tilde{K}(x, y)[/math] by defining [math]\tilde{K}(x, y) = \dfrac{K(x, y)}{\sqrt{K(x, x)} \sqrt{K(y, y)}}[/math]. (Also, as a side note, RBF kernel is the normalized kernel for the exponential kernel, [math]K(x, y) = \exp(x^Ty)[/math].)

These properties tend to make RBF kernel better in general, for most problems. And because it does the best empirically, it tends to be most widely used. However, just to reiterate, depending on the nature of the problem, it is possible that one of the other kernels does better than RBF kernel.
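
As a small illustration of the normalization point in the last bullet above (a sketch with an illustrative kernel choice, not part of the original answer):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return (np.dot(x, y) + 1.0) ** d

def normalized(kernel, x, y):
    # K~(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y))
    return kernel(x, y) / np.sqrt(kernel(x, x) * kernel(y, y))

x = np.array([1.0, 2.0])
print(normalized(poly_kernel, x, x))          # 1.0, as a normalized kernel requires
print(normalized(poly_kernel, 2 * x, 2 * x))  # still 1.0: scaling x no longer changes self-similarity
```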

Profile photo for Quora User

Whatever you want it to be.

You can use whatever inner product you want, and you will get different results accordingly. Most software packages will just use the standard dot product unless you specify that a different kernel should be used.

If you want to know how the kernel influences the model, you really should read tutorials on the mathematics behind SVM. I’ll try to summarize it, but you should realize that this is a very rough summary:

We want to find the best hyperplane that separates the two classes of our data. Here “best” means: ideally it has all the positive examples on one side, and all negative examples on the other side; and the samples closest to the line are not too close — we want the maximum margin. If such a line (hyperplane) does not exist, we want the wrongly labeled samples to lie close to the line.

Now, instead of just using a hyperplane in our original space, we could project our data into a higher-dimensional space. For example, if our data has [math]x[/math] and [math]y[/math] coordinates, we can project each point to [math](x^2, y^2, xy, x, y)[/math] in 5-dimensional space. This is really cool: a hyperplane in this space corresponds to any conic section we want in our original space. So now we can have models that say "every point within this particular ellipse is positive, everything else is negative".
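
To see the conic-section point concretely, here is a hedged NumPy sketch (the ellipse parameters are illustrative assumptions): the ellipse test (x/a)^2 + (y/b)^2 <= 1 is nonlinear in (x, y) but becomes a linear, half-space test on the projected coordinates (x^2, y^2, xy, x, y).

```python
import numpy as np

a, b = 2.0, 1.0
rng = np.random.RandomState(0)
pts = rng.uniform(-3, 3, size=(500, 2))

def project(p):
    x, y = p
    return np.array([x ** 2, y ** 2, x * y, x, y])

w = np.array([1 / a ** 2, 1 / b ** 2, 0.0, 0.0, 0.0])   # a fixed weight vector in the 5-D space

vals_linear = np.array([np.dot(w, project(p)) for p in pts])   # linear function of the projection
vals_direct = (pts[:, 0] / a) ** 2 + (pts[:, 1] / b) ** 2      # the original nonlinear ellipse test
print(np.allclose(vals_linear, vals_direct))                   # True: thresholding at 1 gives the same ellipse
```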

It would be annoying if we had to do the projections for all the points manually. That's where the kernel comes in: we don't need to. We just need the corresponding kernel function that, for two given points, gives us the dot product in our projected 5-dimensional space. That's what we call the "kernel trick".

In fact, we do not even really care about what space we are mapping to. As long as the kernel function is a valid inner product (see Inner product space - Wikipedia) we’re good to go. Sometimes this even corresponds to a mapping to an infinite-dimensional space.

Footnotes

Profile photo for Balaji Pitchai Kannu

It is not possible to find a hyperplane or a linear decision boundary for some classification problems. If we project the data in to a higher dimension from the original space, we may get a hyperplane in the projected dimension that helps to classify the data.

As we shown in the above figure, it is impossible to find a single line to separate the two classes (green and blue) in the input space. But

It is not possible to find a hyperplane or a linear decision boundary for some classification problems. If we project the data in to a higher dimension from the original space, we may get a hyperplane in the projected dimension that helps to classify the data.

As we shown in the above figure, it is impossible to find a single line to separate the two classes (green and blue) in the input space. But, after projecting the data in to a higher dimension (i.e. feature space in the figure), we could able to find the hyperplane which classifies the data. Kernel helps to find a hyperplane in the higher dimensional space without increasing the computational cost much. Usually, the computational cost will increase, if the dimension of the data increases.

How come Kernel doesn’t increase the computational complexity?

We knew that the dot product of same dimensional two vectors gives a single number. Kernel utilizes this property to compute the dot product in a different space without even visiting the space.

Assume that, we have two features. It means that dimension of the data point is [math]\mathbb R^2. [/math]

[math]x_{i} = \begin{bmatrix} x_{i1} \\ x_{i2} \end{bmatrix} \tag*{}[/math]

i in the x subscript represent the data points. Similarly, 1 and 2 in the x subscript denote the features. Also, assume that we are applying some transformation function to convert two dimensional input space (two features) in to four dimensional feature space which is [math](x_{i1}^{2}, x_{i1} x_{i2}, x_{i2} x_{i1}, x_{i2}^{2}).[/math] It requires [math]\mathcal{O}(n^{2})[/math] time to calculate n data points in the four dimensional space. To calculate the dot product of two vectors in the four dimensional space/transformed space, the standard way is

1. Convert each data point from [math] \mathbb R^2 \to \mathbb R^4 [/math]by applying the transformation. (I have taken two data points [math]x_{i}[/math] and [math]x_{j}[/math])

[math]\phi(x_{i}) = \begin{bmatrix} x_{i1}^{2}\\ x_{i1} x_{i2} \\ x_{i2} x_{i1}\\ x_{i2}^{2} \end{bmatrix} \hspace{2cm} \phi(x_{j}) = \begin{bmatrix} x_{j1}^{2}\\ x_{j1} x_{j2} \\ x_{j2} x_{j1}\\ x_{j2}^{2} \end{bmatrix} \tag*{}[/math]

2. Take the dot product of the two vectors.

[math]\phi(x_{i}) \cdot \phi(x_{j}) \tag*{}[/math]

As I said before, the kernel function calculates the dot product in the transformed space without ever visiting it. The kernel function for the above transformation is

[math]K(x_{i},x_{j}) = (x_{i}^{T}x_{j})^{2} \tag{1}[/math]

Example:

Let's say [math]x_{i} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \hspace{0.5cm}\text{and} \hspace{0.5cm}x_{j} = \begin{bmatrix} 3 \\ 5 \end{bmatrix}.[/math]

The dot product in the four-dimensional space, computed the standard way, is

[math]= \begin{bmatrix} 1 \\ 2 \\ 2 \\ 4 \end{bmatrix} \cdot \begin{bmatrix} 9\\ 15 \\ 15 \\ 25 \end{bmatrix} = 9+30+30+100 = 169 \tag*{}[/math]

The above dot product can be calculated using the kernel function (equation 1) without ever transforming into the higher-dimensional space.

[math] K(x_{i},x_{j}) = \left( \begin{bmatrix} 1 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 5 \end{bmatrix} \right)^{2} = (3 + 10)^{2} = 13^{2} = 169 \tag*{}[/math]
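As a quick sanity check (a small sketch of mine, not part of the original answer), the same numbers can be verified in a few lines of Python:

    import numpy as np

    x_i = np.array([1.0, 2.0])
    x_j = np.array([3.0, 5.0])

    # Standard way: map to R^4 first, then take the dot product.
    phi = lambda v: np.array([v[0]**2, v[0]*v[1], v[1]*v[0], v[1]**2])
    standard = np.dot(phi(x_i), phi(x_j))   # 9 + 30 + 30 + 100 = 169

    # Kernel way: stay in R^2 and square the ordinary dot product.
    kernel = np.dot(x_i, x_j) ** 2          # (3 + 10)^2 = 169

    print(standard, kernel)                 # 169.0 169.0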

Profile photo for Yoshua Bengio

Basically, a kernel-based SVM requires on the order of n^2 computation for training and order of nd computation for classification, where n is the number of training examples and d the input dimension (and assuming that the number of support vectors ends up being a fraction of n, which is shown to be expected in theory and in practice). Instead, a 2-class linear SVM requires on the order of nd computation for training (times the number of training iterations, which remains small even for large n) and on the order of d computations for classification. So when the number of training examples is large (e.g. millions of documents, images, or customer records) then kernel SVMs are too expensive for training and very expensive for classification (unless one uses one of the approximations that have been proposed but have not yet become widely used as far as I know; someone more expert in this area could correct me). In the case of sparse inputs, linear SVMs are also convenient because d in the above can be replaced by the average number of non-zeros in each example input vector.
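As a rough, hedged illustration of this cost difference (my sketch, not the answerer's; exact timings depend on hardware and hyperparameters), scikit-learn's LinearSVC and kernelized SVC can be timed on the same synthetic data:

    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC, SVC

    X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

    for name, clf in [("linear", LinearSVC(dual=False)),
                      ("rbf kernel", SVC(kernel="rbf"))]:
        t0 = time.perf_counter()
        clf.fit(X, y)
        print(f"{name:>10s}: trained in {time.perf_counter() - t0:.1f}s")
    # On typical hardware the kernelized SVC becomes dramatically slower as the
    # number of examples grows, consistent with the ~n^2 vs ~nd argument above.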

Profile photo for Shayne Miel

Two reasons. The first is that, depending on the type of kernel, it projects your data into a higher dimensional feature space. Sometimes into a space of infinite dimensions! If your data is not linearly separable in the original feature space, there's a good chance that it might be when projected into higher dimensions.

The second, perhaps more intuitive, reason is that it imbues each of your data points with information about the rest of the training set. The kernel can be viewed as a measure of similarity, so that your features become "how similar is instance 1 to instance 2?", "how similar is instance 1 to instance 3?", etc. This makes it easy for the classifier to say, "Instance 1 is a lot like these other instances who all have label 'A'. Perhaps instance 1 should also have label 'A'."
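One way to see this "similarity feature" view concretely (a sketch of mine, assuming scikit-learn's rbf_kernel helper) is to build the kernel (Gram) matrix explicitly: row i then literally contains "how similar is instance i to every training instance".

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(5, 3))      # 5 instances, 3 original features

    # Each row of K is a new feature vector for one instance:
    # its RBF similarity to every instance in the training set.
    K = rbf_kernel(X_train, X_train, gamma=0.5)
    print(K.shape)        # (5, 5)
    print(K[0])           # similarities of instance 0 to instances 0..4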

Profile photo for Sumit Soman

Support Vector Machines tend to find a linear decision boundary between points of two classes based on the maximum-margin principle, where the objective is to find a set of points which lie on two sides of the plane at a distance of at least unity. These points are the support vectors and the plane midway is the separating hyperplane which is always a linear plane.

Now, in practice, the dataset may not be linearly separable. Take the common example of a two-input XOR gate, with inputs x1 and x2 and output y. They are related as

x1  x2  |  y
0   0   |  0
0   1   |  1
1   0   |  1
1   1   |  0

Now if you plot these points in two dimensions with x1 and x2 as the features and y as the label, you can see that it is not possible to find a linear separating hyperplane that would separate the points of the two classes in this space of two dimensions.

Now, let us say we introduce a third dimension x3, which is computed as x3=(x1-x2)^2. The data projected in this high dimensional space will be

x1  x2  x3  |  y
0   0   0   |  0
0   1   1   |  1
1   0   1   |  1
1   1   0   |  0

Now if you visualize the data in this 3-dimensional space as shown below, you can see that it is linearly separable by a hyperplane.

I have not shown the hyperplane in the figure but it is easy to visualize several linear planes that can separate the above dataset. This is what kernels allow us to do. We can implicitly map the data to a higher dimensional space where it is linearly separable, and solve the SVM formulation in that space to obtain a linear decision boundary.
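The XOR construction above is easy to reproduce. The following sketch (mine, with an arbitrary choice of scikit-learn's LinearSVC) adds the x3 = (x1 - x2)^2 feature and checks that a linear classifier now separates the data perfectly:

    import numpy as np
    from sklearn.svm import LinearSVC

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])

    # In the original 2-D space a linear SVM cannot fit XOR ...
    print(LinearSVC(dual=False).fit(X, y).score(X, y))   # well below 1.0

    # ... but with the extra feature x3 = (x1 - x2)^2 it becomes separable.
    x3 = (X[:, 0] - X[:, 1]) ** 2
    X3 = np.column_stack([X, x3])
    print(LinearSVC(dual=False).fit(X3, y).score(X3, y))  # 1.0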

There can be several kernel functions. Those satisfying Mercer's conditions are used; for more details one can refer to the tutorial on SVMs by Burges.

Burges, Christopher JC. "A tutorial on support vector machines for pattern recognition." Data mining and knowledge discovery 2.2 (1998): 121-167.

Profile photo for Quora User

SVM algorithms use a set of mathematical functions specified by the kernel. The kernel function takes the data as input and transforms it into the required form. Various SVM algorithms use various kernel types. These can be all kinds of functions: for instance linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid functions.

Kernel functions can be defined for many kinds of data: sequences, graphs, text, images, and plain vectors. RBF is the most frequently used kernel type, since it has a localized and finite response along the entire x-axis.
The kernel function returns the inner product between two points in a suitable feature space, thereby defining a notion of similarity, even in very high-dimensional spaces, at little computational cost.


1. Kernel Rules

Specify the following kernel, or window, function:

[math]K(z) = \begin{cases} 1 & \text{if } \lVert z \rVert \le 1 \\ 0 & \text{otherwise} \end{cases} \tag*{}[/math]

This function equals 1 on the closed ball of radius 1 centred at the origin, and 0 outside it.

For a fixed [math]x_{i}[/math], the function [math]K\!\left(\frac{x - x_{i}}{h}\right)[/math] equals 1 on the closed ball of radius [math]h[/math] centred at [math]x_{i}[/math], and 0 outside it.

Therefore, you have shifted and rescaled the window by choosing [math]K(\cdot)[/math] to be centred at the point [math]x_{i}[/math] with radius [math]h[/math].
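A minimal sketch of that window function (my addition, assuming the Euclidean norm):

    import numpy as np

    def window_kernel(z):
        # K(z) = 1 inside the closed unit ball, 0 outside.
        return 1.0 if np.linalg.norm(z) <= 1.0 else 0.0

    def window_at(x, x_i, h):
        # K((x - x_i) / h): 1 inside the closed ball of radius h centred at x_i.
        return window_kernel((np.asarray(x) - np.asarray(x_i)) / h)

    print(window_at([0.4, 0.1], [0.0, 0.0], h=0.5))   # 1.0 (inside the ball)
    print(window_at([0.9, 0.1], [0.0, 0.0], h=0.5))   # 0.0 (outside the ball)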


2. SVM Kernels Examples :

Let's see some common SVM kernels and their uses:

2.1. Polynomial Kernel

It is common in image processing. Its equation is:

[math]k(x_{i}, x_{j}) = (x_{i} \cdot x_{j} + 1)^{d} \tag*{}[/math]

where d is the degree of the polynomial.

2.2. Gaussian Kernel

It is a general-purpose kernel, used when nothing more is known about the data. Its equation is:

[math]k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right) \tag*{}[/math]

2.3. Gaussian Radial Basis Function (RBF)

It is also a general-purpose kernel, used when no specific information about the data is available. Its equation is:

[math]k(x, y) = \exp\left(-\gamma \lVert x - y \rVert^{2}\right), \quad \text{for } \gamma > 0 \tag*{}[/math]

The parameter is often set to [math]\gamma = \frac{1}{2\sigma^{2}}[/math].

2.4. Laplace RBF Kernel

It is another general-purpose kernel, used when the data is not known in advance. Its equation is:

[math]k(x, y) = \exp\left(-\frac{\lVert x - y \rVert}{\sigma}\right) \tag*{}[/math]

2.5. Hyperbolic Tangent Kernel

We can use it in neural networks. Its equation is:

[math]k(x_{i}, x_{j}) = \tanh(\kappa\, x_{i} \cdot x_{j} + c), \quad \text{for some } \kappa > 0 \text{ and } c < 0 \tag*{}[/math]

2.6. Sigmoid Kernel

We may use this as a proxy for a neural network. Its equation is:

[math]k(x, y) = \tanh(\alpha\, x^{T} y + c) \tag*{}[/math]

2.7. Bessel Function of the First Kind Kernel

We may use it in mathematical functions to eliminate the cross term. Its equation is based on [math]J_{v+1}[/math], the Bessel function of the first kind.

2.8. ANOVA Radial Basis Kernel

We can use it in regression problems. Its equation is:

[math]k(x, y) = \sum_{k=1}^{n} \exp\left(-\sigma (x^{k} - y^{k})^{2}\right)^{d} \tag*{}[/math]

2.9. Linear Splines Kernel in One Dimension

It is helpful for handling large, sparse data vectors and is widely used in text categorization. The splines kernel also fits well in regression problems. Its equation is:

[math]k(x, y) = 1 + xy + xy\min(x, y) - \frac{x + y}{2}\min(x, y)^{2} + \frac{\min(x, y)^{3}}{3} \tag*{}[/math]
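For reference, here is a minimal Python sketch (my addition, not part of the list above) of a few of these kernels written directly from their equations; parameter values such as degree=3, gamma=0.5, sigma=1.0, alpha=0.01 and c=-1.0 are arbitrary illustrative choices.

    import numpy as np

    def linear(x, y):
        return np.dot(x, y)

    def polynomial(x, y, degree=3, c=1.0):
        return (np.dot(x, y) + c) ** degree

    def rbf(x, y, gamma=0.5):
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def laplace(x, y, sigma=1.0):
        return np.exp(-np.linalg.norm(x - y) / sigma)

    def sigmoid(x, y, alpha=0.01, c=-1.0):
        return np.tanh(alpha * np.dot(x, y) + c)

    x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
    for k in (linear, polynomial, rbf, laplace, sigmoid):
        print(k.__name__, k(x, y))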

So please feel free to discuss with me if you have any questions about SVM kernel functions. I am happy to answer them.

Profile photo for Bülent Koçer

I think you have probably found your answer by now. But I'm sure there are still some people who haven't been able to find an intuitive and simple answer to this question. I will explain it without any mathematics, using a few fruits.

When you are about to eat a strawberry, you can easily separate the red part and the green part simply by cutting with your teeth, because these parts are “linearly” separable.

On the other hand, the delicious part of a banana is inside its “shell”. Therefore, it’s not so simple to separate the delicious part and not-so-delicious part with a simple bite. So you need another “method” to make these parts linearly separable.

This is what the kernel function does in an SVM problem: it turns the linearly inseparable “fruit” into a yummy part and a part to be “discarded”.

Profile photo for Anonymous
Anonymous

You seem to be comparing apples and oranges. I am not sure what part is confusing for you, so I'll try to briefly cover everything.

Hard-margin
You have the basic SVM: hard margin. This assumes that the data is very well behaved, and you can find a perfect classifier, which will have 0 error on the training data.

Soft-margin
Data is usually not well behaved, so hard-margin SVM may not have a solution at all. We therefore allow a little bit of error on some points; the training error will not be 0, but the average error over all points is minimized.

Kernels
The above assumes that the best classifier is a straight line. But what if it is not a straight line? (e.g. it is a circle: inside the circle is one class, outside is another class). If we map the data into a higher dimension, the best classifier in that higher dimension may be a straight line.

There are many types of kernels that do this high dimensional mapping (Gaussian, Polynomial, etc.)


Solving SVM
When you solve the SVM optimization in the dual form (please see any material online for details), you realize that the solution depends on the training data points only through their dot products, and only some of the points (the support vectors) end up mattering.

For a traditional linear SVM (hard or soft margin), the dot product will be [math]x_1 \cdot x_2[/math]. This dot product is in the original space, so we call it a linear kernel.

For other kernels, this dot product will be computed as [math]\phi(x_1) \cdot \phi(x_2)[/math], where [math]\phi(\cdot)[/math] is the high-dimensional mapping.


This dot product can be obtained using kernel functions (polynomial, Gaussian, etc.)
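The point about the dual form can be made concrete with a small sketch (mine, with hypothetical toy values): once training has produced dual coefficients alpha_i, labels y_i, support vectors x_i, and a bias b, prediction only ever touches the data through K(x_i, x).

    import numpy as np

    def rbf(a, b, gamma=0.5):
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def decision_function(x, support_vectors, alphas, labels, b, kernel=rbf):
        # f(x) = sum_i alpha_i * y_i * K(x_i, x) + b  -- phi(x) is never computed.
        return sum(a * y * kernel(sv, x)
                   for a, y, sv in zip(alphas, labels, support_vectors)) + b

    # Toy values standing in for the output of a solved dual problem.
    svs = np.array([[0.0, 1.0], [2.0, 2.0]])
    alphas = np.array([0.7, 0.7])
    labels = np.array([+1, -1])
    print(np.sign(decision_function(np.array([0.2, 1.1]), svs, alphas, labels, b=0.0)))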

Profile photo for Sourav Chatterjee

Just to improve upon the earlier answer, Kernel functions are useful because of what is known as the "kernel trick".

If you have a solution to a problem that you can express in terms of inner products between the query vector and the training vectors (which is known as the dual form), that makes life much easier.
Why? Because once you write out the solution in this form, you can compute the inner product using any complicated kernel you want. The rest of the solution remains untouched, and your computations stay in the original low-dimensional space. You get away with computing just the kernel values themselves, which correspond to really high-dimensional spaces (even infinite-dimensional space, as in the case of the Gaussian kernel). Pretty neat!

Profile photo for Pramit Choudhary

SVM helps identify the hyperplane that best separates the input space according to the class labels. The performance of an SVM model often depends on the choice of kernel, which helps separate the data both linearly and non-linearly (by separating the data linearly in a higher-dimensional space).

Choice of Kernels:

  1. Linear
  2. Polynomial
  3. RBF (**my choice for non-linear decision boundaries. Low variance with high accuracy)
  4. Sigmoid (** may have large variance because it is not positive semi-definite (non-PSD), which might lead to an incorrect approximation)

Polynomial and RBF are particularly useful when the data points are not linearly separable. However, polynomial kernels are difficult to tune and can get computationally expensive. From my understanding, using a sigmoid kernel is similar to using a 2-layer perceptron (sigmoid functions are used as activation functions).

Profile photo for Luis Argerich

First: why is the RBF kernel the most widely used? Because an SVM is intrinsically a linear separator. When the classes are not linearly separable, we can project the data into a high-dimensional space and, with high probability, find a linear separation there; this is Cover's theorem. The RBF kernel does exactly that: it (implicitly) projects the data into infinitely many dimensions and then finds a linear separation.

The linear kernel works great when you have a lot of features, because then chances are your data is already linearly separable and an SVM will find the best separating hyperplane. Linear kernels are therefore great for very sparse data like text.

When the data is not linearly separable, the first choice is usually an RBF kernel, because it is very flexible, for the reasons explained in the first paragraph.

The practical way to decide which kernel to use is cross-validation; there's no arguing with success: if you find a kernel that works really well for your data, then that's the winner.
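In scikit-learn terms, that cross-validation procedure looks roughly like this (a sketch of mine, with illustrative parameter grids, not the answerer's code):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    param_grid = [
        {"kernel": ["linear"], "C": [0.1, 1, 10]},
        {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]},
        {"kernel": ["poly"], "C": [1], "degree": [2, 3]},
    ]
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)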

Profile photo for Mikael Rusin

SVD stands for Singular Value Decomposition.

It’s a specific form of series of decompositions - that allows for singular decompositional centralization of Vectors on both hand sides of an assignment operation (=).

SVD can be utilized to produce pseudo-inverses.

The reason why this is important - is because of a relationship that is called the Reciprocal theorem.

Where a functional interplay - bounds over and elapses to “relapse” to its origin - but not fully.

I.e - it’s a pseudo inverse.

This form of operation is extremely useful, because it’s an ingenious trick when it comes to Scaling factorizations - and utilizing lower dimensional representations - to calculate based off of Attributation.

To perform a full Inversion - is a costly operation.

Mostly - due to it being costly to perform deductive analytics of such high dimensions.

Here - is where the ingenuity of this decompositional factorization shows its strength.

You see - under the conditions of formations of matrices - under the predication of the System, the diagonalized values - and a couple of assertions - we can reduce the dimensionality - and then by virtue of performing the decompositional series of operations - such as scaling, rotating and flattening -

we can keep the generalization of the geometrical spatial relationship in terms of a lower dimensional space.

This is akin to drawing a map - generalizing the Vectorial space by predication of Eigenvalue relationships - reducing that map unto a Flat structure - and still keeping that relationship predicated - as the attributes are Invariant.

Seeing how the Trace is Invariant - the identity mapping under the specific conditions of our decomposition - remains intact.

By virtue of utilizing this Invariant relationship, by virtue of utilizing the Decompostional factorization and a Lowering of Dimensions - you basically lower every single other attribute and Dimensionality - utilizing a lower Dimensional structure - whilst retaining the same identity attributes.

Why this is extremely, extremely important - is because under Infinite dimensional kernel predications -

you generalize to a higher dimensional space.

However - a very common factorization - is to circumvent loss of information and generalize utilizing the Geometrical invariant attributes of the Formulational formulations of the Linear formations of the same Structure - whilst retaining minimal loss of information.

A trivial example would be to presume a matrix to be a form of simple system of fractions - akin to having a 3x3 matrix with the values:

10/50 10/50 10/50
05/50 05/50 10/50
0 0 0

If you are a keen observer - you’ll come to see - that you can equivalently express a fraction as a reduced form of another one - yet preserve the exact nature of the geometrical distancing, due to normalization of whole real numbers.

So - we could re-write this as:

1/5 1/5 1/5
1/10 1/10 1/5
0 0 0

Now - the point i am making here - is just for illustration.

That by utilizing the reformulation of shortening, scaling, rotation -

we can formalize a form of diagonalized and reduced Dimensional mapping, that retains its scaling factors.

It retains the informational attributal - yet having reduced the dimensionalities - that are relative.

This is exactly what we do in, for instance, Linear Least Squares.

Where by utilizing the minimal differential between the Residual interplays - we can ascertain minimal deviation - and minimize loss when Dimensionality is reduced.

However - in the case of SVD - we are generalizing the Spectral Allotment in terms of Vectorial interplays - and then by utilizing the Reciprocal theorem in relation to formulations of the matrices - and utilizing the predication of Identity ascribation and Identity Equality - we can perform a reduction of Dimensionality without any loss of information.

This is an extremely clever trick in terms of utilization of the Identity mapping - and utilizing the Unitary operational interplay - to utilize a lower dimensional representation - without any loss of Information.

It’s effectively like having taken the generalized minimalized norm - of the Spatial generalization of Vectors - and having utilized the Reciprocity of the matrix forms.

There exists extensions to it - as well - in terms of Atomic formulations - where the decomposition is a factorization across different dimensional cases - in relation to different Rank dynamics and Operators.

But - for sake of discussion - let’s stick to the Unitary case of the Singular Decomposition. It’s well enough to denote that the integrations of Dualities and the Hilbert spaces are there - if one wishes to seek it out.

It’s an ingenious way of handling Orthogonality - and utilizing the identity elements in relation to Singular decompositional invariance - to formalize min-max Fitting in relation to Orthogonal structures.

It’s utilized in many optimizations Schemes - where they predicate to formulate a form of Reduction of Form - or where a Fitting theorem must adhere to the conceptualization of minimization - especially in relation to Identity adherence.

Practical examples can vary from Molecular structure, to Spectral Analytics, to that of Manifold Approximations.

Seeing how the Hilbert spaces denotation is formulated here - I believe there are many, many roots that go further into Banach spaces - and into Functional Space representative dynamics - in relation to Orthogonal Min-Max decompositions.

It’s one of the standing pillars in relation to inductive reasoning - as well.

You see - in Induction - especially in mathematical proofs - you have to just showcase the base case - and one case beyond that - to project unto Factorial dynamics.

Which means - that if you can decompose unto Orthogonal Compression and Factorize accordingly to Operations - the Dimensionality of the problematique, does not matter.

It’s a generalization that will be able to formulate to any Dimensional space - and serves to be a form of “Skew” mapping.

Another case of usage - is formulations of Convolutional Filter dynamics.

You can utilize it to bring together Dimensions of different Continuums - to perform SVD on the integrated Orthogonal space.

Yet another benefit of the fact that it’s a manipulation of Identity elements.

It can double act as directional operational functional reductions - in relation to Kernel spaces - and even Co-Kernels.

Such an ingenious tool.

Thank you for this A2A - it has been fascinating.

Profile photo for Daniel Martín

There are many possible choices, but the most popular ones are either not to use a kernel, or to use a Gaussian kernel. I'll try to explain when to use each one:

  • No kernel at all. That's an option. Many SVM packages allow the use of SVM without a kernel (or using a "linear kernel", in some notations). When to use: It might be a good idea not to use a kernel when you have a large number of features and a small number of training examples. The reason is that you want to avoid potential overfitting due to the use of a non-linear function. Also, using no kernel might be good when performance constraints are tight (for example, in real-time applications).
  • Use a Gaussian kernel. You should do feature scaling before applying this type of kernel. This is important because if you don't do feature scaling and your features take a wide range of values, then the SVM would give more importance to the features with the highest values. When to use: it can be a good kernel when you have a large number of training examples and a small number of features (see the short pipeline sketch after this list).
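A minimal scikit-learn sketch of the "scale first, then Gaussian kernel" advice (my example, not the answerer's; the breast-cancer dataset is just a convenient one with features on very different scales):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    unscaled = SVC(kernel="rbf")
    scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

    print("no scaling  :", cross_val_score(unscaled, X, y, cv=5).mean())
    print("with scaling:", cross_val_score(scaled, X, y, cv=5).mean())
    # Features here span very different ranges, so scaling typically
    # improves the RBF-kernel score noticeably.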


Other, less common kernels are string kernels (http://en.wikipedia.org/wiki/String_kernel), which are suitable when your features are strings (study them if you are doing some kind of NLP), or chi-square kernels, which are widely used in my field, computer vision, because they are suitable for histogram-style feature vectors, like the histogram representation of a picture.

Profile photo for VH

In machine learning, a kernel is a function used to quantify the similarity between pairs of data points in a given dataset. Kernels are used in various algorithms, for example support vector machines (SVMs) and kernel principal component analysis (KPCA), and their purpose is to transform the data into a more useful form.

The basic idea behind using a kernel is to map the input data into a higher-dimensional space where it may be easier to separate different classes or groups of data points. In this higher-dimensional space, the kernel function can quantify the closeness between data points by computing the dot product between their corresponding feature vectors. By finding a suitable kernel function, the algorithm can learn a decision boundary that best separates the different classes or groups of data points.

Kernels have several advantages in machine learning algorithms. For example:

They allow for non-linear decision boundaries: by mapping the data into a higher-dimensional space, kernels allow non-linear decision boundaries that are not possible with linear methods such as logistic regression or linear SVMs.

They can be computationally efficient: by using the kernel trick, the calculations involved in finding the best decision boundary can be done in the original feature space, even though the data is implicitly mapped to a higher-dimensional space. This can lead to computational savings, especially when the data has a large number of features.

They can be tailored to the problem at hand: there are many different kernel functions available, and picking the right one can improve the performance of the algorithm. Some common kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid kernels.

Generally speaking, kernels play an important role in machine learning algorithms by allowing non-linear decision boundaries and making algorithms more efficient.

Profile photo for Jones Gitau

SVM, which stands for Support Vector Machine, is a supervised machine learning algorithm used for classification and regression tasks. The goal is to find the optimal dividing line (hyperplane) that best separates different data classes by maximizing the margin between them, which makes it particularly effective for complex datasets that may not be easily separated by a simple straight line. Essentially, it identifies the "best" boundary between data points of different categories by considering the data points closest to the boundary, called "support vectors", which have the most influence on the classification.

Profile photo for Håkon Hapnes Strand

For performance, the Gaussian kernel.

For simplicity, speed and interpretability, the linear kernel.

However, I rarely employ SVMs myself as there are almost always better options. The long training times and lack of interpretability are rarely justified by predictive performance compared to simpler regressions or tree methods.

Profile photo for Kevin Lacker

It depends on a lot of parameters. For some datasets a kernelized SVM won't be any slower.

One situation where this comes up: with a linear SVM you can optimize the coefficients on the dimensions directly, whereas with a kernelized SVM you have to optimize a coefficient for each point. With many more points than dimensions, the solution space is smaller for the linear SVM.

Another situation where this comes up is if you choose a slow kernel function: the cost of a single distance calculation can be a lot larger than with a linear SVM.

In general, if the speed of this algorithm is a bottleneck, you are probably doing things wrong, but my perspective is from industry rather than from academia, so there are a lot of situations where things could be different for you.

Profile photo for Quora User

SVD stands for “singular value decomposition”. It is a matrix factorization technique where a matrix is decomposed into a product of a square matrix, a diagonal (possibly rectangular) matrix, and another square matrix.

The diagonal matrix contains the “singular values”, the square roots of the eigenvalues of [math]M^H \cdot M[/math] (and of [math]M\cdot M^H[/math], if that is more convenient), where [math]M[/math] is your original matrix and [math]M^H[/math] is its Hermitian transpose (if you’re dealing with real numbers, it’s the same as [math]M^T[/math]). The square matrices contain the corresponding eigenvectors of [math]M\cdot M^H[/math] and of [math]M^H \cdot M[/math].
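A quick numpy check of that relationship (my addition, for a real matrix): the singular values of M are the square roots of the eigenvalues of M^T M.

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.normal(size=(4, 3))

    singular_values = np.linalg.svd(M, compute_uv=False)
    eigvals = np.linalg.eigvalsh(M.T @ M)           # eigenvalues of M^T M (ascending)

    print(np.sort(singular_values))                  # ascending, for comparison
    print(np.sqrt(np.clip(eigvals, 0, None)))        # same numbers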

It is often used to get a low-rank approximation of a matrix. You find the highest [math]k[/math] elements of the diagonal matrix and drop all other columns and rows, and do the same in the square matrices.

What does this mean? Well, this is easy to explain using a typical machine learning application. Imagine you have an online store with [math]N[/math] customers and [math]L[/math] items for sale. You keep track of what customers have bought what items. You keep all of that in a matrix [math]M[/math] of dimensions [math]N \times L[/math]. Each row corresponds to a customer, each column to an item.

Now, how would we calculate what customers buy the same kind of things as a given customer? Well, that’s easy: you calculate [math]M \cdot M^H[/math]. Each row of this will tell you, for each other customer, how many items they both bought. If this is high, they are very similar customers. You can do the same with the items, but here you use [math]M^H \cdot M[/math]. Of course, we can continue this: people who buy the same as people who buy the same as you. And so on, ad infinitum. There is a way to capture the limit of this process: the eigenvectors of [math]M \cdot M^H[/math]. These represent “stereotypical” users. The same for the items, you get “stereotypical items”. If the store is a book store, you could get a stereotypical book that corresponds to a combination of Lord of the Rings, Harry Potter, and Discworld which represented the “typical fantasy book”, for example.

If you now restrict the SVD to only [math]k[/math] (a few) columns, you keep the important stereotypes but need to store a lot less data. If [math]k[/math] is large enough, you get a very close approximation of the original data, but you need to store more. If you choose it smaller, you need less storage, but the approximation is less accurate.
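Here is what keeping only the top k singular values looks like in numpy (a sketch of mine; the customer-by-item matrix is random, just to show the mechanics):

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.poisson(0.3, size=(100, 40)).astype(float)   # customers x items

    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    k = 5                                                # keep k "stereotypes"
    M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation

    error = np.linalg.norm(M - M_k) / np.linalg.norm(M)
    print(f"rank-{k} approximation, relative error {error:.2f}")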

The way this information can then be used is through the familiar “people who like the things you do also purchased this; you might be interested in this item too” functionality which is the backbone of recommender systems such as Amazon, YouTube, Netflix, and many others.

Profile photo for Meir Maor

The cool thing is that the algorithm for SVM with a kernel and without is exactly the same. The algorithm for finding the maximum-margin separating hyperplane only looks at the data through the dot product operation.

With this observation we can attempt to efficiently transfer the problem to a different space (usually of higher dimension) and calculate the dot product there without actually mapping to the new space. We can even dream up spaces with infinite dimensions that we can't map to directly, but in which we still know how to calculate the dot product. This is what the famous RBF kernel does, for example.

So it's the maximum-margin hyperplane algorithm with the dot product replaced by a kernel function.
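One way to see "the same algorithm with the dot product swapped out" in practice (a sketch of mine, using scikit-learn's precomputed-kernel mode) is to hand the solver nothing but a Gram matrix of kernel values:

    from sklearn.datasets import make_moons
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

    K = rbf_kernel(X, X, gamma=2.0)                 # all the solver ever sees
    clf = SVC(kernel="precomputed").fit(K, y)

    K_test = rbf_kernel(X[:5], X, gamma=2.0)        # kernel values vs. training set
    print(clf.predict(K_test), y[:5])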

Profile photo for Quora User

A linear kernel is a linear function used as a kernel.

What is a kernel? Well, it is a function that tells you how “similar” two vectors are. A few simple functions:

  • [math]\vec{x}^T \cdot \vec{y}[/math]. This is the “linear” kernel function.
  • [math](\vec{x}^T \cdot \vec{y} + 1)^d[/math], which is a polynomial function.
  • [math]\exp(-\gamma||\vec{x} - \vec{y}||^2)[/math] (with some [math]\gamma > 0[/math]) which is behind the concept of a “radial basis function”, or RBF.
  • [math]\left. \begin{cases}1&\textrm{if }\; x = y\\0&\textrm{else}\end{cases}\right\}[/math] is very simple but not very useful in practice.

You should read the chapter on SVMs in your textbook to see where the kernels are used, and why this is important, to really understand the concept and get an idea of what kernel to use when.

Profile photo for Vasily Konovalov

When the data is not linearly separable, you can apply a kernel function and hope that in the new, higher-dimensional space the data is indeed linearly separable.

A great introduction to the use of kernel functions in SVMs can be found here: What are kernels in machine learning and SVM and why do we need them?

From my personal experience, a kernel function is not an all-powerful solution for linear separability, and there is no computationally effective kernel function (kernel trick) for every dataset; everything depends on the data. Therefore, if you have a separation problem and are thinking about the kernel trick, also try a Random Forest: in some cases it might find a better separation.

Profile photo for Chomba Bupe

When I use support vector machines (SVM) it is usually a linear SVM feeding on high-level features at the end of a model.

I don't really like using the kernel trick, I think it is better to have a much more powerful feature extractor like a pre-trained convolutional neural network (CNN) drive a linear SVM than use the shallow kernel trick.

The kernels may find a mapping in a high-dimensional space, which is cool and all, but it might overfit and not generalize as well as the above-mentioned approach.

Hope this helps.

Profile photo for Grzegorz Gwardys

Do you mean kernelized SVM classification or training? If we are talking about training, a kernelized SVM is slower, because, for example, a radial function needs more operations than a simple dot product. If we are talking about classification, we can say that it's much slower, because there is no explicit hyperplane in the kernelized case: we have to calculate "distances" to all support vectors instead of using a hyperplane equation.

Some numbers (in Python, classification task, 5977 objects with 256 dimensions, 685 SVs):

Linear SVM (using hyperplane): 0.71 seconds
Linear SVM (calculating distances to all SVs): 255.65 seconds
RBF SVM: 1154.43 seconds
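Those numbers are from one particular setup; as a hedged, reproducible sketch (mine, with arbitrary synthetic data of similar dimensionality), the same effect can be seen by timing predictions of a linear model against a kernelized one:

    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC, SVC

    X, y = make_classification(n_samples=5_000, n_features=256, random_state=0)

    linear = LinearSVC(dual=False).fit(X, y)     # stores one hyperplane
    rbf = SVC(kernel="rbf").fit(X, y)            # stores support vectors

    for name, clf in [("linear", linear), ("rbf", rbf)]:
        t0 = time.perf_counter()
        clf.predict(X)
        print(f"{name:>6s} prediction: {time.perf_counter() - t0:.3f}s")
    # The RBF model must evaluate the kernel against every support vector,
    # so its prediction time grows with the number of SVs.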

Profile photo for Rohit Sharma

SVM [ https://en.wikipedia.org/wiki/Support_vector_machine ] (aka Support Vector Machine) is primarily used as a large margin classification algorithm. It works by mapping the data items into a higher dimensional space to identify a hyperplane that acts as a decision boundary.

Consider the red and blue dots in the next picture, which represent two classes that cannot be linearly separated in 2-D space.

But they can become linearly separated by a transformation ...
