Profile photo for Quora User

Briefly speaking, a kernel is a shortcut that helps us do certain calculation faster which otherwise would involve computations in higher dimensional space.

Mathematical definition: K(x, y) = <f(x), f(y)>. Here K is the kernel function, x and y are n-dimensional inputs, f is a map from n-dimensional to m-dimensional space, and <x, y> denotes the dot product. Usually m is much larger than n.

Intuition: normally, calculating <f(x), f(y)> requires us to compute f(x) and f(y) first and then take their dot product. These two steps can be quite expensive, as they involve manipulations in the m-dimensional space, where m can be very large. But after all the trouble of going to the high-dimensional space, the result of the dot product is just a scalar: we come back to one-dimensional space again! Now, the question we have is: do we really need to go through all the trouble to get this one number? Do we really have to go to the m-dimensional space? The answer is no, if you find a clever kernel.

Simple example: x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y) = (<x, y>)^2.

Let's plug in some numbers to make this more intuitive: suppose x = (1, 2, 3); y = (4, 5, 6). Then:
f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
<f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024

That is a lot of algebra, mainly because f is a mapping from 3-dimensional to 9-dimensional space.

Now let us use the kernel instead:
K(x, y) = (4 + 10 + 18)^2 = 32^2 = 1024
Same result, but this calculation is so much easier.
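
As a minimal sketch of this shortcut, the worked example above can be checked in a few lines of NumPy; the helper names f and K below simply mirror the notation used in this answer and are not part of the original text.

```python
import numpy as np

def f(v):
    # explicit feature map: all pairwise products v_i * v_j (3-dim input -> 9-dim output)
    return np.outer(v, v).ravel()

def K(x, y):
    # kernel shortcut: square of the ordinary dot product, computed in the original 3-dim space
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(np.dot(f(x), f(y)))  # 1024.0 -- the long way, through the 9-dimensional space
print(K(x, y))             # 1024.0 -- the shortcut, never leaving 3 dimensions
```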

Additional beauty of kernels: kernels allow us to do stuff in infinite dimensions! Sometimes going to a higher dimension is not just computationally expensive, but also impossible: f(x) can be a mapping from n dimensions to infinite dimensions, which we may have little idea of how to deal with. Then the kernel gives us a wonderful shortcut.

Relation to SVM: now, how is this related to SVM? The idea of SVM is that y = w phi(x) + b, where w is the weight vector, phi is the feature map, and b is the bias. If y > 0, we classify the datum to class 1, else to class 0. We want to find a set of weights and a bias such that the margin is maximized. Previous answers mention that kernels make data linearly separable for SVM. I think a more precise way to put this is: kernels do not make the data linearly separable; the feature map phi(x) makes the data linearly separable. The kernel is there to make the calculation faster and easier, especially when the feature vector phi(x) is of very high dimension (for example, x1, x2, ..., x_D, x1^2, x2^2, ..., x_D^2, ...).

Why it can also be understood as a measure of similarity:
if we put the definition of the kernel above, <f(x), f(y)>, in the context of SVM and feature vectors, it becomes <phi(x), phi(y)>. The inner product measures the projection of phi(x) onto phi(y), or colloquially, how much overlap x and y have in their feature space. In other words, how similar they are.

Profile photo for Bharath Hariharan

Great answers here already, but there are some additional things that I would want to say. So here goes.

What are kernels?
A kernel is a similarity function. It is a function that you, as the domain expert, provide to a machine learning algorithm. It takes two inputs and spits out how similar they are.

Suppose your task is to learn to classify images. You have (image, label) pairs as training data. Consider the typical machine learning pipeline: you take your images, you compute features, you string the features for each image into a vector, and you feed these "feature vectors" and labels into a learning algorithm.

Data --> Features --> Learning algorithm

Kernels offer an alternative. Instead of defining a slew of features, you define a single kernel function to compute similarity between images. You provide this kernel, together with the images and labels to the learning algorithm, and out comes a classifier.

Of course, the standard SVM/logistic regression/perceptron formulation doesn't work with kernels: it works with feature vectors. How on earth do we use kernels then? Two beautiful mathematical facts come to our rescue:

  1. Under some conditions, every kernel function can be expressed as a dot product in a (possibly infinite dimensional) feature space (Mercer's theorem).
  2. Many machine learning algorithms can be expressed entirely in terms of dot products.

These two facts mean that I can take my favorite machine learning algorithm, express it in terms of dot products, and then since my kernel is also a dot product in some space, replace the dot product by my favorite kernel. Voila!
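
To make fact 2 concrete, here is a hedged sketch of a kernelized perceptron, one of the simplest algorithms that can be written purely in terms of dot products; the kernel choice, function names, and training loop below are illustrative assumptions rather than anything from this answer.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # a dot product in an infinite-dimensional feature space, computed in closed form
    return np.exp(-gamma * np.linalg.norm(a - b) ** 2)

def kernel_perceptron(X, y, kernel, epochs=10):
    """Train a perceptron that touches the data only through kernel evaluations."""
    n = len(X)
    alpha = np.zeros(n)              # one coefficient per training example
    for _ in range(epochs):
        for i in range(n):
            score = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n))
            if np.sign(score) != y[i]:
                alpha[i] += 1.0      # mistake-driven update
    return alpha
```

Swapping rbf_kernel for any other valid kernel changes the feature space without touching the algorithm itself.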

Why kernels?
Why kernels, as opposed to feature vectors? One big reason is that in many cases, computing the kernel is easy, but computing the feature vector corresponding to the kernel is really, really hard. The feature vector for even simple kernels can blow up in size, and for kernels like the RBF kernel (k(x, y) = exp(-||x - y||^2), see Radial basis function kernel) the corresponding feature vector is infinite dimensional. Yet computing the kernel is almost trivial.

Many machine learning algorithms can be written to only use dot products, and then we can replace the dot products with kernels. By doing so, we don't have to use the feature vector at all. This means that we can work with highly complex, efficient-to-compute, and yet high-performing kernels without ever having to write down the huge and potentially infinite-dimensional feature vector. Thus, if not for the ability to use kernel functions directly, we would be stuck with relatively low-dimensional, low-performance feature vectors. This "trick" is called the kernel trick (Kernel trick).

Endnote
I want to clear up two confusions which seem prevalent on this page:

  1. A function that transforms one feature vector into a higher-dimensional feature vector is not a kernel function. Thus f(x) = [x, x^2] is not a kernel; it is simply a new feature vector. You do not need kernels to do this. You need kernels if you want to do this, or more complicated feature transformations, without blowing up dimensionality.
  2. A kernel is not restricted to SVMs. Any learning algorithm that only works with dot products can be written down using kernels. The idea of SVMs is beautiful, the kernel trick is beautiful, and convex optimization is beautiful, and they stand quite independent.

Profile photo for Quora User

Intuitively, a kernel is just a transformation of your input data that allows you (or an algorithm like SVMs) to treat/process it more easily. Imagine that we have the toy problem of separating the red circles from the blue crosses on a plane as shown below.

Our separating surface would be the ellipse drawn in the left figure. However, transforming our data into a 3-dimensional space through the mapping shown in the figure would make the problem much easier since, now, our points are separated by a simple plane. This embedding into a higher dimension is called the kernel trick.

In conclusion, and very informally, a kernel amounts to embedding general points into an inner product space.


PS: I have taken this graph from http://www.sussex.ac.uk/Users/christ/crs/ml/lec08a.html, which also appears in Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning.

Profile photo for Rahul Agarwal

Found this on Reddit: Please explain Support Vector Machines (SVM) like I am a 5 year old. • /r/MachineLearning [ https://www.reddit.com/r/MachineLearning/comments/15zrpp/please_explain_support_vector_machines_svm_like_i ]

Simply the best explanation of SVM I ever found.
----------------------------------------------------------------------------------------------


> We have 2 colors of balls on the table that we want to separate.

We get a stick and put it on the table; this works pretty well, right?

Some villain comes and places more balls on the table. It kind of works, but one of the balls is on the wrong side, and there is probably a better place to put the stick now.

SVMs try to put the stick in the best possible place by having as big a gap on either side of the stick as possible.

Now when the villain returns the stick is still in a pretty good spot.

There is another trick in the SVM toolbox that is even more important. Say the villain has seen how good you are with a stick so he gives you a new challenge.

There’s no stick in the world that will let you split those balls well, so what do you do? You flip the table of course! Throwing the balls into the air. Then, with your pro ninja skills, you grab a sheet of paper and slip it between the balls.

Now, looking at the balls from where the villain is standing, the balls will look split by some curvy line.

Boring adults call the balls data, the stick a classifier, the biggest-gap trick optimization, flipping the table kernelling, and the piece of paper a hyperplane.
-----------------------------------------------------------------------------------------------

Now see this:

--------------------------------------------------------------------------------------------
One other point that I like to mention about the SVM (unrelated to this question) is how it is defined by the boundary case examples. (taken from CS109 c...

Profile photo for Assistant
Assistant

In machine learning, particularly in the context of Support Vector Machines (SVM), kernels are mathematical functions used to enable algorithms to operate in high-dimensional spaces without explicitly transforming the data into those spaces. This is particularly useful for tasks like classification and regression.

What are Kernels?

  1. Definition: A kernel is a function that computes the inner product of two vectors in a high-dimensional feature space, allowing SVMs to find hyperplanes that can separate classes effectively.
  2. Types of Kernels (a short scikit-learn sketch follows this list):
    - Linear Kernel: No transformation; used when the data is linearly separable.
    - Polynomial Kernel: Computes polynomial combinations of the input features, allowing for curved decision boundaries.
    - Radial Basis Function (RBF) Kernel: A popular choice, it maps input features into an infinite-dimensional space, allowing for very flexible decision boundaries.
    - Sigmoid Kernel: Based on the sigmoid function, it can be used in neural networks.
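
For illustration only, here is how those kernel choices might look in a short scikit-learn sketch; the toy dataset and parameter values are assumptions, not part of the original answer.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: a classic dataset that is not linearly separable in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale")
    clf.fit(X, y)
    print(kernel, clf.score(X, y))  # training accuracy; RBF typically separates the circles best
```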

Why Do We Need Kernels?

  1. Non-Linearity: Many real-world datasets are not linearly separable in their original space. Kernels allow SVMs to create non-linear decision boundaries by implicitly mapping input features into higher-dimensional spaces.
  2. Computational Efficiency: Directly transforming data into high dimensions can be computationally expensive and infeasible. Kernels allow SVMs to operate in these spaces without the need for explicit transformations, using the "kernel trick."
  3. Flexibility: Different kernels can be chosen based on the nature of the data and the problem at hand, providing flexibility in model selection and complexity.
  4. Improved Performance: By using an appropriate kernel, SVMs can achieve better classification performance, especially in complex datasets with intricate relationships between features.

Summary

In summary, kernels are essential in SVM and other machine learning algorithms for handling non-linear relationships in data efficiently. They enable the creation of flexible models that can adapt to a variety of data distributions, ultimately improving the performance of machine learning tasks.


Profile photo for Prasoon Goyal

A2A.

Intuitively, a kernel function measures the similarity between two data points. The notion of similarity is task-dependent. So, for instance, if your task is object recognition, then a good kernel will assign a high score to a pair of images that contain the same objects, and a low score to a pair of images with different objects. Note that such a kernel captures a much more abstract notion of similarity than a similarity function that just compares two images pixel-by-pixel. As another example, consider a text processing task. There, a good kernel function would assign a high score to a pair of similar strings, and a low score to a pair of dissimilar strings. This is the advantage of kernel functions: they can, in principle, capture arbitrarily complex notions of similarity, which can be used in various ML algorithms.

Thus, formally, a kernel function [math]K[/math] takes two data points [math]x_{i}[/math] and [math]x_{j}[/math] [math]\in \mathbb{R}^{d}[/math], and produces a score, which is a real number, i.e., [math]K : \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}[/math].


In order to use kernel functions in most common ML algorithms, such as SVMs, there are additional requirements on the kernel functions, so that they can fit into the framework. Consider the SVM dual optimization problem:

[math]\max_{\alpha} 1^T\alpha - \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} x_{i}^Tx_{j}[/math]

subject to some constraints that are not important for the current discussion.

Note that the optimal value of [math]\alpha[/math] will depend on [math]x_{i}^T x_{j}[/math] for all pairs of data points [math](x_{i}, x_{j})[/math]. Also, [math]x_{i}^T x_{j}[/math] is a crude similarity measure between [math](x_{i}, x_{j})[/math]. More generally, you may want to map your data to a new space where [math]x_{i}[/math] is mapped to [math]\phi(x_{i})[/math] and [math]x_{j}[/math] is mapped to [math]\phi(x_{j})[/math]. Then, your similarity function should give you [math]\phi(x_{i})^T\phi(x_{j})[/math]. For [math]x_{i}^Tx_{j}[/math], the mapping [math]\phi(\cdot)[/math] is the identity function. [See here for a discussion on mapping in SVMs — Prasoon Goyal's answer to In layman's terms, how does SVM work?]

Hence, defining [math]K(x_{i}, x_{j}) = \phi(x_{i})^T\phi(x_{j})[/math], and putting back in the original optimization problem gives you

[math]\max_{\alpha} 1^T\alpha - \sum_{i} \sum_{j} \alpha_{i} \alpha_{j} y_{i} y_{j} K(x_{i}, x_{j})[/math]

Now you can plug in any function [math]\mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}[/math] as the kernel function in this optimization problem as long as there is some function [math]\phi(\cdot)[/math] for which [math]K(x_{i}, x_{j}) = \phi(x_{i})^T\phi(x_{j})[/math]. Note that you do not need to know what [math]\phi(\cdot)[/math] is; just the knowledge that it exists is sufficient. This is where Mercer’s theorem comes into play — it states that a kernel function [math]K(x_{i}, x_{j})[/math] has a representation [math]\phi(x_{i})^T\phi(x_{j})[/math] for some [math]\phi(\cdot)[/math] if and only if it is positive semi-definite and symmetric. [I’ll skip the technical definitions here to avoid digressing from the main idea of the question; if you have specific queries on this, please post a separate question.]

The above idea is called the Kernel trick — whenever your optimization problem depends on [math]x_{i}, x_{j}[/math] only in the form of inner products [math]x_{i}^T x_{j}[/math], you can replace [math]x_{i}^T x_{j}[/math] by [math]K(x_{i}, x_{j})[/math], where [math]K(\cdot, \cdot)[/math] satisfies Mercer’s theorem.
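
As a small illustration of the Mercer condition on a finite sample (a sketch with made-up data, not part of the original answer), the Gram matrix of a valid kernel evaluated on any set of points should be symmetric and positive semi-definite:

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

X = np.random.RandomState(0).randn(20, 3)             # 20 illustrative points in R^3
G = np.array([[rbf(a, b) for b in X] for a in X])      # Gram matrix G[i, j] = K(x_i, x_j)

print(np.allclose(G, G.T))                             # True: symmetric
print(np.linalg.eigvalsh(G).min() >= -1e-10)           # True: eigenvalues >= 0 (up to round-off)
```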


Note that in the first segment of the answer, we look at what a kernel function is, while in the second segment, we look at the conditions on kernel functions that can be used in algorithms like SVMs. It is important to note that a function [math]K(x_{i}, x_{j})[/math] can still be a kernel function even if it is not positive semi-definite and symmetric; it just cannot be used in algorithms that require these conditions. There are algorithms that can make use of such kernels, for instance, see Learning with Non-Positive Kernels, although these methods are relatively less common.

Profile photo for Soumendra Prasad Dhanee
  1. For a dataset with n features (~n-dimensional), SVMs find an n-1 dimensional hyperplane to separate it (let us say for classification)
  2. Thus, SVMs perform very badly with datasets that are not linearly separable
  3. But, quite often, it’s possible to transform our not-linearly-separable dataset into a higher-dimensional dataset where it becomes linearly separable, so that SVMs can do a good job
  4. Unfortunately, quite often, the number of dimensions you have to add (via transformations) depends on the number of dimensions you already have (and not linearly :( )
    1. For datasets with a lot of features, it becomes next to impossible to try out all the interesting transformations
  5. Enter The Kernel Trick
    1. Thankfully, the only thing SVMs need to do in the (higher-dimensional) feature space (while training) is computing the pair-wise dot products
    2. For a given pair of vectors (in a lower-dimensional feature space) and a transformation into a higher-dimensional space, there exists a function (The Kernel Function) which can compute the dot product in the higher-dimensional space without explicitly transforming the vectors into the higher-dimensional space first
    3. We are saved!
  6. SVM can now do well with datasets that are not linearly-separable
Profile photo for Hasan Poonawala

SVMs depend on two ideas: VC dimension and optimization. Given points in the plane and the fact that they are separable (Nikhil's answer), there are infinitely many lines (half-planes in 2D) that would separate them. The SVM finds the best one by solving an optimization problem over those separating half-planes. The nice thing is that it is a convex quadratic programming problem, which is fast and easy to solve.

The VC dimension is related to a fact we took for granted: are the training points separable? Well, when it comes to points in 2D, the maximum number of points that are guaranteed to be separable is 3, which is precisely the VC dimension. Note the diagram in Nikhil's answer projects the three points on a line to 2D, and he drew a classifier that works. The VC dimension depends on the classifiers (hyperplanes in the case of SVMs) and the dimension in which our data lies (2D in this case). Now if the points are in n dimensions, the VC dimension is n+1. So instead, if we have m points, we need the points to lie in a space of dimension at least (m-1). When this happens, we are sure to find a solution to the optimization we perform. This is where the kernel comes in. We project the m points lying in n dimensions, n < (m-1), into an (m-1)-dimensional space so that we are guaranteed to find a linear classifier in that dimension.

To make it concrete, as in Nikhil's example, we had three points in 1 dimension. Not solvable. So we project it into a space of at least (3-1)=2 dimensions, and we will find a solution.

Edit:
As Yan King Yin points out, three points in a plane cannot be separated if they are collinear. VC dimension excludes these cases (measure zero sets), otherwise half planes couldn't separate anything.

Edit 2:

Since people are still looking at this answer, and the Kernels part is hidden in a comment, I’m pulling it into the answer:

We have a set of data-points that cannot be linearly separated (related to VC dimension of half-planes). Using a nonlinear classifier to separate them is painful.

The linear SVM algorithm really only needs to take the inner product of the points it's trying to classify. This is because half-planes can be determined using inner products, and SVMs are looking for half-planes.

So we want to do two things: we want to map all the data points from the space they come from into a useful inner product space, which will usually be of higher dimension (usefulness is related to separability, which is related to VC dimension). Then we want to use a linear SVM to classify the points in this space using a unique half-plane. This means we want to take inner products of the transformed data points.

A kernel function does precisely this, but in one single step. You do not bother with finding the transformed points, since you already know what their dot product will be thanks to the kernel function.

This is the kernel trick: use a kernel function to convert a nonlinear classification into a linear one (if possible).

First off, you save a HUGE amount of time because this is a convex quadratic program and not some arbitrary nonlinear optimization over weirdly distributed points.
Second, you save time and space by not mapping the points into the transformed space, but directly going from two points in the original space to the dot product in the higher dimensional space.

The linear classification performed in that space actually becomes a nonlinear classification in the original space.

Profile photo for Shashank Gupta

SVM in its general setting works by finding an "optimal" hyperplane which separates two point clouds of data. But this is limited in the sense that it only works when the two point clouds (each corresponding to a class) can be separated by a hyperplane. What if the separating boundary is non-linear?

This is an image I found on the web. It is a perfect example of the limitation of SVM in its general setting. SVM will learn the red line from the data points, which we can clearly see is not optimal. The optimal boundary is the green curve, which is a non-linear structure.

To overcome this limitation we use the concept of kernels. Given a test point, kernels will fit a curve to the data points, specifically to points that are 'close' to the test point. So, when using an RBF kernel, it will fit a Gaussian distribution over points 'nearby' the test point. Closeness is defined using a standard distance metric.

Suppose this is the function we are trying to approximate. The yellow region is the Gaussian fit over the training points that are close to the test point (assume the test point is the one at the center of the Gaussian). So it approximates the function (the separating boundary) using piece-wise Gaussians (for the RBF kernel), and it uses only a sparse subset of the training data (the support vectors) to do so.

This is the intuition for using kernels in SVM. Of course, there are nice theoretical arguments for using kernels, like the projection of data points into an infinite-dimensional space where they become linearly separable, but this is the intuition that I have for using kernels. HTH


Profile photo for Gillis Danielsen

Here is an even shorter visualization of what an SVM does. Basically, the trick is to look for a projection to a higher-dimensional space where there is a linear separation of the data (others have posted much more detailed and correct answers, so I just wanted to share the vid, not made by me).

Profile photo for Omar R

Part 1. Why do we need Kernels?

Answer: Assuming that you have heard of the Curse of Dimensionality, there is also a concept of the Blessing of Dimensionality, which says that in higher dimensions the structure of data becomes easier to analyse than in lower dimensions. In other words, we assume that in a classification problem, if different classes cannot be separated by a linear boundary, then, if we project the data into a higher-dimensional space, a linear decision boundary may be achievable. Consider the image below (from Google Images) to get a clear idea of how a nonlinear decision boundary can be converted to a linear decision boundary by projecting the data into a higher-dimensional space.
A kernel may be used to achieve this transformation (a toy sketch of this idea follows).
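
Here is a hedged toy example of that projection: points inside and outside a circle are not linearly separable in 2-D, but adding the extra feature z = x1^2 + x2^2 makes them separable by a plane. The data and threshold below are illustrative assumptions, not from the original answer.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)    # class 1 = inside the unit circle

# Lift each 2-D point to 3-D: (x1, x2, x1^2 + x2^2)
Z = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])

# In the lifted space, the horizontal plane z = 1 separates the two classes exactly.
pred = (Z[:, 2] < 1.0).astype(int)
print((pred == y).mean())                               # 1.0
```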

Part 2: What are Kernels?

Answer:

  • Consider two points/vectors, [math]X[/math] and [math]Y[/math], in the given d-dimensional space (say d = 2).
  • Consider a mapping [math]\phi[/math], which transforms [math]X[/math] and [math]Y[/math] into a higher dimension (say 3). [math]\phi[/math] is up to you to choose.
  • Consider a kernel [math]K[/math]. Note that [math]K[/math] is computed in the original (2-dimensional) space.

Then the way to calculate [math]K[/math] is:

  1. Find [math]\phi(X)[/math] and [math]\phi(Y)[/math].
  2. Find their dot product, i.e. [math]\langle \phi(X), \phi(Y) \rangle[/math].
  3. [math]\langle \phi(X), \phi(Y) \rangle[/math] gives us [math]K(X, Y)[/math].

A kernel is a general concept and can be used in many algorithms that are linear in nature when the data is 'non-linear'.

Profile photo for Abhishek Shivkumar

I know that pasting a link to an external talk would not be an appropriate answer here, but this talk is just so awesome and answers your question so intuitively that I encourage you to watch it

http://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CFgQFjAA&u...

Profile photo for Sreejith Menon

Another famous question that I have encountered which is related to the topic is the computational complexity for predictions using SVM with and without the kernel trick.

Say the original feature space has |X| features. The new feature space has |F| dimensions. A kernel function takes k time to evaluate. Further there are m training examples and |S| support vectors.

Without using the kernel trick: The time taken would be O(|F|).

Explanation: For one testing example we would have to actually compute the term sign(w0 + w*x) which would take time O(|F|).

With the kernel trick: Time taken would be O(k*|S|)

Explanation: We would simply compute the dot product of testing example with that of the support vectors. The number of support vectors times the time taken to execute the kernel function once would be the computational complexity.
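
A hedged sketch of the two prediction routes described above; the notation follows this answer, but the function names and data layout are illustrative assumptions.

```python
import numpy as np

def predict_primal(w, b, phi_x):
    # O(|F|): one dot product with the weight vector in the (possibly huge) feature space
    return np.sign(np.dot(w, phi_x) + b)

def predict_kernel(support_vectors, alphas, labels, b, x, kernel):
    # O(k * |S|): one kernel evaluation per support vector, never forming phi(x)
    s = sum(a * yi * kernel(sv, x) for sv, a, yi in zip(support_vectors, alphas, labels))
    return np.sign(s + b)
```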

Please correct my answer if I am wrong here.

Profile photo for Gopal Malakar

This is one of the easiest explanations of SVM and the kernel trick.

Profile photo for Ashkon Farhangi

To add to the other answers, an RBF kernel, perhaps the most commonly used kernel, acts essentially as a low-pass filter that prefers smoother models. For a full mathematical justification of this fact, check out Charles Martin's explanation.

Profile photo for Balaji Pitchai Kannu

It is not possible to find a hyperplane or a linear decision boundary for some classification problems. If we project the data in to a higher dimension from the original space, we may get a hyperplane in the projected dimension that helps to classify the data.

As shown in the above figure, it is impossible to find a single line to separate the two classes (green and blue) in the input space. But, after projecting the data into a higher dimension (i.e. the feature space in the figure), we are able to find a hyperplane which classifies the data. The kernel helps to find a hyperplane in the higher-dimensional space without increasing the computational cost much. Usually, the computational cost increases as the dimension of the data increases.

How come Kernel doesn’t increase the computational complexity?

We know that the dot product of two vectors of the same dimension gives a single number. The kernel utilizes this property to compute the dot product in a different space without even visiting that space.

Assume that we have two features. It means that each data point lies in [math]\mathbb R^2.[/math]

[math]x_{i} = \begin{bmatrix} x_{i1} \\ x_{i2} \end{bmatrix} \tag*{}[/math]

The subscript i in x indexes the data points; similarly, 1 and 2 in the subscript denote the features. Also, assume that we are applying some transformation function to convert the two-dimensional input space (two features) into a four-dimensional feature space, namely [math](x_{i1}^{2}, x_{i1} x_{i2}, x_{i2} x_{i1}, x_{i2}^{2}).[/math] It requires [math]\mathcal{O}(n^{2})[/math] time to compute this for n data points in the four-dimensional space. To calculate the dot product of two vectors in the four-dimensional (transformed) space, the standard way is:

1. Convert each data point from [math]\mathbb R^2 \to \mathbb R^4[/math] by applying the transformation. (I have taken two data points [math]x_{i}[/math] and [math]x_{j}[/math].)

[math]\phi(x_{i}) = \begin{bmatrix} x_{i1}^{2}\\ x_{i1} x_{i2} \\ x_{i2} x_{i1}\\ x_{i2}^{2} \end{bmatrix} \hspace{2cm} \phi(x_{j}) = \begin{bmatrix} x_{j1}^{2}\\ x_{j1} x_{j2} \\ x_{j2} x_{j1}\\ x_{j2}^{2} \end{bmatrix}[/math]

2. Take the dot product of the two vectors.

[math]\phi(x_{i}).\phi(x_{j}) \tag*{}[/math]

As I said before, Kernel function calculates the dot product in the different space without even visiting it. Kernel function for the above transformation is

[math]K(x_{i},x_{j}) = (x_{i}^{T}x_{j})^{2} \tag{1}[/math]

Example:

Let say [math]x_{i} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \hspace{0.5cm}\text{and} \hspace{0.5cm}x_{j} = \begin{bmatrix} 3 \\ 5 \end{bmatrix}.[/math]

The dot product in the four dimensional space by the standard way is

[math]= \begin{bmatrix} 1 \\ 2 \\ 2 \\ 4 \end{bmatrix} \cdot \begin{bmatrix} 9\\ 15 \\ 15 \\ 25 \end{bmatrix} = 9+30+30+100 = 169 \tag*{}[/math]

The above dot product can be calculated using the above kernel function (equation 1) without even transforming the original space.

[math] K(x_{i},x_{j}) = \left(\begin{bmatrix} 1 \\ 2 \end{bmatrix}^{T} \cdot \begin{bmatrix} 3 \\ 5 \end{bmatrix}\right)^{2} = (3 + 10)^{2} = 13^{2} = 169 \tag*{}[/math]
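
A quick NumPy check of these numbers (a sketch, not part of the original answer), comparing the explicit R^2 -> R^4 map with the kernel shortcut:

```python
import numpy as np

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 5.0])

def phi(v):
    # explicit map R^2 -> R^4: (v1^2, v1*v2, v2*v1, v2^2)
    return np.array([v[0] ** 2, v[0] * v[1], v[1] * v[0], v[1] ** 2])

print(np.dot(phi(xi), phi(xj)))   # 169.0 -- via the four-dimensional space
print(np.dot(xi, xj) ** 2)        # 169.0 -- via the kernel (x_i^T x_j)^2, staying in R^2
```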

Profile photo for Sebastian Raschka

Given an arbitrary dataset, you typically don't know which kernel may work best. I recommend starting with the simplest hypothesis space first -- given that you don't know much about your data -- and work your way up towards the more complex hypothesis spaces.

So, the linear kernel works fine if your dataset is linearly separable; however, if your dataset isn't linearly separable, a linear kernel isn't going to cut it (almost in a literal sense ;)).

For simplicity (and visualization purposes), let's assume our dataset consists of 2 dimensions only. Below, I plotted the decision regions of a linear SVM on 2 features of the iris dataset:

This works perfectly fine. And here comes the RBF kernel SVM:

Now, it looks like both linear and RBF kernel SVMs would work equally well on this dataset. So, why prefer the simpler, linear hypothesis? Think of Occam's Razor in this particular case. A linear SVM is a parametric model, an RBF kernel SVM isn't, and the complexity of the latter grows with the size of the training set. Not only is it more expensive to train an RBF kernel SVM, but you also have to keep the kernel matrix around, and the projection into this "infinite" higher-dimensional space where the data becomes linearly separable is more expensive as well during prediction. Furthermore, you have more hyperparameters to tune, so model selection is more expensive as well! And finally, it's much easier to overfit a complex model!

Okay, what I've said above sounds all very negative regarding kernel methods, but it really depends on the dataset. E.g., if your data is not linearly separable, it doesn't make sense to use a linear classifier:

In this case, an RBF kernel would make so much more sense:

In any case, I wouldn't bother too much about the polynomial kernel. In practice, it is less useful, for both computational efficiency and predictive performance reasons. So, the rule of thumb is: use linear SVMs (or logistic regression) for linear problems, and nonlinear kernels such as the Radial Basis Function kernel for non-linear problems.

Btw. the RBF kernel SVM decision region is actually also a linear decision region. What RBF kernel SVM actually does is to create non-linear combinations of your features to uplift your samples onto a higher-dimensional feature space where you can use a linear decision boundary to separate your classes:

Okay, above, I walked you through an intuitive example where we can visualize our data in 2 dimensions ... but what do we do in a real-world problem, i.e., a dataset with more than 2 dimensions? Here, we want to keep an eye on our objective function: minimizing the hinge-loss. We would setup a hyperparameter search (grid search, for example) and compare different kernels to each other. Based on the loss function (or a performance metric such as accuracy, F1, MCC, ROC auc, etc.) we could determine which kernel is "appropriate" for the given task. I've some more posts here if it helps: How do I evaluate a model? | A Basic Pipeline and Grid Search Setup via scikit-learn: Jupyter Notebook Viewer
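
For what it's worth, a grid search over kernels of the kind described above might look like the following hedged scikit-learn sketch; the dataset and parameter grid are illustrative assumptions, not the author's actual setup.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.15, random_state=0)

# Compare a linear kernel against an RBF kernel, tuning C (and gamma for RBF).
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```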

Profile photo for Prasoon Goyal

A2A.

As others have pointed out, there’s no way to figure out which kernel would do the best for a particular problem. The only way to choose the best kernel is to actually try out all possible kernels, and choose the one that does the best empirically. However, we can still look at some differences between various kernel functions, to have some rules of thumb.

Let’s start by listing the kernel functions:

  • Linear: [math]K(x, y) = x^Ty[/math]
  • Polynomial: [math]K(x, y) = (x^Ty + 1)^d[/math]
  • Sigmoid: [math]K(x, y) = tanh(a x^Ty + b)[/math]
  • RBF: [math]K(x, y) = \exp(-\gamma \| x - y\|^2)[/math]

Now, let’s look at some differences:

  • Translation invariance: RBF kernel is the only kernel out of the above that is translation invariant, that is, [math]K(x, y) = K(x + t, y + t)[/math], where t is any arbitrary vector. Intuitively, this property is useful — if you imagine all your data lying in some space, then the similarity between the points should not change if you shift the entire data, without changing the relative positions of the points.
  • Inner product vs Euclidean distance: Related to the above point, RBF kernel is a function of the Euclidean distance between the points, whereas all other kernels are functions of inner product of the points. Again, it makes more intuitive sense to have Euclidean distance — points that are closer should be more similar. If two points are close to the origin, but on opposite sides, then the inner product based kernels assign the pair a low value, but Euclidean distance based kernels assign the pair a high value. It is, however, important to note that for some applications, inner product is sometimes the more preferred similarity metric, like in bag-of-words vectors, because you care more about the direction of the vectors (which words appear in both the document vectors) rather than the actual counts.
  • Normalized: A kernel is said to be normalized if [math]K(x, x) = 1[/math] for all [math]x[/math]. This is true for only RBF kernel in the above list. Again, intuitively, you want this property to hold — if [math]x[/math] and [math]x[/math] have a similarity of [math]\lambda[/math], then [math]2x[/math] and [math]2x[/math] should also have a similarity of [math]\lambda[/math]. You can convert an arbitrary kernel [math]K(x, y)[/math] to a normalized kernel [math]\tilde{K}(x, y)[/math] by defining [math]\tilde{K}(x, y) = \dfrac{K(x, y)}{\sqrt{K(x, x)} \sqrt{K(y, y)}}[/math]. (Also, as a side note, RBF kernel is the normalized kernel for the exponential kernel, [math]K(x, y) = \exp(x^Ty)[/math].)

These properties tend to make RBF kernel better in general, for most problems. And because it does the best empirically, it tends to be most widely used. However, just to reiterate, depending on the nature of the problem, it is possible that one of the other kernels does better than RBF kernel.
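
As a small illustration of the normalization point in the last bullet above (a sketch with an illustrative kernel choice, not part of the original answer):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return (np.dot(x, y) + 1.0) ** d

def normalized(kernel, x, y):
    # K~(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y))
    return kernel(x, y) / np.sqrt(kernel(x, x) * kernel(y, y))

x = np.array([1.0, 2.0])
print(normalized(poly_kernel, x, x))          # 1.0, as a normalized kernel requires
print(normalized(poly_kernel, 2 * x, 2 * x))  # still 1.0: scaling x no longer changes self-similarity
```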

Profile photo for Quora User

Whatever you want it to be.

You can use whatever inner product you want, and you will get different results accordingly. Most software packages will just use the standard dot product unless you specify that a different kernel should be used.

If you want to know how the kernel influences the model, you really should read tutorials on the mathematics behind SVM. I’ll try to summarize it, but you should realize that this is a very rough summary:

We want to find the best hyperplane that separates the two classes of our data. Here “best” means: ideally it has all the positive examples on one side, and all negative examples on the other side; and the samples closest to the line are not too close — we want the maximum margin. If such a line (hyperplane) does not exist, we want the wrongly labeled samples to lie close to the line.

Now, instead of just using a hyperplane in our original space, we could project our data into a higher-dimensional space. For example, if our data has [math]x[/math] and [math]y[/math] coordinates, we can project each point to [math](x^2, y^2, xy, x, y)[/math] in 5-dimensional space. This is really cool: a hyperplane in this space corresponds to any conic section we want in our original space. So now we can have models that say "every point within this particular ellipse is positive, everything else is negative".
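
To see the conic-section point concretely, here is a hedged NumPy sketch (the ellipse parameters are illustrative assumptions): the ellipse test (x/a)^2 + (y/b)^2 <= 1 is nonlinear in (x, y) but becomes a linear, half-space test on the projected coordinates (x^2, y^2, xy, x, y).

```python
import numpy as np

a, b = 2.0, 1.0
rng = np.random.RandomState(0)
pts = rng.uniform(-3, 3, size=(500, 2))

def project(p):
    x, y = p
    return np.array([x ** 2, y ** 2, x * y, x, y])

w = np.array([1 / a ** 2, 1 / b ** 2, 0.0, 0.0, 0.0])   # a fixed weight vector in the 5-D space

vals_linear = np.array([np.dot(w, project(p)) for p in pts])   # linear function of the projection
vals_direct = (pts[:, 0] / a) ** 2 + (pts[:, 1] / b) ** 2      # the original nonlinear ellipse test
print(np.allclose(vals_linear, vals_direct))                   # True: thresholding at 1 gives the same ellipse
```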

It would be annoying if we had to do the projections for all the points manually. That's where the kernel comes in: we don't need to. We just need the corresponding kernel function that, for two given points, gives us the dot product in our projected 5-dimensional space. That's what we call the "kernel trick".

In fact, we do not even really care about what space we are mapping to. As long as the kernel function is a valid inner product (see Inner product space - Wikipedia) we’re good to go. Sometimes this even corresponds to a mapping to an infinite-dimensional space.

Footnotes

Profile photo for Balaji Pitchai Kannu

It is not possible to find a hyperplane or a linear decision boundary for some classification problems. If we project the data in to a higher dimension from the original space, we may get a hyperplane in the projected dimension that helps to classify the data.

As we shown in the above figure, it is impossible to find a single line to separate the two classes (green and blue) in the input space. But

It is not possible to find a hyperplane or a linear decision boundary for some classification problems. If we project the data in to a higher dimension from the original space, we may get a hyperplane in the projected dimension that helps to classify the data.

As we shown in the above figure, it is impossible to find a single line to separate the two classes (green and blue) in the input space. But, after projecting the data in to a higher dimension (i.e. feature space in the figure), we could able to find the hyperplane which classifies the data. Kernel helps to find a hyperplane in the higher dimensional space without increasing the computational cost much. Usually, the computational cost will increase, if the dimension of the data increases.

How come Kernel doesn’t increase the computational complexity?

We knew that the dot product of same dimensional two vectors gives a single number. Kernel utilizes this property to compute the dot product in a different space without even visiting the space.

Assume that, we have two features. It means that dimension of the data point is [math]\mathbb R^2. [/math]

[math]x_{i} = \begin{bmatrix} x_{i1} \\ x_{i2} \end{bmatrix} \tag*{}[/math]

i in the x subscript represent the data points. Similarly, 1 and 2 in the x subscript denote the features. Also, assume that we are applying some transformation function to convert two dimensional input space (two features) in to four dimensional feature space which is [math](x_{i1}^{2}, x_{i1} x_{i2}, x_{i2} x_{i1}, x_{i2}^{2}).[/math] It requires [math]\mathcal{O}(n^{2})[/math] time to calculate n data points in the four dimensional space. To calculate the dot product of two vectors in the four dimensional space/transformed space, the standard way is

1. Convert each data point from [math] \mathbb R^2 \to \mathbb R^4 [/math]by applying the transformation. (I have taken two data points [math]x_{i}[/math] and [math]x_{j}[/math])

[math]\phi(x_{i}) = \begin{bmatrix} x_{i1}^{2}\\ x_{i1} x_{i2} \\ x_{i2} x_{i1}\\ x_{i2}^{2} \end{bmatrix} \hspace{2cm} \phi(x_{j}) = \begin{bmatrix} x_{j1}^{2}\\ x_{j1} x_{j2} \\ x_{j2} x_{j1}\\ x_{j2}^{2} \end{bmatrix} \tag*{}[/math]

2. Take the dot product of the two vectors.

[math]\phi(x_{i}) \cdot \phi(x_{j}) \tag*{}[/math]

As I said before, the kernel function calculates the dot product in the transformed space without ever visiting it. The kernel function for the above transformation is

[math]K(x_{i},x_{j}) = (x_{i}^{T}x_{j})^{2} \tag{1}[/math]

Example:

Let's say [math]x_{i} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \hspace{0.5cm}\text{and} \hspace{0.5cm}x_{j} = \begin{bmatrix} 3 \\ 5 \end{bmatrix}.[/math]

The dot product in the four-dimensional space, computed the standard way, is

[math]= \begin{bmatrix} 1 \\ 2 \\ 2 \\ 4 \end{bmatrix} \cdot \begin{bmatrix} 9\\ 15 \\ 15 \\ 25 \end{bmatrix} = 9+30+30+100 = 169 \tag*{}[/math]

The above dot product can be calculated using the kernel function (equation 1) without ever transforming into the higher-dimensional space.

[math] K(x_{i},x_{j}) = \left( \begin{bmatrix} 1 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 5 \end{bmatrix} \right)^{2} = (3 + 10)^{2} = 13^{2} = 169 \tag*{}[/math]
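As a quick sanity check (a small sketch of mine, not part of the original answer), the same numbers can be verified in a few lines of Python:

    import numpy as np

    x_i = np.array([1.0, 2.0])
    x_j = np.array([3.0, 5.0])

    # Standard way: map to R^4 first, then take the dot product.
    phi = lambda v: np.array([v[0]**2, v[0]*v[1], v[1]*v[0], v[1]**2])
    standard = np.dot(phi(x_i), phi(x_j))   # 9 + 30 + 30 + 100 = 169

    # Kernel way: stay in R^2 and square the ordinary dot product.
    kernel = np.dot(x_i, x_j) ** 2          # (3 + 10)^2 = 169

    print(standard, kernel)                 # 169.0 169.0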

Profile photo for Yoshua Bengio

Basically, a kernel-based SVM requires on the order of n^2 computation for training and order of nd computation for classification, where n is the number of training examples and d the input dimension (and assuming that the number of support vectors ends up being a fraction of n, which is shown to be expected in theory and in practice). Instead, a 2-class linear SVM requires on the order of nd computation for training (times the number of training iterations, which remains small even for large n) and on the order of d computations for classification. So when the number of training examples is large (e.g. millions of documents, images, or customer records) then kernel SVMs are too expensive for training and very expensive for classification (unless one uses one of the approximations that have been proposed but have not yet become widely used as far as I know; someone more expert in this area could correct me). In the case of sparse inputs, linear SVMs are also convenient because d in the above can be replaced by the average number of non-zeros in each example input vector.
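As a rough, hedged illustration of this cost difference (my sketch, not the answerer's; exact timings depend on hardware and hyperparameters), scikit-learn's LinearSVC and kernelized SVC can be timed on the same synthetic data:

    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC, SVC

    X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

    for name, clf in [("linear", LinearSVC(dual=False)),
                      ("rbf kernel", SVC(kernel="rbf"))]:
        t0 = time.perf_counter()
        clf.fit(X, y)
        print(f"{name:>10s}: trained in {time.perf_counter() - t0:.1f}s")
    # On typical hardware the kernelized SVC becomes dramatically slower as the
    # number of examples grows, consistent with the ~n^2 vs ~nd argument above.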

Profile photo for Shayne Miel

Two reasons. The first is that, depending on the type of kernel, it projects your data into a higher dimensional feature space. Sometimes into a space of infinite dimensions! If your data is not linearly separable in the original feature space, there's a good chance that it might be when projected into higher dimensions.

The second, perhaps more intuitive, reason is that it imbues each of your data points with information about the rest of the training set. The kernel can be viewed as a measure of similarity, so that your features become "how similar is instance 1 to instance 2?", "how similar is instance 1 to instance 3?", etc. This makes it easy for the classifier to say, "Instance 1 is a lot like these other instances who all have label 'A'. Perhaps instance 1 should also have label 'A'."
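One way to see this "similarity feature" view concretely (a sketch of mine, assuming scikit-learn's rbf_kernel helper) is to build the kernel (Gram) matrix explicitly: row i then literally contains "how similar is instance i to every training instance".

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(5, 3))      # 5 instances, 3 original features

    # Each row of K is a new feature vector for one instance:
    # its RBF similarity to every instance in the training set.
    K = rbf_kernel(X_train, X_train, gamma=0.5)
    print(K.shape)        # (5, 5)
    print(K[0])           # similarities of instance 0 to instances 0..4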

Profile photo for Sumit Soman

Support Vector Machines tend to find a linear decision boundary between points of two classes based on the maximum-margin principle, where the objective is to find a set of points which lie on two sides of the plane at a distance of at least unity. These points are the support vectors and the plane midway is the separating hyperplane which is always a linear plane.

Now, in practice, the dataset may not be linearly separable. Take the common example of a two-input XOR gate, with inputs x1 and x2 and output y. They are related as

x1  x2  |  y
0   0   |  0
0   1   |  1
1   0   |  1
1   1   |  0

Now if you plot these points in two dimensions with x1 and x2 as the features and y as the label, you can see that it is not possible to find a linear separating hyperplane that would separate the points of the two classes in this space of two dimensions.

Now, let us say we introduce a third dimension x3, which is computed as x3=(x1-x2)^2. The data projected in this high dimensional space will be

x1  x2  x3  |  y
0   0   0   |  0
0   1   1   |  1
1   0   1   |  1
1   1   0   |  0

Now if you visualize the data in this 3-dimensional space as shown below, you can see that it is linearly separable by a hyperplane.

I have not shown the hyperplane in the figure but it is easy to visualize several linear planes that can separate the above dataset. This is what kernels allow us to do. We can implicitly map the data to a higher dimensional space where it is linearly separable, and solve the SVM formulation in that space to obtain a linear decision boundary.
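The XOR construction above is easy to reproduce. The following sketch (mine, with an arbitrary choice of scikit-learn's LinearSVC) adds the x3 = (x1 - x2)^2 feature and checks that a linear classifier now separates the data perfectly:

    import numpy as np
    from sklearn.svm import LinearSVC

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])

    # In the original 2-D space a linear SVM cannot fit XOR ...
    print(LinearSVC(dual=False).fit(X, y).score(X, y))   # well below 1.0

    # ... but with the extra feature x3 = (x1 - x2)^2 it becomes separable.
    x3 = (X[:, 0] - X[:, 1]) ** 2
    X3 = np.column_stack([X, x3])
    print(LinearSVC(dual=False).fit(X3, y).score(X3, y))  # 1.0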

There can be several kernel functions. Those satisfying Mercer's conditions are used; for more details one can refer to the tutorial on SVMs by Burges.

Burges, Christopher JC. "A tutorial on support vector machines for pattern recognition." Data mining and knowledge discovery 2.2 (1998): 121-167.

Profile photo for Quora User

SVM algorithms use a set of mathematical functions specified by the kernel. The kernel function takes the data as input and transforms it into the required form. Various SVM algorithms use various kernel types. These can be all kinds of functions: for instance linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid functions.

Kernel functions can be defined for many kinds of data: sequences, graphs, text, images, and plain vectors. RBF is the most frequently used kernel type, since it has a localized and finite response along the entire x-axis.
The kernel function returns the inner product between two points in a suitable feature space, thereby defining a notion of similarity, even in very high-dimensional spaces, at little computational cost.


1. Kernel Rules

Specify the following kernel, or window, function:

[math]K(z) = \begin{cases} 1 & \text{if } \lVert z \rVert \le 1 \\ 0 & \text{otherwise} \end{cases} \tag*{}[/math]

This function equals 1 on the closed ball of radius 1 centred at the origin, and 0 outside it.

For a fixed [math]x_{i}[/math], the function [math]K\!\left(\frac{x - x_{i}}{h}\right)[/math] equals 1 on the closed ball of radius [math]h[/math] centred at [math]x_{i}[/math], and 0 outside it.

Therefore, you have shifted and rescaled the window by choosing [math]K(\cdot)[/math] to be centred at the point [math]x_{i}[/math] with radius [math]h[/math].
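A minimal sketch of that window function (my addition, assuming the Euclidean norm):

    import numpy as np

    def window_kernel(z):
        # K(z) = 1 inside the closed unit ball, 0 outside.
        return 1.0 if np.linalg.norm(z) <= 1.0 else 0.0

    def window_at(x, x_i, h):
        # K((x - x_i) / h): 1 inside the closed ball of radius h centred at x_i.
        return window_kernel((np.asarray(x) - np.asarray(x_i)) / h)

    print(window_at([0.4, 0.1], [0.0, 0.0], h=0.5))   # 1.0 (inside the ball)
    print(window_at([0.9, 0.1], [0.0, 0.0], h=0.5))   # 0.0 (outside the ball)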


2. SVM Kernels Examples :

Let's see some common SVM kernels and their uses:

2.1. Polynomial Kernel

It is common in image processing. Its equation is:

[math]k(x_{i}, x_{j}) = (x_{i} \cdot x_{j} + 1)^{d} \tag*{}[/math]

where d is the degree of the polynomial.

2.2. Gaussian Kernel

It is a general-purpose kernel, used when nothing more is known about the data. Its equation is:

[math]k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right) \tag*{}[/math]

2.3. Gaussian Radial Basis Function (RBF)

It is also a general-purpose kernel, used when no specific information about the data is available. Its equation is:

[math]k(x, y) = \exp\left(-\gamma \lVert x - y \rVert^{2}\right), \quad \text{for } \gamma > 0 \tag*{}[/math]

The parameter is often set to [math]\gamma = \frac{1}{2\sigma^{2}}[/math].

2.4. Laplace RBF Kernel

It is another general-purpose kernel, used when the data is not known in advance. Its equation is:

[math]k(x, y) = \exp\left(-\frac{\lVert x - y \rVert}{\sigma}\right) \tag*{}[/math]

2.5. Hyperbolic Tangent Kernel

We can use it in neural networks. Its equation is:

[math]k(x_{i}, x_{j}) = \tanh(\kappa\, x_{i} \cdot x_{j} + c), \quad \text{for some } \kappa > 0 \text{ and } c < 0 \tag*{}[/math]

2.6. Sigmoid Kernel

We may use this as a proxy for a neural network. Its equation is:

[math]k(x, y) = \tanh(\alpha\, x^{T} y + c) \tag*{}[/math]

2.7. Bessel Function of the First Kind Kernel

We may use it in mathematical functions to eliminate the cross term. Its equation is based on [math]J_{v+1}[/math], the Bessel function of the first kind.

2.8. ANOVA Radial Basis Kernel

We can use it in regression problems. Its equation is:

[math]k(x, y) = \sum_{k=1}^{n} \exp\left(-\sigma (x^{k} - y^{k})^{2}\right)^{d} \tag*{}[/math]

2.9. Linear Splines Kernel in One Dimension

It is helpful for handling large, sparse data vectors and is widely used in text categorization. The splines kernel also fits well in regression problems. Its equation is:

[math]k(x, y) = 1 + xy + xy\min(x, y) - \frac{x + y}{2}\min(x, y)^{2} + \frac{\min(x, y)^{3}}{3} \tag*{}[/math]
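For reference, here is a minimal Python sketch (my addition, not part of the list above) of a few of these kernels written directly from their equations; parameter values such as degree=3, gamma=0.5, sigma=1.0, alpha=0.01 and c=-1.0 are arbitrary illustrative choices.

    import numpy as np

    def linear(x, y):
        return np.dot(x, y)

    def polynomial(x, y, degree=3, c=1.0):
        return (np.dot(x, y) + c) ** degree

    def rbf(x, y, gamma=0.5):
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def laplace(x, y, sigma=1.0):
        return np.exp(-np.linalg.norm(x - y) / sigma)

    def sigmoid(x, y, alpha=0.01, c=-1.0):
        return np.tanh(alpha * np.dot(x, y) + c)

    x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
    for k in (linear, polynomial, rbf, laplace, sigmoid):
        print(k.__name__, k(x, y))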

So please feel free to discuss with me if you have any questions about SVM kernel functions. I am happy to answer them.

Profile photo for Bülent Koçer

I think you have probably found your answer by now. But I'm sure there are still some people who haven't been able to find an intuitive and simple answer to this question. I will explain it without any mathematics, using a few fruits.

When you are about to eat a strawberry, you can easily separate the red part and the green part simply by cutting with your teeth, because these parts are “linearly” separable.

On the other hand, the delicious part of a banana is inside its “shell”. Therefore, it’s not so simple to separate the delicious part and not-so-delicious part with a simple bite. So you need another “method” to make these parts linearly separable.

This is what the kernel function does in an SVM problem: it turns the linearly inseparable “fruit” into a yummy part and a part to be “discarded”.

Profile photo for Anonymous
Anonymous

You seem to be comparing apples and oranges. I am not sure what part is confusing for you, so I'll try to briefly cover everything.

Hard-margin
You have the basic SVM: hard margin. This assumes that the data is very well behaved, and you can find a perfect classifier, which will have 0 error on the training data.

Soft-margin
Data is usually not well behaved, so hard-margin SVM may not have a solution at all. We therefore allow a little bit of error on some points; the training error will not be 0, but the average error over all points is minimized.

Kernels
The above assumes that the best classifier is a straight line. But what if it is not a straight line? (e.g. it is a circle: inside the circle is one class, outside is another class). If we map the data into a higher dimension, the best classifier in that higher dimension may be a straight line.

There are many types of kernels that do this high dimensional mapping (Gaussian, Polynomial, etc.)


Solving SVM
When you solve the SVM optimization in the dual form (please see any material online for details), you realize that the solution depends on the training data points only through their dot products, and only some of the points (the support vectors) end up mattering.

For a traditional linear SVM (hard or soft margin), the dot product will be [math]x_1 \cdot x_2[/math]. This dot product is in the original space, so we call it a linear kernel.

For other kernels, this dot product will be computed as [math]\phi(x_1) \cdot \phi(x_2)[/math], where [math]\phi(\cdot)[/math] is the high-dimensional mapping.


This dot product can be obtained using kernel functions (polynomial, Gaussian, etc.)
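The point about the dual form can be made concrete with a small sketch (mine, with hypothetical toy values): once training has produced dual coefficients alpha_i, labels y_i, support vectors x_i, and a bias b, prediction only ever touches the data through K(x_i, x).

    import numpy as np

    def rbf(a, b, gamma=0.5):
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def decision_function(x, support_vectors, alphas, labels, b, kernel=rbf):
        # f(x) = sum_i alpha_i * y_i * K(x_i, x) + b  -- phi(x) is never computed.
        return sum(a * y * kernel(sv, x)
                   for a, y, sv in zip(alphas, labels, support_vectors)) + b

    # Toy values standing in for the output of a solved dual problem.
    svs = np.array([[0.0, 1.0], [2.0, 2.0]])
    alphas = np.array([0.7, 0.7])
    labels = np.array([+1, -1])
    print(np.sign(decision_function(np.array([0.2, 1.1]), svs, alphas, labels, b=0.0)))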

Profile photo for Sourav Chatterjee

Just to improve upon the earlier answer, Kernel functions are useful because of what is known as the "kernel trick".

If you have a solution to a problem that you can express in terms of inner products between the query vector and the training vectors (which is known as the dual form), that makes life much easier.
Why? Because once you write out the solution in this form, you can compute the inner product using any complicated kernel you want. The rest of the solution remains untouched, and your computations stay in the original low-dimensional space. You get away with computing just the kernel values themselves, which correspond to really high-dimensional spaces (even infinite-dimensional space, as in the case of the Gaussian kernel). Pretty neat!

Profile photo for Pramit Choudhary

SVM helps identify the hyperplane that best separates the input space according to the class labels. The performance of an SVM model often depends on the choice of kernel, which helps separate the data both linearly and non-linearly (by separating the data linearly in a higher-dimensional space).

Choice of Kernels:

  1. Linear
  2. Polynomial
  3. RBF (**my choice for non-linear decision boundaries. Low variance with high accuracy)
  4. Sigmoid (** may have large variance because it is not positive semi-definite (non-PSD), which might lead to an incorrect approximation)

Polynomial and RBF are particularly useful when the data points are not linearly separable. However, polynomial kernels are difficult to tune and can get computationally expensive. From my understanding, using a sigmoid kernel is similar to using a 2-layer perceptron (sigmoid functions are used as activation functions).

Profile photo for Luis Argerich

First: why is the RBF kernel the most widely used? Because an SVM is intrinsically a linear separator. When the classes are not linearly separable, we can project the data into a high-dimensional space and, with high probability, find a linear separation there; this is Cover's theorem. The RBF kernel does exactly that: it (implicitly) projects the data into infinitely many dimensions and then finds a linear separation.

The linear kernel works great when you have a lot of features, because then chances are your data is already linearly separable and an SVM will find the best separating hyperplane. Linear kernels are therefore great for very sparse data like text.

When the data is not linearly separable, the first choice is usually an RBF kernel, because it is very flexible, for the reasons explained in the first paragraph.

The practical way to decide which kernel to use is cross-validation; there's no arguing with success: if you find a kernel that works really well for your data, then that's the winner.
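In scikit-learn terms, that cross-validation procedure looks roughly like this (a sketch of mine, with illustrative parameter grids, not the answerer's code):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    param_grid = [
        {"kernel": ["linear"], "C": [0.1, 1, 10]},
        {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]},
        {"kernel": ["poly"], "C": [1], "degree": [2, 3]},
    ]
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)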

Profile photo for Mikael Rusin

SVD stands for Singular Value Decomposition.

It’s a specific form of series of decompositions - that allows for singular decompositional centralization of Vectors on both hand sides of an assignment operation (=).

SVD can be utilized to produce pseudo-inverses.

The reason why this is important - is because of a relationship that is called the Reciprocal theorem.

Where a functional interplay - bounds over and elapses to “relapse” to its origin - but not fully.

I.e - it’s a pseudo inverse.

This form of operation is extremely useful, because it’s an ingenious trick when it comes to Scaling factorizations - and utilizing lower dimensional representations - to calculate based off of Attributation.

To perform a full Inversion - is a costly operation.

Mostly - due to it being costly to perform deductive analytics of such high dimensions.

Here - is where the ingenuity of this decompositional factorization shows its strength.

You see - under the conditions of formations of matrices - under the predication of the System, the diagonalized values - and a couple of assertions - we can reduce the dimensionality - and then by virtue of performing the decompositional series of operations - such as scaling, rotating and flattening -

we can keep the generalization of the geometrical spatial relationship in terms of a lower dimensional space.

This is akin to drawing a map - generalizing the Vectorial space by predication of Eigenvalue relationships - reducing that map unto a Flat structure - and still keeping that relationship predicated - as the attributes are Invariant.

Seeing how the Trace is Invariant - the identity mapping under the specific conditions of our decomposition - remains intact.

By virtue of utilizing this Invariant relationship, by virtue of utilizing the Decompostional factorization and a Lowering of Dimensions - you basically lower every single other attribute and Dimensionality - utilizing a lower Dimensional structure - whilst retaining the same identity attributes.

Why this is extremely, extremely important - is because under Infinite dimensional kernel predications -

you generalize to a higher dimensional space.

However - a very common factorization - is to circumvent loss of information and generalize utilizing the Geometrical invariant attributes of the Formulational formulations of the Linear formations of the same Structure - whilst retaining minimal loss of information.

A trivial example would be to presume a matrix to be a form of simple system of fractions - akin to having a 3x3 matrix with the values:

10/50 10/50 10/50
05/50 05/50 10/50
0 0 0

If you are a keen observer - you’ll come to see - that you can equivalently express a fraction as a reduced form of another one - yet preserve the exact nature of the geometrical distancing, due to normalization of whole real numbers.

So - we could re-write this as:

1/5 1/5 1/5
1/10 1/10 1/5
0 0 0

Now - the point i am making here - is just for illustration.

That by utilizing the reformulation of shortening, scaling, rotation -

we can formalize a form of diagonalized and reduced Dimensional mapping, that retains its scaling factors.

It retains the informational attributal - yet having reduced the dimensionalities - that are relative.

This is exactly what we do in, for instance, Linear Least Squares.

Where by utilizing the minimal differential between the Residual interplays - we can ascertain minimal deviation - and minimize loss when Dimensionality is reduced.

However - in the case of SVD - we are generalizing the Spectral Allotment in terms of Vectorial interplays - and then by utilizing the Reciprocal theorem in relation to formulations of the matrices - and utilizing the predication of Identity ascribation and Identity Equality - we can perform a reduction of Dimensionality without any loss of information.

This is an extremely clever trick in terms of utilization of the Identity mapping - and utilizing the Unitary operational interplay - to utilize a lower dimensional representation - without any loss of Information.

It’s effectively like having taken the generalized minimalized norm - of the Spatial generalization of Vectors - and having utilized the Reciprocity of the matrix forms.

There exists extensions to it - as well - in terms of Atomic formulations - where the decomposition is a factorization across different dimensional cases - in relation to different Rank dynamics and Operators.

But - for sake of discussion - let’s stick to the Unitary case of the Singular Decomposition. It’s well enough to denote that the integrations of Dualities and the Hilbert spaces are there - if one wishes to seek it out.

It’s an ingenious way of handling Orthogonality - and utilizing the identity elements in relation to Singular decompositional invariance - to formalize min-max Fitting in relation to Orthogonal structures.

It’s utilized in many optimizations Schemes - where they predicate to formulate a form of Reduction of Form - or where a Fitting theorem must adhere to the conceptualization of minimization - especially in relation to Identity adherence.

Practical examples can vary from Molecular structure, to Spectral Analytics, to that of Manifold Approximations.

Seeing how the Hilbert spaces denotation is formulated here - I believe there are many, many roots that go further into Banach spaces - and into Functional Space representative dynamics - in relation to Orthogonal Min-Max decompositions.

It’s one of the standing pillars in relation to inductive reasoning - as well.

You see - in Induction - especially in mathematical proofs - you have to just showcase the base case - and one case beyond that - to project unto Factorial dynamics.

Which means - that if you can decompose unto Orthogonal Compression and Factorize accordingly to Operations - the Dimensionality of the problematique, does not matter.

It’s a generalization that will be able to formulate to any Dimensional space - and serves to be a form of “Skew” mapping.

Another case of usage - is formulations of Convolutional Filter dynamics.

You can utilize it to bring together Dimensions of different Continuums - to perform SVD on the integrated Orthogonal space.

Yet another benefit of the fact that it’s a manipulation of Identity elements.

It can double act as directional operational functional reductions - in relation to Kernel spaces - and even Co-Kernels.

Such an ingenious tool.

Thank you for this A2A - it has been fascinating.

Profile photo for Daniel Martín

There are many possible choices, but the most popular ones are either not to use a kernel, or to use a Gaussian kernel. I'll try to explain when to use each one:

  • No kernel at all. That's an option. Many SVM packages allow the use of SVM without a kernel (or using a "linear kernel", in some notations). When to use: It might be a good idea not to use a kernel when you have a large number of features and a small number of training examples. The reason is that you want to avoid potential overfitting due to the use of a non-linear function. Also, using no kernel might be good when performance constraints are tight (for example, in real-time applications).
  • Use a Gaussian kernel. You should do feature scaling before applying this type of kernel. This is important because if you don't do feature scaling and your features take a wide range of values, then the SVM would give more importance to the features with the highest values. When to use: it can be a good kernel when you have a large number of training examples and a small number of features (see the short pipeline sketch after this list).
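A minimal scikit-learn sketch of the "scale first, then Gaussian kernel" advice (my example, not the answerer's; the breast-cancer dataset is just a convenient one with features on very different scales):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    unscaled = SVC(kernel="rbf")
    scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

    print("no scaling  :", cross_val_score(unscaled, X, y, cv=5).mean())
    print("with scaling:", cross_val_score(scaled, X, y, cv=5).mean())
    # Features here span very different ranges, so scaling typically
    # improves the RBF-kernel score noticeably.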


Other, less common kernels are string kernels (http://en.wikipedia.org/wiki/String_kernel), which are suitable when your features are strings (study them if you are doing some kind of NLP), or chi-square kernels, which are widely used in my field, computer vision, because they are suitable for histogram-style feature vectors, like the histogram representation of a picture.

Profile photo for VH

In machine learning, a kernel is a function used to quantify the similarity between pairs of data points in a given dataset. Kernels are used in various algorithms, for example support vector machines (SVMs) and kernel principal component analysis (KPCA), and their purpose is to transform the data into a more useful form.

The basic idea behind using a kernel is to map the input data into a higher-dimensional space where it may be easier to separate different classes or groups of data points. In this higher-dimensional space, the kernel function can quantify the closeness between data points by computing the dot product between their corresponding feature vectors. By finding a suitable kernel function, the algorithm can learn a decision boundary that best separates the different classes or groups of data points.

Kernels have several advantages in machine learning algorithms. For example:

They allow for non-linear decision boundaries: by mapping the data into a higher-dimensional space, kernels allow non-linear decision boundaries that are not possible with linear methods such as logistic regression or linear SVMs.

They can be computationally efficient: by using the kernel trick, the calculations involved in finding the best decision boundary can be done in the original feature space, even though the data is implicitly mapped to a higher-dimensional space. This can lead to computational savings, especially when the data has a large number of features.

They can be tailored to the problem at hand: there are many different kernel functions available, and picking the right one can improve the performance of the algorithm. Some common kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid kernels.

Generally speaking, kernels play an important role in machine learning algorithms by allowing non-linear decision boundaries and making algorithms more efficient.

Profile photo for Jones Gitau

SVM, which stands for Support Vector Machine, is a supervised machine learning algorithm used for classification and regression tasks. The goal is to find the optimal dividing line (hyperplane) that best separates different data classes by maximizing the margin between them, which makes it particularly effective for complex datasets that may not be easily separated by a simple straight line. Essentially, it identifies the "best" boundary between data points of different categories by considering the data points closest to the boundary, called "support vectors", which have the most influence on the classification.

Profile photo for Håkon Hapnes Strand

For performance, the Gaussian kernel.

For simplicity, speed and interpretability, the linear kernel.

However, I rarely employ SVMs myself as there are almost always better options. The long training times and lack of interpretability are rarely justified by predictive performance compared to simpler regressions or tree methods.

Profile photo for Kevin Lacker

It depends on a lot of parameters. For some datasets a kernelized SVM won't be any slower.

One situation where this comes up: with a linear SVM you can optimize the coefficients on the dimensions directly, whereas with a kernelized SVM you have to optimize a coefficient for each point. With many more points than dimensions, the solution space is smaller for the linear SVM.

Another situation where this comes up is if you choose a slow kernel function: the cost of a single distance calculation can be a lot larger than with a linear SVM.

In general, if the speed of this algorithm is a bottleneck, you are probably doing things wrong, but my perspective is from industry rather than from academia, so there are a lot of situations where things could be different for you.

Profile photo for Quora User

SVD stands for “singular value decomposition”. It is a matrix factorization technique where a matrix is decomposed into a product of a square matrix, a diagonal (possibly rectangular) matrix, and another square matrix.

The diagonal matrix contains the “singular values”, the square roots of the eigenvalues of [math]M^H \cdot M[/math] (and of [math]M\cdot M^H[/math], if that is more convenient), where [math]M[/math] is your original matrix and [math]M^H[/math] is its Hermitian transpose (if you’re dealing with real numbers, it’s the same as [math]M^T[/math]). The square matrices contain the corresponding eigenvectors of [math]M\cdot M^H[/math] and of [math]M^H \cdot M[/math].
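A quick numpy check of that relationship (my addition, for a real matrix): the singular values of M are the square roots of the eigenvalues of M^T M.

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.normal(size=(4, 3))

    singular_values = np.linalg.svd(M, compute_uv=False)
    eigvals = np.linalg.eigvalsh(M.T @ M)           # eigenvalues of M^T M (ascending)

    print(np.sort(singular_values))                  # ascending, for comparison
    print(np.sqrt(np.clip(eigvals, 0, None)))        # same numbers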

It is often used to get a low-rank approximation of a matrix. You find the highest [math]k[/math] elements of the diagonal matrix and drop all other columns and rows, and do the same in the square matrices.

What does this mean? Well, this is easy to explain using a typical machine learning application. Imagine you have an online store with [math]N[/math] customers and [math]L[/math] items for sale. You keep track of what customers have bought what items. You keep all of that in a matrix [math]M[/math] of dimensions [math]N \times L[/math]. Each row corresponds to a customer, each column to an item.

Now, how would we calculate what customers buy the same kind of things as a given customer? Well, that’s easy: you calculate [math]M \cdot M^H[/math]. Each row of this will tell you, for each other customer, how many items they both bought. If this is high, they are very similar customers. You can do the same with the items, but here you use [math]M^H \cdot M[/math]. Of course, we can continue this: people who buy the same as people who buy the same as you. And so on, ad infinitum. There is a way to capture the limit of this process: the eigenvectors of [math]M \cdot M^H[/math]. These represent “stereotypical” users. The same for the items, you get “stereotypical items”. If the store is a book store, you could get a stereotypical book that corresponds to a combination of Lord of the Rings, Harry Potter, and Discworld which represented the “typical fantasy book”, for example.

If you now restrict the SVD to only [math]k[/math] (a few) columns, you keep the important stereotypes but need to store a lot less data. If [math]k[/math] is large enough, you get a very close approximation of the original data, but you need to store more. If you choose it smaller, you need less storage, but the approximation is less accurate.
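Here is what keeping only the top k singular values looks like in numpy (a sketch of mine; the customer-by-item matrix is random, just to show the mechanics):

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.poisson(0.3, size=(100, 40)).astype(float)   # customers x items

    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    k = 5                                                # keep k "stereotypes"
    M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation

    error = np.linalg.norm(M - M_k) / np.linalg.norm(M)
    print(f"rank-{k} approximation, relative error {error:.2f}")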

The way this information can then be used is through the familiar “people who like the things you do also purchased this; you might be interested in this item too” functionality which is the backbone of recommender systems such as Amazon, YouTube, Netflix, and many others.

Profile photo for Meir Maor

The cool thing is that the algorithm for SVM with a kernel and without is exactly the same. The algorithm for finding the maximum-margin separating hyperplane only looks at the data through the dot product operation.

With this observation we can attempt to efficiently transfer the problem to a different space (usually of higher dimension) and calculate the dot product there without actually mapping to the new space. We can even dream up spaces with infinite dimensions that we can't map to directly, but in which we still know how to calculate the dot product. This is what the famous RBF kernel does, for example.

So it's the maximum-margin hyperplane algorithm with the dot product replaced by a kernel function.
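One way to see "the same algorithm with the dot product swapped out" in practice (a sketch of mine, using scikit-learn's precomputed-kernel mode) is to hand the solver nothing but a Gram matrix of kernel values:

    from sklearn.datasets import make_moons
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

    K = rbf_kernel(X, X, gamma=2.0)                 # all the solver ever sees
    clf = SVC(kernel="precomputed").fit(K, y)

    K_test = rbf_kernel(X[:5], X, gamma=2.0)        # kernel values vs. training set
    print(clf.predict(K_test), y[:5])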

Profile photo for Quora User

A linear kernel is a linear function used as a kernel.

What is a kernel? Well, it is a function that tells you how “similar” two vectors are. A few simple functions:

  • [math]\vec{x}^T \cdot \vec{y}[/math]. This is the “linear” kernel function.
  • [math](\vec{x}^T \cdot \vec{y} + 1)^d[/math], which is a polynomial function.
  • [math]\exp(-\gamma||\vec{x} - \vec{y}||^2)[/math] (with some [math]\gamma > 0[/math]) which is behind the concept of a “radial basis function”, or RBF.
  • [math]\left. \begin{cases}1&\textrm{if }\; x = y\\0&\textrm{else}\end{cases}\right\}[/math] is very simple but not very useful in practice.

You should read the chapter on SVMs in your textbook to see where the kernels are used, and why this is important, to really understand the concept and get an idea of what kernel to use when.

Profile photo for Vasily Konovalov

When the data is not linearly separable, you can apply a kernel function and hope that in the new, higher-dimensional space the data is indeed linearly separable.

A great introduction to the use of kernel functions in SVMs can be found here: What are kernels in machine learning and SVM and why do we need them?

From my personal experience, a kernel function is not an all-powerful solution for linear separability, and there is no computationally effective kernel function (kernel trick) for every dataset; everything depends on the data. Therefore, if you have a separation problem and are thinking about the kernel trick, also try a Random Forest: in some cases it might find a better separation.

Profile photo for Chomba Bupe

When I use support vector machines (SVM) it is usually a linear SVM feeding on high-level features at the end of a model.

I don't really like using the kernel trick, I think it is better to have a much more powerful feature extractor like a pre-trained convolutional neural network (CNN) drive a linear SVM than use the shallow kernel trick.

The kernels may find a mapping in a high-dimensional space, which is cool and all, but it might overfit and not generalize as well as the above-mentioned approach.

Hope this helps.

Profile photo for Grzegorz Gwardys

Do you mean kernelized SVM classification or training? If we are talking about training, a kernelized SVM is slower, because, for example, a radial function needs more operations than a simple dot product. If we are talking about classification, we can say that it's much slower, because there is no explicit hyperplane in the kernelized case: we have to calculate "distances" to all support vectors instead of using a hyperplane equation.

Some numbers (in Python, classification task, 5977 objects with 256 dimensions, 685 SVs):

Linear SVM (using hyperplane): 0.71 seconds
Linear SVM (calculating distances to all SVs): 255.65 seconds
RBF SVM: 1154.43 seconds
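Those numbers are from one particular setup; as a hedged, reproducible sketch (mine, with arbitrary synthetic data of similar dimensionality), the same effect can be seen by timing predictions of a linear model against a kernelized one:

    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC, SVC

    X, y = make_classification(n_samples=5_000, n_features=256, random_state=0)

    linear = LinearSVC(dual=False).fit(X, y)     # stores one hyperplane
    rbf = SVC(kernel="rbf").fit(X, y)            # stores support vectors

    for name, clf in [("linear", linear), ("rbf", rbf)]:
        t0 = time.perf_counter()
        clf.predict(X)
        print(f"{name:>6s} prediction: {time.perf_counter() - t0:.3f}s")
    # The RBF model must evaluate the kernel against every support vector,
    # so its prediction time grows with the number of SVs.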

Profile photo for Rohit Sharma

SVM [ https://en.wikipedia.org/wiki/Support_vector_machine ] (aka Support Vector Machine) is primarily used as a large margin classification algorithm. It works by mapping the data items into a higher dimensional space to identify a hyperplane that acts as a decision boundary.

Consider the red and blue dots in the next picture, which represent two classes that cannot be linearly separated in 2-D space.

But they can become linearly separated by a transformation ...
