$\newcommand\array[1]{\begin{bmatrix}#1\end{bmatrix}}$ [MITx 6.86x Notes Index]
In this unit we start talking about problems where we don't have labels and we need to find a structure in the data itself. For example, Google News provides clusters of stories about the same event, introducing a structure in the feed of news stories.
We start by looking at hard assignment clustering algorithms, where we assume that each data point can belong to just a single cluster of (somehow defined) "similar" points.
We will then move to recognise data structures where each point can belong to multiple clusters, trying to discover the different components embedded in the data and looking at the likelihood a certain point was generated by that component.
And we will introduce generative models, which provide us with a probability distribution over the points. We will look at multinomials, Gaussians and mixtures of Gaussians. By the end of this unit we will get to a very powerful unsupervised learning algorithm for uncovering the underlying structure, called the expectation–maximization (EM) algorithm.
We will finally practice both clustering and the EM algorithm on a Netflix-recommendation task in Project 4.
- Understand the definition of clustering
- Understand clustering cost with different similarity measures
- Understand the K-means algorithm
The topic in this unit is unsupervised machine-learning algorithms. In unsupervised learning, we will still have a training set, but we will not have a label like in supervised learning.
So in this case we are provided with just a set of feature vectors, and we are trying to learn some meaningful structure that represents this set. In this lesson in particular we will cover clustering algorithms.
We will first start from the definition of a cluster.
While in a classification problem we already have the possible categories and the labels associated with them in the training data, in clustering we need to find a structure in the data and associate it with features of the data itself, in order to predict the category from the data itself without the need for an explicit label.
For example take this chart:
It is intuitive that the points belong to two different categories, in this case based on their position on the (x,y) axis, even if the points have no label (colour) associated to them.
Similarly, in many cases the meaningful structure is actually quite intuitive in the context of a specific application.
Finding a structure is however not enough. We need to be able to describe how well the structure we found represents the data, i.e. how good the clustering, the partitioning of the data, is (the "clustering cost"). For that we will need a measure of similarity between points.
We will finish the lecture with the K-means algorithm, an important and widely used algorithm which can take any set of training points and find a partitioning into K clusters.
To conclude the segment with an example, let's consider again Google News, where we have both an example of classification and one of clustering.
The grouping of the news into one of the categories (sport, science, business, ...) is a classification task. The categories are fixed, and someone has trained the algorithm by manually setting the category of each news item in a training set, so that now the category can be predicted automatically.
The grouping of news around individual news events (COVID-19, Brexit, the US elections, ...) is instead a clustering problem: here the categories are not predefined and there has been no training with manual labelling of all possible events.
What do we need for such task?
- We first need to provide a representation of the news stories, for example employing a bag-of-words approach, giving each story a (sparse) vector representation;
- Second, we need to decide how to compare each of these representations, in this case vectors;
- Finally we need to implement the clustering algorithm itself, based on the pairwise measure of proximity found in step two, or, for example, selecting one representative story to show as a central story in the news.
If the previous example dealt with text clustering for visualisation purposes, let's now deal with image clustering for compression purposes, and more specifically with creating a colour palette for the image, a form of image quantisation ("quantisation" refers to the process of mapping input values from a large set to output values in a countably smaller set, often with a finite number of elements).
A typical 1024 x 1024 pixel image coded at 24 bits per pixel (8 bits for each of the RGB colours) would take 1024 × 1024 × 24 = 25,165,824 bits = 3 MB.
If we create a palette restricted to only 32 colours and represent each pixel with one of the colours of this palette, the image would then take only 1024 × 1024 × 5 + 32 × 24 = 5,243,648 bits ≈ 640.1 KB (this, by the way, is a compression technique used in PNG and GIF images).
Note that the image quality depends on the number of colours of the palette we choose, i.e. on the number of clusters K. While cartoons, screenshots and logos would likely require a very small K to appear of a quality similar to the original image, for photos we would likely need a higher K to achieve the same quality.
Still, to get the compressed image we need to go through the three steps described in the previous segment. We need to represent the image (as a vector of pixels with a colour code). The second step is to find some similarity measure between colours, and finally we can run the clustering algorithm to replace each pixel's 24-bit representation with the new 5-bit one, choosing a map from the 16,777,216 original colours (24 bits) to the 32 "representative" ones (also to be found by the algorithm!).
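As a quick sanity check of the arithmetic above, here is a minimal Julia sketch of the size computation (the image and palette sizes are the ones assumed in the example):

```julia
# Back-of-the-envelope sizes for the palette-based compression described above.
w, h            = 1024, 1024
original_bits   = w * h * 24                               # 25_165_824 bits = 3 MB
palette_size    = 32
index_bits      = ceil(Int, log2(palette_size))            # 5 bits per pixel
compressed_bits = w * h * index_bits + palette_size * 24   # pixel indices + the palette itself
println("original:   ", original_bits / 8 / 1024^2, " MB")
println("compressed: ", compressed_bits / 8 / 1024, " KB")
```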
There are two concepts related to a cluster. The first one is the partitioning of the data, how the different data records are distributed among the clusters. The second one is the representativeness, how we aggregate the cluster members in a representative element per each cluster.
Let's look first at the partitioning view: given a set of $n$ data points and a number $K$ of clusters, a partition assigns every point to exactly one of the clusters $C_1,...,C_K$ (the clusters are disjoint and their union covers all the points).
This partitioning defines a so-called hard clustering, where every element belongs to just one cluster.
The representativeness view of clustering is the problem of how to select the feature vector to be used as representative of the cluster, like the news story to put as header of each event in Google News or the colour to use for each palette item. Note that the representative element of a cluster doesn't necessarily have to be an existing element of the cluster: it can also be a combination of them, like their average.
We will see later in this lesson what is the connection between these two views. Clearly they are connected, because if you know what is your partitioning, you may be able to guess who is the representative. If you know who is a representative, you may be able to guess who is a constituent, but we will see how we can actually unify these two views together.
Note that finding the optimal number of clusters K is itself another problem, as we will see in the next lecture.
Given the same set of feature vectors, we can have multiple clustering results, multiple ways to partition the set.
We need to assign a "cost" to each possible partition of our data set, in order to rank them (and find the one that minimises the cost).
We will make the assumption that the cost of a given partition is the sum of the costs of its individual clusters, where we can intuitively associate the cost with some measure of the cluster heterogeneity, like (a) the cluster diameter (the distance between the most extreme feature vectors, i.e. the outliers), (b) the average distance between members, or (c) (what we will use) the sum of the distances between every member and the cluster representative.
In turn we have to choose which metric we use to measure such distance. We have several approaches, among which:
- cosine distance, $dist(x^{(i)},x^{(j)}) = \frac{x^{(i)} \cdot x^{(j)}}{||x^{(i)}|| \, ||x^{(j)}||}$
- squared Euclidean distance, $dist(x^{(i)},x^{(j)}) = ||x^{(i)} - x^{(j)}||^2$
The cosine distance has the merit of being invariant to the magnitude of the vectors. For example, in terms of the Google News application, if we choose cosine to compare the vectors representing two different stories, the measurement will not take into account how long the stories are. The Euclidean distance, on the other hand, would be sensitive to the lengths of the stories.
While cosine looks at the angle between vectors (thus disregarding their magnitude), the Euclidean distance is similar to using a ruler to actually measure the distance (so cosine is essentially the same as Euclidean on normalised data).
"Cosine similarity is generally used as a metric for measuring distance when the magnitude of the vectors does not matter. This happens for example when working with text data represented by word counts. We could assume that when a word (e.g. "science") occurs more frequently in document 1 than it does in document 2, document 1 is more related to the topic of science. However, it could also be the case that we are working with documents of uneven lengths (Wikipedia articles for example). Then, "science" probably occurred more in document 1 just because it was way longer than document 2. Cosine similarity corrects for this." (from https://cmry.github.io/notes/euclidean-v-cosine)
While we'll need to understand which metric is most suited for a specific application (and whether it matters in the first place), for now we will employ the Euclidean distance.
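As a minimal illustration of the difference between the two measures, here is a short Julia sketch (the helper function names are mine, not part of the course material):

```julia
using LinearAlgebra

# cosine looks only at the angle, the squared Euclidean distance also at the magnitude
cosine_similarity(x, y) = dot(x, y) / (norm(x) * norm(y))
sq_euclidean_dist(x, y) = sum((x .- y) .^ 2)

x = [1.0, 2.0, 3.0]
y = 10 .* x                      # same direction, much larger magnitude ("longer story")

cosine_similarity(x, y)          # ≈ 1.0    → identical up to scale
sq_euclidean_dist(x, y)          # 1134.0   → very "far" for the Euclidean metric
```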
Given these two choices, our cost function for a specific partition becomes $cost(C_1,...,C_K; z^{(1)},...,z^{(K)}) = \sum_{j=1}^K \sum_{i \in C_j} ||x^{(i)} - z^{(j)}||^2$.
Note that for now we are writing this cost function as a function of both the partition and the representative vectors; once we determine how to compute the representative vectors from the partition itself, the cost will become a function of the partition alone.
How do we find the specific partition that minimises the cost? We can't iterate over all partitions, measure the cost and select the smallest one, as the number of (non-empty) K-partitions of a set of n elements is huge.
To be exact, this number is known as the "Stirling number of the second kind", and it is equal to $S(n,K) = \frac{1}{K!}\sum_{j=0}^{K}(-1)^j \binom{K}{j}(K-j)^n$.
For example, partitioning just $n = 100$ points into $K = 3$ clusters already gives about $8.6 \cdot 10^{46}$ possible partitions.
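A minimal Julia sketch of this count, using the explicit formula above (the function name is mine; `big(...)` is used to avoid integer overflow):

```julia
# Number of ways to partition n elements into k non-empty clusters
# (Stirling numbers of the second kind).
stirling2(n, k) = sum((-1)^j * binomial(big(k), j) * big(k - j)^n for j in 0:k) ÷ factorial(big(k))

stirling2(100, 3)   # ≈ 8.59 × 10^46 — far too many partitions to enumerate
```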
The K-means algorithm helps in finding a (locally) minimal-cost partition in a computationally feasible way.
In a nutshell, it involves (a) first randomly selecting K representative points. Then (b) iterating over each point to assign it to the cluster of the closest representative (according to the adopted metric), and (c) moving each representative to the centre of its newly acquired cluster (where "centre" again depends on the metric). Steps (b) and (c) are reiterated until the algorithm converges, i.e. the tentative K representative points (and their relative clusters) no longer move.
Graphically:
Random selection of 3 representative points in 2D:
Assignment of the "constituent" of each representative:
Moving the representative at the center of (the previous step) constituency:
Redefinition of the constituencies (note that some points have changed their representative):
Moving again the representatives and redefinition of the constituencies:
Let's describe this process a bit more formally in terms of the cluster costs.
- 1. Randomly select the representatives $z^{(1)},...,z^{(j)},...,z^{(K)}$
- 2. Iterate:
  - 2.1. Given $z^{(1)},...,z^{(j)},...,z^{(K)}$, assign each data point $x^{(i)}$ to the closest representative, so that the resulting cost is $cost(z^{(1)},...,z^{(K)}) = \sum_{i=1}^n \min_{j = 1,...,K}||x^{(i)} - z^{(j)}||^2$
  - 2.2. Given the partition $C_1,...,C_j,...,C_K$, find the best representatives $z^{(1)},...,z^{(K)}$ that minimise the total cluster cost, where now the cost is driven by the clusters: $cost(C_1,..,C_j,...,C_K) = \min_{z^{(1)},...,z^{(K)}}\sum_{j=1}^K \sum_{i \in C_j} ||x^{(i)} - z^{(j)}||^2$
Note that in both step 1 and step 2.2 we are not restricted to representatives that belong to the data set. The guarantee that the final representatives are part of the data set is instead one of the advantages of the K-medoids algorithm presented in the next lesson.
While for step 2.1 we can just iterate over each point and each representative to find, for each point, the representative that minimises the cost, we still need to define how exactly to do step 2.2. We will see later extensions of the K-means algorithm to other distance metrics, but for now let's stick with the squared Euclidean distance and note that each cluster selects its own representative independently.
Using the squared Euclidean distance, the "new" representative of each cluster turns out to be its centroid. When we compute the gradient of the cost function with respect to $z^{(j)}$ and set it to zero, $\nabla_{z^{(j)}} \sum_{i \in C_j} ||x^{(i)} - z^{(j)}||^2 = -2\sum_{i \in C_j}(x^{(i)} - z^{(j)}) = 0$, we obtain $z^{(j)} = \frac{1}{|C_j|}\sum_{i \in C_j} x^{(i)}$, i.e. the mean of the cluster members.
We stress however that this solution is linked to the specific definition of distance used, the squared Euclidean one.
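A minimal, didactic Julia sketch of a single K-means iteration under the squared Euclidean distance may look as follows (function and variable names are mine; the complete implementation linked at the end of this lesson is the reference one):

```julia
using Statistics

function kmeans_iteration(X, Z)
    n, d = size(X)          # n points in d dimensions (one point per row)
    K    = size(Z, 1)       # K current representatives (one per row)
    # step 2.1: assign each point to its closest representative
    assignments = [argmin([sum((X[i, :] .- Z[j, :]) .^ 2) for j in 1:K]) for i in 1:n]
    # step 2.2: move each (non-empty) representative to the centroid of its cluster
    Znew = copy(Z)
    for j in 1:K
        members = findall(==(j), assignments)
        if !isempty(members)
            Znew[j, :] = vec(mean(X[members, :], dims = 1))
        end
    end
    return assignments, Znew
end

# usage on a toy 2-D dataset
X = [0.0 0.0; 0.1 0.2; 5.0 5.0; 5.2 4.9]
Z = [0.0 1.0; 4.0 4.0]               # initial representatives (chosen arbitrarily here)
a, Z = kmeans_iteration(X, Z)        # a == [1, 1, 2, 2]; rows of Z are the new centroids
```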
Note that the K-means algorithm is guaranteed to converge to a local cost minimum, because at each iteration the cost can only decrease, the cost function is non-negative and the number of possible partitions, however large, is finite.
It is not guaranteed, however, to find a global cost minimum, and it is indeed very sensitive to the choice of the initial representative vectors: if we start with a different initialisation, we may get a very different partitioning.
Take the case of the above picture:
In the upper part we see the initial random assignment of the representative vectors, and in the bottom picture we see the final assignment of the clusters. While the representative vectors moved a bit, the final partition is clearly not the optimal one (where the representative vectors would be at the centre of the three groups).
In particular, we may run into trouble when the initial representative vectors are close to each other rather than spread across the multidimensional space.
While there are improvements to the K-means algorithm that perform a better initialisation than a random one, taking this consideration into account, we will use in class a vanilla K-means algorithm with simple random initialisation. An example of such a complete K-means algorithm in Julia can be found at https://github.com/sylvaticus/lmlj/blob/master/km.jl
In general we can say there are two classes of drawbacks of the K-means algorithm.
The first class is that the measure doesn't describe what we consider a natural classification, so the cost function doesn't represent what we would like the cost of the partition to be; it does not return useful information concerning the partition ranking.
The second class of problems involves instead the computational aspects to reach the minimum of this cost, like the ability to find a global minimum.
While the K-means algorithm scales well to large datasets, it has many other drawbacks.
One is that the vanilla K-means algorithm tries to find spherical clusters in the data, even when the groups have other spatial arrangements:
To account for this, K-means can be kernelised, where separation of arbitrary shapes can theoretically be reached using higher dimensional spaces. Alternatively, one could use a regularised Gaussian mixture model, like in this paper or in these slides.
Other drawbacks include the so-called "curse of dimensionality", where the K-means algorithm becomes less effective at distinguishing between examples, the manual choice of the number of clusters (which needs to be solved using cross-validation), and the fact of not being robust to outliers.
These further drawbacks are discussed, for example, here, here or here.
An implementation of K-Means algorithm in Julia: https://github.com/sylvaticus/lmlj.jl/blob/master/src/clusters.jl#L65
Objectives:
- Understand the limitations of the K-Means algorithm
- Understand how K-Medoids algorithm is different from the K-Means algorithm
- Understand the computational complexity of the K-Means and the K-Medoids algorithms
- Understand the importance of choosing the right number of clusters
- Understand elements that can be supervised in unsupervised learning
On top of the limitations already described in segment 13.8 (the computational problem of reaching the minimum and the cost function not really measuring what we want) there is a further significant limitation of the K-means algorithm: the z's are actually not guaranteed to be members of the original set of points x.
In some applications this is not a concern, but in others it is. For example, looking at Google News, if the representative we create for a cluster of stories doesn't correspond to any actual story, we have nothing to show.
We will now introduce the K-medoids algorithm, which modifies the K-means one to accept any kind of distance (not only the squared Euclidean one) and to return a set of representative vectors that are always part of the original data set.
We will finish the lesson discussing how to choose K, the number of clusters.
In K-means, whenever we randomly selected the initial representative points, we were not constrained to points of the data set: we could select any point on the plane.
In the K-medoids algorithm, instead, we start by selecting the representatives only from among the data points.
Step 2.1 (determining the constituencies of the representatives) remains the same, while in step 2.2, instead of choosing the new representatives as centroids, we constrain them again to be one of the points of the cluster. This allows us to just loop over all the points of the cluster to see which one minimises the cluster cost (see the sketch after the algorithm listing below). And as in both steps 2.1 and 2.2 we explicitly use a "distance" function, we can employ any definition of distance we want, i.e. we are no longer restricted to the squared Euclidean one.
The algorithm becomes then:
- 1. Randomly select the representatives $z^{(1)},...,z^{(j)},...,z^{(K)}$ from within the data points $x^{(1)},...,x^{(i)},...,x^{(n)}$
- 2. Iterate:
  - 2.1. Given $z^{(1)},...,z^{(j)},...,z^{(K)}$, assign each data point $x^{(i)}$ to the closest representative, so that the resulting cost is $cost(z^{(1)},...,z^{(K)}) = \sum_{i=1}^n \min_{j = 1,...,K} dist(x^{(i)}, z^{(j)})$
  - 2.2. Given the partition $C_1,...,C_j,...,C_K$, find the best representatives $z^{(1)},...,z^{(K)}$ from within the data points $x^{(1)},...,x^{(n)}$ that minimise the total cluster cost, where now the cost is driven by the clusters: $cost(C_1,..,C_j,...,C_K) = \sum_{j=1}^K \min_{z^{(j)} \in \{x^{(1)},...,x^{(n)}\}}\sum_{i \in C_j} dist(x^{(i)}, z^{(j)})$
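A minimal Julia sketch of the K-medoids update of step 2.2 for a single cluster, restricting the candidate representatives to the members of the cluster as described above (names are mine, and any distance function can be plugged in):

```julia
# The new representative (medoid) of a cluster is the member that minimises the total
# distance to all the other members, for an arbitrary distance function `dist`.
function best_medoid(X, members, dist)
    costs = [sum(dist(X[i, :], X[m, :]) for m in members) for i in members]
    return members[argmin(costs)]
end

# usage with the squared Euclidean distance
sq_euclidean(a, b) = sum((a .- b) .^ 2)
X       = [0.0 0.0; 0.1 0.2; 5.0 5.0]
members = [1, 2, 3]
best_medoid(X, members, sq_euclidean)   # returns 2, the "most central" point of the cluster
```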
Let's now compare the computational complexity of the two algorithms, using the big-O notation, which describes the order of growth: we look at the asymptotic growth and eliminate all the constants.
We often describe computational complexity using the "Big-O" notation. For example, if the number of steps involved is $2n^2 + 10n + 3$, the complexity is $O(n^2)$: for large $n$ the quadratic term dominates and the constant factors don't matter.
More formally, a function $f(n)$ is said to be $O(g(n))$ if there exist constants $c > 0$ and $n_0$ such that $f(n) \leq c \, g(n)$ for all $n \geq n_0$.
In other words, the function $g(n)$ is an asymptotic upper bound for $f(n)$, up to a constant factor.
The big-O notation can be used also when there are more input variables. For example, in this problem the number of steps necessary to complete one iteration depends on the number of data points $n$, the number of clusters $K$ and the dimensionality $d$ of the data.
We don't know how many iterations the algorithms take, so let's compare only the complexity of a single iteration.
The order of computation of step 2.1 of the K-means algorithm is $O(nKd)$, as we need to go through each point ($n$), compute the distance to each representative ($K$) and account for the fact that we deal with $d$-dimensional vectors. Step 2.2 has the same order, and because we're talking about asymptotic growth, when we sum the two steps up we still stay within the same order of complexity.
For the K-medoids algorithm, instead, step 2.2 is much more complex, as for each cluster we need to try every candidate representative and, for each candidate, go through the points of the cluster.
Depending on how the data is distributed across the clusters, the complexity of step 2.2 ranges from $O(n^2d/K)$, when the points are spread evenly among the clusters, up to $O(n^2d)$, in the extreme case where (almost) all the points end up in a single cluster.
Given that in any case the complexity is quadratic in $n$ (against the linear $O(nKd)$ of K-means), the K-medoids algorithm is computationally much more expensive per iteration, especially for large datasets.
At this point we already have seen two clustering algorithms, but there are hundreds of them available for our use. Each one has different strengths and weaknesses, and whenever we are selecting a clustering algorithm which fits for our application, we should account for all of them in order to find the best clustering algorithm for our needs.
First, let's recognise that the more clusters K we add to a model, the more the cost function decreases, as the distance from each point to the closest representative decreases, until the degenerate case where the number of clusters equals the number of data points and the cost is zero.
When is it time then to stop increasing the number of clusters? How do we decide the "optimal" number K? There are three general settings. In the first one the number of clusters is fixed, it is really given to us by the application. For example, we need to cluster our course content into 5 recitations; this is the space we have. Problem "solved" :-)
In the second setting, the number of clusters is driven by the specific application, in the sense that the optimal trade-off between cluster cost and K is determined by the application; for example, how many colours to include in a palette vs the compression rate of the image depends on the context of the application and our needs.
Finally, the "optimal" number of clusters could be determined in a cross-validation step when clustering is used as a pre-processing tool for a supervised learning tool (for example to add a dimension "distance from the cluster centroid" when we have few labelled data and many unlabelled ones). The supervised task algorithm would then have more information to perform its separation task. Here it is the result of the supervised learning task during validation that would determine the "optimal" number of clusters.
Finally, let's reflect on the word "unsupervised". One common misunderstanding is that in "unsupervised" tasks, since there are no labels, we don't provide our system with any knowledge: we just "let the data speak", pick an algorithm and let it provide a solution. In reality, the people who develop these unsupervised algorithms actually provide quite a bit of indirect supervision, of expert knowledge.
In clustering, for example, we decide which similarity measure to use and how many clusters to provide. And if you think about the bag-of-words approach for representing text, these algorithms cannot figure out which words are more semantically important than others, so it would be up to us to do some weighting, so that the clustering result is actually acceptable.
Therefore, we should think very carefully about these design choices, so that our clustering is consistent with our expectations.
An implementation of K-Medoids algorithm in Julia: https://github.com/sylvaticus/lmlj.jl/blob/master/src/clusters.jl#L133
- Understand what Generative Models are and how they work
- Understand estimation and prediction phases of generative models
- Derive a relation connecting generative and discriminative models
- Derive Maximum Likelihood Estimates (MLE) for multinomial and Gaussian generative models
The models we studied up to now were discriminative models, which learn an explicit decision boundary between classes. For instance, the SVM classifier, a discriminative model, learns its decision boundary by maximising the distance between the training data points and the learned decision boundary.
On the contrary, generative models work by explicitly modelling the probability distribution of each of the individual classes in the training data. For instance, Gaussian generative models fit a Gaussian probability distribution to the training data in order to estimate the probability that a new data point belongs to the different classes during prediction.
We will study two types of generative models, multinomial generative models and Gaussian generative models, and for both we will ask two types of questions: (1) how do we estimate the model, i.e. how do we fit our data (the "estimation" question), and (2) how do we actually make predictions with a (fitted) model (the "prediction" question)?
What are the advantages of generative models when compared to the discriminative ones?
For example, when your task is not just to do classification but also to generate new samples of (feature, label) pairs based on the information provided by the training data.
We now consider a data-generating probabilistic model where we have different categories and hence a discrete probability distribution (a PMF, Probability Mass Function).
We name $\theta_w$ the probability of each category $w$, with $\theta_w \geq 0$ and $\sum_w \theta_w = 1$.
Given such a probabilistic model, the generative model assigns a probability to every possible outcome (and can be used to sample new outcomes).
For example, the probabilistic model may be a word-generating model: given a fixed vocabulary of $W$ words, each word $w$ has probability $\theta_w$ of being generated, and a document is generated by sampling its words independently from this distribution.
If we are interested in the document as a specific sequence of words, for example "IT WILL RAIN IT WILL", then its probability is P("IT") · P("WILL") · P("RAIN") · P("IT") · P("WILL") = P("IT")² · P("WILL")² · P("RAIN"). More in general it is $\prod_{w \in W} \theta_w^{\text{count}(w)}$.
If we are interested instead in all the possible documents that use that specific set of word counts (in whatever order), {"IT":2, "WILL":2, "RAIN":1}, we have to consider and count all the possible orderings, like "IT WILL IT WILL RAIN", and there are $\frac{5!}{2!\,2!\,1!} = 30$ of them; this multiplicity factor (the multinomial coefficient) is what the multinomial distribution adds to the product of the word probabilities.
Note that the categorical and multinomial distributions are nothing else than extensions of, respectively, the Bernoulli and binomial distributions to multiple classes (rather than just binary 0/1 ones). They model the outcome of, respectively, one or multiple independent categorical trials, i.e. the outcome of the categorical distribution is the single word, while that of the multinomial distribution is the whole document (encoded as word counts).
For completeness, there are hence two possible ways of encoding a document and its probability: as an ordered sequence of words, whose probability is the product of the categorical probabilities of its words, or as an unordered vector of word counts, whose probability is given by the multinomial distribution.
Even if we will use the term "multinomial generative model", in this lecture we will consider the first approach of encoding a document and its relative probabilistic model (the categorical distribution)
Note indeed that in some fields, such as machine learning and natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a "categorical distribution" would be more precise.
So, for a particular document D, its probability of being generated as a sequence of samples from a categorical distribution with parameter $\theta$ is $P(D;\theta) = \prod_{w \in W} \theta_w^{\text{count}(w)}$, where count(w) is the number of occurrences of word $w$ in D.
Given an observed document and competing probabilistic models (with different thetas), we can then determine which is the one with the highest probability of having generated the observed document.
Let's consider for example a vocabulary of just two words, "cat" and "dog" and an observed Document "cat cat dog".
Let's also assume that we have just two competing models to choose from, one with a higher $\theta_{cat}$ and one with a higher $\theta_{dog}$: for each model we compute the probability of the observed document, $P(D;\theta) = \theta_{cat}^2\,\theta_{dog}$, and we select the model for which this probability is higher.
The joint probability of a set of given outcomes when we want to stress it as a function of the parameters of the probabilistic model (thus treating the random variables as fixed at the observed values) is known as the likelihood.
While often used synonymously in common speech, the terms “likelihood” and “probability” have distinct meanings in statistics. Probability is a property of the sample, specifically how probable it is to obtain a particular sample for a given value of the parameters of the distribution; likelihood is a property of the parameter values.
We want now to find which are the parameters that maximise the likelihood.
For the multinomial model this is $\text{argmax}_\theta\, L(D;\theta)$ with $L(D;\theta) = \prod_{w \in W} \theta_w^{\text{count}(w)}$.
It turns out that maximising its log (known as the log-likelihood) is computationally simpler and equivalent (in terms of the argmax, not of its value), because the log is a monotonically increasing function: wherever the likelihood is at its maximum, its log is at its maximum as well.
We hence compute the first order conditions of the log-likelihood in terms of theta, to find the theta that maximises it:
Let's first consider the specific case of just two categories, to generalise later (and let's call the outcomes just 0/1 instead of cat/dog).
In such a setting we have really just one parameter, $\theta$, the probability of outcome 1, as the probability of outcome 0 is just $1-\theta$; the log-likelihood of observing $n_1$ ones and $n_0$ zeros is then $\log L = n_1 \log \theta + n_0 \log(1-\theta)$.
Taking the derivative with respect to $\theta$ and setting it to zero, $\frac{n_1}{\theta} - \frac{n_0}{1-\theta} = 0$, we obtain $\tilde\theta = \frac{n_1}{n_0+n_1}$, i.e. the relative frequency of the 1s in the observed data.
Note the symbol of the tilde to indicate an estimated parameter. This is our "best guess" parameter given the observed data.
Let's now consider the full multinomial case.
So, given a document (or, as the words are assumed to be independent, a whole set of documents, just concatenated one to the other) and a vocabulary W, we want to find the parameters $\theta_w$ that maximise the log-likelihood $\sum_{w \in W} n_w \log \theta_w$, where $n_w$ is the number of occurrences of word $w$.
The problem is however that the thetas are not actually free: we have the constraint that the thetas have to sum to one (we also have the constraint that the thetas need to be non-negative, but given the nature of this maximisation problem we can ignore it).
We have hence a constrained optimisation problem that we can solve with the method of the Lagrange multipliers, a method to find the stationary values (min or max) of a function subject to equality constraints.
Let's consider the case of an objective function of two variables, $f(x,y)$, to be optimised subject to an equality constraint $g(x,y) = c$.
We can then write the so-called Lagrangian function as $ℒ(x,y,\lambda) = f(x,y) - \lambda\,[g(x,y) - c]$, where $\lambda$ is a new variable called the Lagrange multiplier.
All we need to do now is to find the stationary values of ℒ, regarded as a function of the original variables $x$ and $y$ and of the new variable $\lambda$.
The First Order necessary Condition (FOC) are:
$\begin{split} ℒ_x & = f_x -\lambda g_x = 0\\ ℒ_y & = f_y -\lambda g_y = 0\\ ℒ_\lambda & = c-g(x,y) = 0 \end{split}$
Solving this system of 3 equations in 3 unknowns will equivalently find the stationary points of the unconstrained function ℒ and of the original function $f$ subject to the constraint $g(x,y) = c$.
For a numerical example, let the function to optimise be $f(x,y) = xy$ subject to the constraint $x + y = 6$. The Lagrangian is $ℒ = xy - \lambda(x + y - 6)$ and its FOC are:
$\begin{split} ℒ_x & = y -\lambda = 0\\ ℒ_y & = x -\lambda = 0\\ ℒ_\lambda & = 6 - x - y = 0 \end{split}$
Solving for $x$, $y$ and $\lambda$ we obtain $x = y = \lambda = 3$: the constrained optimum of $f$ is at $(3,3)$, where $f = 9$.
As the sum of the thetas must be one, the Lagrangian of the log-likelihood is $ℒ = \sum_{w \in W} n_w \log \theta_w - \lambda \left(\sum_{w \in W} \theta_w - 1\right)$,
where $n_w$ is the number of occurrences of word $w$ in the document(s) and $\lambda$ is the Lagrange multiplier.
The FOC of the Lagrangian are then:
$\begin{split} ℒ_{\theta_w} & = \frac{n_w}{\theta_w} -\lambda = 0\\ ℒ_\lambda & = \sum_{w \in W} \theta_w - 1 = 0 \end{split}$
where the first equation is actually a gradient with respect to all the $\theta_w$, i.e. one equation per word of the vocabulary.
Solving the above system of equations we obtain $\lambda = \sum_{w \in W} n_w$ (the total number of words observed) and hence $\hat \theta_w = \frac{n_w}{\sum_{w' \in W} n_{w'}}$.
This set of $\hat \theta_w$ is the maximum likelihood estimate of our categorical model: the estimated probability of each word is simply its relative frequency in the observed documents.
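A minimal Julia sketch of this maximum likelihood estimate, i.e. word probabilities as relative frequencies (the function name is mine):

```julia
# MLE of the categorical ("multinomial") word model: θ_w = n_w / Σ n_w'
function categorical_mle(words)
    counts = Dict{String,Int}()
    for w in words
        counts[w] = get(counts, w, 0) + 1
    end
    n = length(words)
    return Dict(w => c / n for (w, c) in counts)
end

doc = ["it", "will", "rain", "it", "will"]
categorical_mle(doc)    # Dict("it" => 0.4, "will" => 0.4, "rain" => 0.2)
```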
Let's now see how we can actually use generative multinomial models to make predictions, for example deciding whether a document is in the Spanish or Portuguese language.
For that we are first given a large set of documents in both languages, together with their label (the language).
From there we can estimate the PMF of the words using the MLE estimator (i.e. counting the relative proportions) for both Spanish and Portuguese (so our vocabulary would contain both Spanish and Portuguese terms).
We are now given a new document without the language label. Calling the two languages as "+" and "-", we compute the likelihood of the document under the first hypothesis ($P(D|\theta^+)$) as the joint probability using the first PMF , and the likelihood under the second hypothesis ($P(D|\theta^-)$) using the second PMF, and we decide how to classify the document according to which one is bigger.
$P(D|\theta^+)$ and $P(D|\theta^-)$, the probabilities of the data item's features conditional on the class, are also known as class-conditional probabilities.
We can equivalently say that we classify the document as "+" if $\log \frac{P(D|\theta^+)}{P(D|\theta^-)} \geq 0$.
But $\log \frac{P(D|\theta^+)}{P(D|\theta^-)} = \sum_{w \in W} \text{count}(w) \log \frac{\theta_w^+}{\theta_w^-} = \sum_{w \in W} \text{count}(w)\, \theta'_w$,
where we define the per-word parameter $\theta'_w := \log \frac{\theta_w^+}{\theta_w^-}$.
The last expression shows that this classifier, derived through a generative view of classification, should actually remind us of a linear classifier through the origin with respect to this parameter $\theta'$.
Let's now try to compute the posterior probability $P(y = + \mid D)$, i.e. the probability that the document belongs to class "+" given the document itself.
To compute it, we will apply the Bayes rule, which states $P(y=+ \mid D) = \frac{P(D \mid y=+)\,P(y=+)}{P(D)}$,
and the total probability law, which states $P(D) = P(D \mid y=+)\,P(y=+) + P(D \mid y=-)\,P(y=-)$,
where $P(y=+)$ and $P(y=-)$ are the prior probabilities of the two classes, i.e. how likely each class is before observing the document.
We can think for example of a bilingual colleague giving us a document in either Spanish or Portuguese.
Going back to our plus/minus class notation instead of the languages, we can compute the log of the ratio between the posteriors of the two classes as $\log \frac{P(y=+ \mid D)}{P(y=- \mid D)} = \log \frac{P(D \mid \theta^+)}{P(D \mid \theta^-)} + \log \frac{P(y=+)}{P(y=-)}$.
Using the expression from the previous segment for the first term and defining the offset $\theta'_0 := \log \frac{P(y=+)}{P(y=-)}$, we obtain $\log \frac{P(y=+ \mid D)}{P(y=- \mid D)} = \sum_{w \in W} \text{count}(w)\, \theta'_w + \theta'_0$.
So what we actually see here is that we have again translated the model into a linear classifier. But in contrast to the previous case, where we had a linear classifier through the origin, now we have a linear classifier with an offset, and the offset itself is guided by our prior, which drives the location of the separator.
So what we've seen here is that in this model we can very easily incorporate our prior knowledge about the likelihood of certain classes. And in the end, even though we're talking about generative models and using a different mechanism for estimating the parameters of this multinomial, we actually get the same kind of linear separators that we saw in discriminative modelling.
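A minimal Julia sketch of this reformulation as a linear classifier with an offset; the per-word parameters and the priors below are made-up numbers for illustration, not estimates from real data:

```julia
# Generative classifier recast as a linear classifier:
# per-word weights θ'_w = log(θ⁺_w / θ⁻_w), offset θ'_0 = log(p⁺ / p⁻).
θplus  = Dict("cat" => 0.7, "dog" => 0.3)     # made-up class-conditional word probabilities
θminus = Dict("cat" => 0.2, "dog" => 0.8)
pplus, pminus = 0.5, 0.5                      # made-up class priors

weights = Dict(w => log(θplus[w] / θminus[w]) for w in keys(θplus))
offset  = log(pplus / pminus)

classify(counts) = sum(c * weights[w] for (w, c) in counts) + offset ≥ 0 ? "+" : "-"

classify(Dict("cat" => 2, "dog" => 1))        # "+" : the document looks like class "+"
```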
We will now switch, in place of the categorical distribution, to a generative (probabilistic) model based on the multidimensional Normal (or "Gaussian") distribution.
While the categorical distribution was appropriate for modelling discrete random variables, like classes, the Normal is the analogous counterpart for continuous random variables.
The multidimensional Normal has Probability Density Function (PDF) $f_X(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$,
where $\mu \in \mathbb{R}^d$ is the vector of the means and $\Sigma$ is the $d \times d$ covariance matrix.
So the PDF has a single spike in correspondence of the mean $\mu$ and then decays moving away from it (exponentially in the squared distance from $\mu$).
When all the components (dimensions) are uncorrelated (i.e. the covariance matrix $\Sigma$ is diagonal) and share the same variance $\sigma^2$ (i.e. $\Sigma = \sigma^2 I$), the Gaussian is called spherical and its PDF simplifies to $P(x; \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{d/2}} e^{-\frac{||x-\mu||^2}{2\sigma^2}}$.
We should also specify that a random vector is said to be (multivariate) Gaussian when every linear combination of its components follows a univariate Normal distribution.
It is important to note that in generative models we are introducing a specific functional form for the probabilistic model (here the Gaussian), and our freedom relates only to the parameters of the model, not to the form itself. Hence we first need to consider whether our data, our specific case, is appropriate to be modelled by such a functional form or not.
As for the multinomial case, if we have a training set $S_n = \{x^{(i)},\, i=1,...,n\}$ of points assumed to be generated independently by a spherical Gaussian, we can estimate its parameters $\mu$ and $\sigma^2$ by maximum likelihood.
The likelihood is $L(S_n; \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{(2\pi\sigma^2)^{d/2}} e^{-\frac{||x^{(i)}-\mu||^2}{2\sigma^2}}$.
The log-likelihood is then $\log L(S_n; \mu, \sigma^2) = -\frac{nd}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n ||x^{(i)}-\mu||^2$.
From this expression we can take the derivatives with respect to $\mu$ and $\sigma^2$, set them to zero and obtain the maximum likelihood estimates $\hat\mu = \frac{1}{n}\sum_{i=1}^n x^{(i)}$ and $\hat\sigma^2 = \frac{1}{nd}\sum_{i=1}^n ||x^{(i)} - \hat\mu||^2$.
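A minimal Julia sketch of these estimators for the spherical Gaussian (the function name is mine):

```julia
using Statistics

# MLE of a spherical (σ²I) multivariate Gaussian: the mean is the sample mean,
# the variance is the average squared distance from the mean divided by the dimension d.
function gaussian_mle(X)               # X is n × d (one point per row)
    n, d = size(X)
    μ  = vec(mean(X, dims = 1))
    σ² = sum(sum((X[i, :] .- μ) .^ 2) for i in 1:n) / (n * d)
    return μ, σ²
end

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
gaussian_mle(X)                        # ([3.0, 4.0], 2.666…)
```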
Objectives:
- Review Maximum Likelihood Estimation (MLE) of mean and variance in Gaussian statistical model
- Define Mixture Models
- Understand and derive ML estimates of mean and variance of Gaussians in an Observed Gaussian Mixture Model
- Understand Expectation Maximization (EM) algorithm to estimate mean and variance of Gaussians in an Unobserved Gaussian Mixture Model
So far, in clustering we have assumed that the data has no probabilistic generative model attached to it, and we have used various iterative algorithms based on similarity measures to come up with a way to group similar data points into clusters. In this lecture, we will assume an underlying probabilistic generative model that will lead us to a "natural" clustering algorithm called the EM algorithm .
While a “hard" clustering algorithm like k-means or k-medoids can only provide a cluster ID for each data point, the EM algorithm, along with the generative model driving its equations, can provide the posterior probability (“soft" assignments) that every data point belongs to any cluster.
Today we will talk about mixture models and the EM algorithm. The rest of the segment is just a recap of the two probabilistic models we saw: the multinomial (actually the categorical) and the multidimensional Gaussian.
The corresponding statistical question is: given observed data and the assumption that the data has been generated by the given probabilistic model, which are the most likely parameters of the distribution, like the individual category probabilities $\theta_w$ for the categorical model, or the mean $\mu$ and variance $\sigma^2$ for the Gaussian one?
This is known in statistics as parametric estimation. It can be solved by interpreting the joint probability of the observed data as a function of the parameters of the model (in doing so we call this joint probability the "likelihood") and then finding the parameter(s) of the model that maximise it, using first order conditions, possibly employing Lagrange multipliers if the maximisation has to respect some constraints, like setting the sum of all the categorical probabilities equal to one.
Please also consider that the sum of independent normal random variables remains normal: if $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are independent, then $X + Y \sim N(\mu_X + \mu_Y,\, \sigma_X^2 + \sigma_Y^2)$.
Let's now consider a generative (probabilistic) model consisting of two subsequent steps. We first draw a class $j$ using a K-categorical distribution with probabilities $p_1,...,p_K$ (the mixture weights).
We then draw the final data point from a Normal distribution with parameters $(\mu^{(j)}, \sigma_j^2)$ specific to the class $j$ selected in the first step.
This generative model goes under the name of mixture model and mixture models are actually very useful for describing many real-life scenarios.
The full set of parameters of this mixture model is then collectively represented by $\theta = \{p_1,...,p_K;\ \mu^{(1)},...,\mu^{(K)};\ \sigma_1^2,...,\sigma_K^2\}$.
The output of sampling a set of points from such model, using 2 components (two classes of the categorical distribution) and a two-dimensional Gaussian, would look something like this:
We can intuitively distinguish the two clusters and see that one is bigger than the other, both in terms of number of points (i.e. its $p_j$) and in terms of spread (i.e. its $\sigma_j^2$).
But where is the border between the two clusters? While the K-means or K-medoids algorithms would have provided a classifier, returning one cluster id for each point (hard clustering), a clustering based on a mixture generative model (to be implemented using the EM algorithm we will discuss in the next segment) would instead return, for each point, the probability of belonging to each cluster (soft clustering). For example, the points in the top-left corner of the picture will have a probability of belonging to the first cluster of almost one, and those in the bottom-right corner a probability of belonging to the second cluster (i.e. of being generated by a Normal distribution with parameters $(\mu_2,\sigma_2^2)$) of almost one, but those in the centre will have much more balanced probabilities.
So here, we have a much more refined way to talk about our distribution, because every point kind of belongs to different components, just with different likelihoods.
If we know all of the parameters of the K Gaussians and the mixture weights, the probability density function of the mixture at each point $x$ is given, using the total probability theorem, by $p(x; \theta) = \sum_{j=1}^K p_j\, N(x; \mu^{(j)}, \sigma_j^2 I)$.
Using the Bayes rule we can also find the posterior probability that a given point $x$ has been generated by the $j$-th component: $p(j \mid x) = \frac{p_j\, N(x; \mu^{(j)}, \sigma_j^2 I)}{\sum_{k=1}^K p_k\, N(x; \mu^{(k)}, \sigma_k^2 I)}$.
The likelihood associated with a set $S_n$ of $n$ independent observations is then $L(S_n; \theta) = \prod_{i=1}^n \sum_{j=1}^K p_j\, N(x^{(i)}; \mu^{(j)}, \sigma_j^2 I)$.
Now, given this expression, we want to take the derivatives of it with respect to my parameters, make them equal to 0, and find the parameters. It turns out however that's actually a pretty complex task.
We will first start to reason in a setting where we know the category associated with each point, to then generalise to a context where we don't know the class each point belongs to, with the EM algorithm.
Note also that while we will consider different means of the Gaussian across its dimensions (i.e. $\mu^{(j)}$ is a $d$-dimensional vector), we will keep a single scalar variance $\sigma_j^2$ per component (a spherical Gaussian), rather than a full covariance matrix.
Let's hence first assume that we know the cluster $j$ from which each point $x^{(i)}$ has been generated (the "observed" case).
For that, let's define an indicator function that, for a given pair (cluster $j$, point $i$), returns 1 if the point belongs to the cluster and 0 otherwise:
$\mathbf {1}(j;i):= {\begin{cases} 1 & {\text{if }} x^{(i)} \in \text{cluster}_j, \\ 0 & {\text{otherwise}}. \end{cases}}$
The log-likelihood can then be rewritten, using such indicator function and inverting the summations, as $\log L = \sum_{j=1}^K \sum_{i=1}^n \mathbf{1}(j;i) \left[\log p_j + \log N(x^{(i)}; \mu^{(j)}, \sigma_j^2 I)\right]$.
Note that in general the log of a sum does not decompose; it is the observed indicator function, which is non-zero for exactly one component per point, that lets us move the log inside and obtain this sum of per-cluster terms.
What this expression actually shows is that, when we try to find the parameters that optimise each mixture component, we can do the computation for each cluster independently. We can independently find the best parameters that fit each cluster because the assignments are observed: we know which cluster each point belongs to. So we can separately find the $\mu$, $\sigma$ and $p$ of each component.
By maximising the likelihood, we would find the following estimators:
- $\hat n_j$, the number of points of each cluster: $\hat n_j= \sum_{i=1}^n \mathbf {1}(j;i)$ (this actually just derives from our assumption of knowing the indicator function output)
- $\hat p_j = \frac{\hat n_j}{n}$
- $\hat \mu^{(j)} = \frac{1}{\hat n_j} \sum_{i=1}^n \mathbf {1}(j;i) \, x^{(i)}$
- $\hat \sigma_j^2 = \frac{1}{\hat n_j d} \sum_{i=1}^n \mathbf {1}(j;i) \, ||x^{(i)}-\hat\mu^{(j)}||^2$
These equations show how, in the observed case, when we know which component each point belongs to, we can estimate all the parameters we need to define our mixture of Gaussians.
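A minimal Julia sketch of these observed-case estimators, assuming the cluster label of each point is known and every cluster is non-empty (names are mine):

```julia
using Statistics

# Fit each mixture component independently from hard (observed) assignments.
function fit_observed_gmm(X, labels, K)
    n, d = size(X)
    p, μ, σ² = zeros(K), [zeros(d) for _ in 1:K], zeros(K)
    for j in 1:K
        members = findall(==(j), labels)   # assumes at least one point per cluster
        nⱼ      = length(members)
        p[j]    = nⱼ / n
        μ[j]    = vec(mean(X[members, :], dims = 1))
        σ²[j]   = sum(sum((X[i, :] .- μ[j]) .^ 2) for i in members) / (nⱼ * d)
    end
    return p, μ, σ²
end

X      = [0.0 0.0; 0.2 0.0; 5.0 5.0]
labels = [1, 1, 2]
fit_observed_gmm(X, labels, 2)   # p = [2/3, 1/3], μ = [[0.1, 0.0], [5.0, 5.0]], σ² = [0.005, 0.0]
```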
We now consider the unobserved case, where we don't know the mapping between the data points and the components that generated them.
The notation of the indicator function implies a hard counting: a point either belongs to one cluster with certainty or it doesn't. But one of the strengths of mixture models is that each point doesn't have to belong to one single cluster: it can have been generated by different components with different probabilities. So now, instead of talking about hard counts (belongs or doesn't belong), we'll talk about soft counts: how much does this point belong to this particular cluster?
What we want is $p(j \mid i)$, the posterior probability that point $x^{(i)}$ has been generated by component $j$.
If we knew all the parameters we could compute it using the Bayes rule: $p(j \mid i) = \frac{p_j\, N(x^{(i)}; \mu^{(j)}, \sigma_j^2 I)}{\sum_{k=1}^K p_k\, N(x^{(i)}; \mu^{(k)}, \sigma_k^2 I)}$.
Indeed, the numerator is the joint probability of selecting component $j$ and generating point $x^{(i)}$ from it, while the denominator is the total (marginal) probability of the point under the mixture.
But unfortunately there is no closed-form solution for maximising the likelihood of the unobserved case: the parameters depend on the posteriors $p(j \mid i)$, and the posteriors in turn depend on the parameters.
We hence need to break this circular dependence with a solving procedure in two parts: given my current best estimate of the parameters, I first compute $p(j \mid i)$ for every point and component; given these soft counts, I then re-estimate:
- $\hat n_j$, the (soft) number of points of each cluster: $\hat n_j= \sum_{i=1}^n p(j|i)$ (this is actually a weighted sum of the points, where the weights are the probabilities of the points belonging to cluster $j$)
- $\hat p_j = \frac{\hat n_j}{n}$
- $\hat \mu^{(j)} = \frac{1}{\hat n_j} \sum_{i=1}^n p(j|i) \, x^{(i)}$
- $\hat \sigma_j^2 = \frac{1}{\hat n_j d} \sum_{i=1}^n p(j|i) \, ||x^{(i)}-\hat\mu^{(j)}||^2$
At this point I can re-compute the posteriors $p(j|i)$ using the newly estimated parameters, then the parameters again, and so on.
Finding the posteriors given the parameters is known as the E (expectation) step, while finding the parameters that maximise the likelihood given the posteriors is known as the M (maximisation) step; hence the name EM algorithm.
Of course we are still left with the problem of how to initialise all the parameters for the first E step.
The EM algorithm indeed is very sensitive to initial conditions, as it is guaranteed to find only a local maximum of the likelihood.
We could either do a random initialisation of the parameter set $\theta$ or, for example, first run the (cheaper) K-means algorithm and use its output clusters to set the initial means, variances and mixture weights.
What is the likelihood of the observed dataset under the mixture model? It is $L(S_n; \theta) = \prod_{i=1}^n \sum_{j=1}^K p_j\, N(x^{(i)}; \mu^{(j)}, \sigma_j^2 I)$.
We could try to directly maximise that with respect to all the parameters at once, but, as said, the sum inside the product (and hence inside the log) prevents a closed-form solution.
Rather, let's write a likelihood for each single cluster and try to maximise each of them.
The first step is to find the weights that relate each point to each cluster, i.e. the posteriors $p(j \mid i)$.
Let's use the Bayes rule for that, noticing that the joint density of having both class $j$ and point $x^{(i)}$ is $p(j, x^{(i)}) = p_j\, N(x^{(i)}; \mu^{(j)}, \sigma_j^2 I)$.
Given that, the posterior is $p(j \mid i) = \frac{p_j\, N(x^{(i)}; \mu^{(j)}, \sigma_j^2 I)}{\sum_{k=1}^K p_k\, N(x^{(i)}; \mu^{(k)}, \sigma_k^2 I)}$ (the E step).
We can now rewrite the likelihood using a different weight for each point–cluster pair, weighting the contribution of each point to each cluster by $p(j \mid i)$; maximising this weighted log-likelihood cluster by cluster gives the M-step updates listed above.
At this point we go back to the E step, and we iterate until the likelihood converges to a (local) maximum.
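A minimal, didactic Julia sketch of a single EM iteration for a mixture of spherical Gaussians (names are mine; the full implementation linked below also handles initialisation, convergence checks and numerical issues):

```julia
# density of a d-dimensional spherical Gaussian N(μ, σ²I)
gauss(x, μ, σ²) = exp(-sum((x .- μ) .^ 2) / (2σ²)) / (2π * σ²)^(length(x) / 2)

function em_iteration!(X, p, μ, σ²)
    n, d = size(X)
    K    = length(p)
    # E step: soft counts p(j|i) via the Bayes rule
    post = [p[j] * gauss(X[i, :], μ[j], σ²[j]) for i in 1:n, j in 1:K]
    post ./= sum(post, dims = 2)
    # M step: re-estimate the parameters using the soft counts as weights
    for j in 1:K
        nⱼ    = sum(post[:, j])
        p[j]  = nⱼ / n
        μ[j]  = sum(post[i, j] .* X[i, :] for i in 1:n) ./ nⱼ
        σ²[j] = sum(post[i, j] * sum((X[i, :] .- μ[j]) .^ 2) for i in 1:n) / (nⱼ * d)
    end
    return post
end

# usage: 2 components on a toy 2-D dataset, means initialised by hand (e.g. from K-means)
X  = [0.0 0.0; 0.2 0.1; 5.0 5.0; 5.1 4.9]
p  = [0.5, 0.5]
μ  = [[0.0, 0.0], [4.0, 4.0]]
σ² = [1.0, 1.0]
for _ in 1:5
    em_iteration!(X, p, μ, σ²)
end
```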
An implementation of E-M algorithm in Julia: https://github.com/sylvaticus/lmlj.jl/blob/master/src/clusters.jl#L222
The reason why the EM algorithm is guaranteed to converge (although to a local extreme) is to be found in Jensen's inequality (details in tab 3 or this thread).
We would like to extend our EM algorithm for the case of the problem of matrix completion, which is a very important problem in machine learning. We've already seen how to do matrix completion using K nearest and Matrix Factorisation approaches (Lecture 7), and this is another approach to do matrix completion, which assumes a statistical model for the underlying data that we observe.
In this model there are $K$ user types; each user $n$ has a $d$-dimensional vector of movie ratings $x^{(n)}$, of which we observe only the entries indexed by $C_n$ (the movies the user actually rated), and we assume each full rating vector has been generated by a mixture of $K$ spherical Gaussians with weights $\pi_k$ and parameters $(\mu^{(k)}, \sigma_k^2)$.
Our task is to run the EM algorithm to find the posteriors $p(k \mid n)$ and the mixture parameters using only the observed ratings, and then use the fitted mixture to predict the missing entries of the matrix.
The only differences with the EM algorithm we saw earlier are that:
- (1) we need to account for the fact that the learning of the parameters has to come only from the available data, i.e. we need to "mask" our $x^{(n)}$ and $\mu^{(k)}$ arrays so that only the observed dimensions are used. The computation of the posteriors in the E step becomes $p(k \mid n) = \frac{\pi_{k}N(x_{C_{n}}^{(n)};\mu_{C_{n}}^{(k)},\sigma_{k}^{2}I_{C_{n}\times C_{n}}) }{ \sum_{k'=1}^{K}\pi_{k'}N(x_{C_{n}}^{(n)};\mu_{C_{n}}^{(k')},\sigma_{k'}^{2}I_{C_{n}\times C_{n}}) }$, and the FOC for $\mu$ and $\sigma^2$ that maximise the likelihood in the M step become $\widehat{\mu_d^{(k)}} = \frac{\sum_{n = 1}^N p(k \mid n) C_{(n,d)} x_d^{(n)}}{\sum_{n=1}^N p(k \mid n) C_{(n,d)}}$ and $\widehat{\sigma _ k^2} = \frac{1}{\sum_{n=1}^N |C_n| p(k \mid n)} \sum_{n = 1}^N p(k \mid n) \| x_{C_n}^{(n)} - \widehat{\mu_{C_n}^{(k)}}\| ^2$, where $C$ is the mask matrix;
- (2) a large number of dimensions, or few (or even no) observations for a user, can lead to numerical instabilities (very low values of the Normal PDF, or divisions by zero). Some computational tricks are then needed, like working in the log-domain (including computing the log-Normal instead of the "normal" Normal ;-) ), updating each $\mu$ only when its denominator is strictly positive (to avoid erratic behaviour), or keeping a minimum variance (suggested: 0.25).
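A minimal Julia sketch of the log-domain trick mentioned in point (2), using a "log-sum-exp" normalisation (names and numbers are mine, for illustration only):

```julia
# Posteriors p(k|n) computed from log-densities: they stay well defined even when
# every Gaussian density underflows to 0.0 in the direct computation.
logsumexp(v) = (m = maximum(v); m + log(sum(exp.(v .- m))))

# log of the spherical Gaussian density, evaluated only on the observed dimensions
loggauss(x, μ, σ²) = -sum((x .- μ) .^ 2) / (2σ²) - length(x) / 2 * log(2π * σ²)

x    = fill(3.0, 2000)                           # a high-dimensional (fully observed) vector
logπ = log.([0.5, 0.5])                          # mixture weights
lognum  = [logπ[k] + loggauss(x, fill(m, 2000), 0.25) for (k, m) in enumerate([3.5, 4.0])]
logpost = lognum .- logsumexp(lognum)            # log p(k|n): ≈ [0.0, -3000.0]
exp.(lognum)                                     # [0.0, 0.0] — the direct densities underflow
```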