There is little use in assuming the data consist of exactly 3 normal distributions. However, you may assume that there is more than one type of car, truck and bike. So instead of training a classifier for these three classes, you cluster cars, trucks and bikes into 10 clusters each (or maybe 10 car clusters, 3 truck clusters and 3 bike clusters, whatever), train a classifier to tell apart these 30 classes, and then merge the class results back to the original classes.
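A rough sketch of that recipe might look like this (using scikit-learn; the choice of k-means for the sub-clusters and logistic regression for the 30-class classifier is mine, the answer doesn't prescribe particular algorithms):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    def fit_subcluster_classifier(X, y, n_sub=10):
        """Split every original class into n_sub sub-clusters, train a classifier
        on the finer labels, and remember how to map them back."""
        fine_labels = np.empty(len(y), dtype=int)
        fine_to_coarse = {}
        next_label = 0
        for cls in np.unique(y):
            idx = np.where(y == cls)[0]
            km = KMeans(n_clusters=n_sub, n_init=10, random_state=0).fit(X[idx])
            fine_labels[idx] = next_label + km.labels_
            for sub in range(n_sub):
                fine_to_coarse[next_label + sub] = cls
            next_label += n_sub
        clf = LogisticRegression(max_iter=1000).fit(X, fine_labels)
        return clf, fine_to_coarse

    def predict_original_classes(clf, fine_to_coarse, X):
        # classify into the fine classes, then merge back to cars/trucks/bikes
        return np.array([fine_to_coarse[f] for f in clf.predict(X)])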
You may also discover that there is one cluster that is particularly hard to classify, for example trikes: they're somewhat cars and somewhat bikes. Or delivery trucks, which are more like oversized cars than trucks.
The other answers being good, I will try to provide another perspective and tackle the intuitive part of the question. The EM (Expectation-Maximization) algorithm is a variant of a class of iterative algorithms using duality.
In mathematics, a duality, generally speaking, translates concepts, theorems or mathematical structures into other concepts, theorems or structures, in a one-to-one fashion, often but not always by means of an involution operation: if the dual of A is B, then the dual of B is A.
Such involutions sometimes have fixed points, so that the dual of A is A itself. Usually a dual B of an object A is related to A in some way that preserves some symmetry or compatibility.
In a similar fashion, the EM algorithm can also be seen as two dual maximization steps of a single objective function (a lower bound on the log-likelihood, sometimes called the free energy): the E-step maximizes this function with respect to the distribution over the unobserved variables, and the M-step maximizes it with respect to the parameters.
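To make the shared objective explicit: in the usual variational formulation (my addition here, not part of the quoted description), both steps perform coordinate ascent on the same free-energy functional

$$
F(q, \theta) \;=\; \mathbb{E}_{q(z)}\big[\log p(x, z \mid \theta)\big] + H(q)
\;=\; \log p(x \mid \theta) \;-\; \mathrm{KL}\big(q(z)\,\|\,p(z \mid x, \theta)\big),
$$

where the E-step sets $q^{(t+1)} = \arg\max_{q} F(q, \theta^{(t)}) = p(z \mid x, \theta^{(t)})$ and the M-step sets $\theta^{(t+1)} = \arg\max_{\theta} F(q^{(t+1)}, \theta)$. Each step can only increase $F$, and since Jensen's inequality guarantees $F(q, \theta) \le \log p(x \mid \theta)$, the observed-data likelihood increases monotonically.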
In an iterative algorithm using duality there is the explicit or implicit assumption of an equilibrium (or fixed point) of convergence; for EM this is proved using Jensen's inequality. Note that when such an algorithm converges to a global optimum, it has found a configuration which is best in both senses, i.e. both over the distribution of the unobserved variables and over the parameters. However, the algorithm may only find a local optimum rather than the global one. For the statistical arguments and applications, other answers have given good explanations (check also the references in this answer).
There is also a YouTube video that explains the paper in more detail. In the case of the first trial's question, intuitively we'd think coin B generated it, since the proportion of heads matches B's bias very well. (This may be an oversimplification, or even fundamentally wrong on some levels, but I hope it helps on an intuitive level!) The comments on that answer show that the algorithm gets stuck at a local optimum, which also happens in my implementation if the parameters thetaA and thetaB are the same.
The core part of the implementation is the loop to run EM until the parameters converge.
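Such a loop might look roughly like the following. This is a minimal sketch rather than the original poster's code; the heads counts (5, 9, 8, 4, 7 out of 10 tosses per set) and the starting values 0.6 and 0.5 are taken from the worked example in the Do & Batzoglou paper.

    import numpy as np
    from scipy import stats

    def coin_em(heads, n_flips, theta_A=0.6, theta_B=0.5, tol=1e-6, max_iter=10000):
        """EM for the two-coin problem: heads[i] is the number of heads in the
        i-th set of n_flips tosses; which coin produced each set is hidden."""
        for _ in range(max_iter):
            # E-step: posterior probability that each set came from coin A,
            # assuming a uniform prior over the two coins.
            like_A = stats.binom.pmf(heads, n_flips, theta_A)
            like_B = stats.binom.pmf(heads, n_flips, theta_B)
            weight_A = like_A / (like_A + like_B)
            weight_B = 1.0 - weight_A

            # M-step: re-estimate each coin's bias from the expected counts.
            new_A = np.sum(weight_A * heads) / np.sum(weight_A * n_flips)
            new_B = np.sum(weight_B * heads) / np.sum(weight_B * n_flips)

            converged = abs(new_A - theta_A) < tol and abs(new_B - theta_B) < tol
            theta_A, theta_B = new_A, new_B
            if converged:
                break
        return theta_A, theta_B

    heads = np.array([5, 9, 8, 4, 7])      # five sets of ten tosses
    print(coin_em(heads, n_flips=10))      # roughly (0.80, 0.52)

If thetaA and thetaB start out equal, every weight is 0.5 and the two estimates update in lockstep, so they can never separate; that is exactly the stuck-at-a-local-optimum behaviour mentioned above.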
Suppose we have some data sampled from two different groups, red and blue: here, we can see which data point belongs to the red or blue group.
But if we cannot see the colours, everything looks purple to us: here we know that there are two groups of values, but we don't know which group any particular value belongs to. Can we still estimate the means for the red group and blue group that best fit this data? The very general idea behind the algorithm is this:

1. Start with an initial estimate of what each parameter might be.
2. Compute the likelihood that each parameter produces each data point.
3. Calculate weights for each data point indicating whether it is more red or more blue, based on the likelihood of it being produced by a parameter, and combine the weights with the data (expectation).
4. Compute a better estimate for the parameters using the weight-adjusted data (maximisation).
5. Repeat steps 2 to 4 until the parameter estimates converge (the process stops producing different estimates).

These steps need some further explanation, so I'll walk through the problem described above.

Example: estimating mean and standard deviation. I'll use Python in this example, but the code should be fairly easy to understand if you're not familiar with this language.
Specifically, each group contains values drawn from a normal distribution with its own mean and standard deviation (a runnable sketch of the whole example, with illustrative numbers, is given at the end of this walk-through). When we can see the colour of each point, i.e. which group it belongs to, it's easy to estimate the mean and standard deviation of each group on its own. When the colours are hidden, we work with the weights from step 3 instead. To update a mean, take the weighted sum of the data and divide by the total weight: essentially, we're finding where the weight is centred among our data points. To update a variance, take the weighted sum of squared differences from the current mean and divide by the total weight: essentially, we're finding where the weight is centred among the values for the difference of each data point from the mean.
This is the estimate of the variance; take the positive square root to find the standard deviation. For our data, plotting the fitted curves over the first five iterations of this process (with more recent iterations drawn more strongly) shows that the means are already converging on some values, and the shapes of the curves, governed by the standard deviations, are also becoming more stable.
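Putting the pieces together, a minimal sketch of the whole example might look like the following. This is not the original answer's listing: the seed, the true red/blue parameters and the initial guesses are illustrative values of my own.

    import numpy as np
    from scipy import stats

    np.random.seed(110)  # for reproducible results

    # true (hidden) parameters -- illustrative values
    red = np.random.normal(3, 0.8, size=20)    # red group: mean 3, std 0.8
    blue = np.random.normal(7, 2, size=20)     # blue group: mean 7, std 2
    both_colours = np.sort(np.concatenate((red, blue)))  # what we actually observe

    def estimate_mean(data, weight):
        # weighted mean: where the weight is centred among the data points
        return np.sum(data * weight) / np.sum(weight)

    def estimate_std(data, weight, mean):
        # weighted variance, then the positive square root
        variance = np.sum(weight * (data - mean) ** 2) / np.sum(weight)
        return np.sqrt(variance)

    # step 1: initial guesses (deliberately poor)
    red_mean_guess, red_std_guess = 1.1, 2.0
    blue_mean_guess, blue_std_guess = 9.0, 1.7

    for _ in range(20):
        # step 2: likelihood of each point under each current curve
        likelihood_of_red = stats.norm(red_mean_guess, red_std_guess).pdf(both_colours)
        likelihood_of_blue = stats.norm(blue_mean_guess, blue_std_guess).pdf(both_colours)

        # step 3: weights saying how "red" or "blue" each point is (expectation)
        red_weight = likelihood_of_red / (likelihood_of_red + likelihood_of_blue)
        blue_weight = likelihood_of_blue / (likelihood_of_red + likelihood_of_blue)

        # step 4: better estimates from the weight-adjusted data (maximisation)
        red_mean_guess = estimate_mean(both_colours, red_weight)
        blue_mean_guess = estimate_mean(both_colours, blue_weight)
        red_std_guess = estimate_std(both_colours, red_weight, red_mean_guess)
        blue_std_guess = estimate_std(both_colours, blue_weight, blue_mean_guess)

    print(red_mean_guess, red_std_guess, blue_mean_guess, blue_std_guess)

With these settings the four estimates should end up close to the generating values (around 3 and 0.8 for red, 7 and 2 for blue), even though the algorithm never sees the colours.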
It is also possible to estimate the proportion of points belonging to each group via EM. Maybe if we overestimate k, the proportion of all but two of the groups would drop to near zero; I haven't experimented with this, so I don't know how well it would work in practice.
But is the EM algorithm really needed, instead of some generic numerical technique for finding a maximum of the likelihood under the constraints involved? In general terms, the EM algorithm defines an iterative process that allows you to maximize the likelihood function of a parametric model in the case in which some variables of the model are, or are treated as, "latent" or unknown.
In theory, for the same purpose, you could use a general-purpose optimization algorithm to numerically find the maximum of the likelihood function over all parameters at once. In real situations, however, this direct optimization tends to be computationally heavy and awkward, because the objective is high-dimensional and hard to handle directly. A very common application of the EM method is fitting a mixture model. In this case, treating the variables that assign each sample to one of the components as the "latent" variables greatly simplifies the problem.
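Concretely, for a one-dimensional Gaussian mixture the two steps reduce to simple closed-form updates (the notation here is mine, not the answerer's). The E-step computes a "responsibility" for each sample and component, and the M-step refits each component from its responsibility-weighted data:

$$
\text{E-step:}\quad \gamma_{ik} \;=\; \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_j \pi_j\, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}
$$

$$
\text{M-step:}\quad \pi_k = \frac{1}{n}\sum_i \gamma_{ik}, \qquad
\mu_k = \frac{\sum_i \gamma_{ik}\, x_i}{\sum_i \gamma_{ik}}, \qquad
\sigma_k^2 = \frac{\sum_i \gamma_{ik}\, (x_i - \mu_k)^2}{\sum_i \gamma_{ik}}
$$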
Let's look at an example. To find the parameters without EM, we would have to minimize the negative log-likelihood of the mixture, $-\sum_i \log \sum_k \pi_k\, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)$, jointly over all parameters. Using the EM algorithm, on the contrary, we first (softly) "assign" each sample to a component (E step) and then fit, i.e. maximize, the likelihood of each component separately (M step). EM is not "needed instead of" some numerical technique, because EM is a numerical method as well.
So it's not a substitute for, say, Newton-Raphson. EM is for the specific case when you have missing values in your data matrix and you still want maximum-likelihood estimates of the parameters. For this purpose you use the EM method: in the E step you replace the missing values (more precisely, the sufficient statistics that depend on them) by their expected values given the observed data and the current parameter estimates, and in the M step you re-estimate the parameters as if the data were complete. You then repeat these two steps until the method converges to some value, which will be your estimate. If you need more information on the method, its properties, proofs or applications, just have a look at the corresponding Wikipedia article.
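As a concrete (and deliberately tiny) illustration of this missing-data use of EM, here is a sketch for a bivariate normal sample in which some values of the second coordinate are missing. The toy numbers and variable names are mine, not the answerer's.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: bivariate normal sample with ~30% of the x2 values missing (NaN).
    n = 200
    X = rng.multivariate_normal([1.0, -2.0], [[2.0, 1.2], [1.2, 3.0]], size=n)
    missing = rng.random(n) < 0.3
    X[missing, 1] = np.nan

    mean = np.nanmean(X, axis=0)           # crude starting values
    cov = np.cov(X[~missing].T)            # covariance from complete cases

    for _ in range(100):
        # E-step: expected x2 and x2^2 for rows where x2 is missing,
        # using the conditional normal of x2 given the observed x1.
        slope = cov[0, 1] / cov[0, 0]
        cond_var = cov[1, 1] - cov[0, 1] ** 2 / cov[0, 0]
        e_x2 = np.where(missing, mean[1] + slope * (X[:, 0] - mean[0]), X[:, 1])
        e_x2_sq = np.where(missing, e_x2 ** 2 + cond_var, X[:, 1] ** 2)

        # M-step: re-estimate mean and covariance from expected sufficient statistics.
        mean = np.array([X[:, 0].mean(), e_x2.mean()])
        s11 = np.mean(X[:, 0] ** 2) - mean[0] ** 2
        s12 = np.mean(X[:, 0] * e_x2) - mean[0] * mean[1]
        s22 = np.mean(e_x2_sq) - mean[1] ** 2
        cov = np.array([[s11, s12], [s12, s22]])

    print(mean, cov)

Only the second coordinate is allowed to be missing here, which keeps the conditional expectations one-dimensional; the general multivariate case works the same way, block by block.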
EM is used because it's often infeasible or impossible to directly calculate the parameters of a model that maximize the probability of a dataset given that model.
So each of these observations can come either from coin A or from coin B, each with some probability.
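In Bayesian terms, that per-set probability is just a posterior computed with Bayes' rule; assuming a uniform prior over the two coins (an assumption of mine, matching the usual presentation of this example), the weight given to coin A for a set of tosses $x_i$ is

$$
P(A \mid x_i) \;=\; \frac{P(x_i \mid \theta_A)\, P(A)}{P(x_i \mid \theta_A)\, P(A) + P(x_i \mid \theta_B)\, P(B)},
\qquad P(A) = P(B) = \tfrac{1}{2}.
$$

This is exactly the weight computed in the E-step of the coin sketch earlier; with a uniform prior the prior terms cancel and the weight is just the normalized likelihood.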