
The Top 5 Clustering Algorithms Data Scientists Should Know


Clustering algorithms are an essential aspect of Data Science. Every data scientist should be aware of the concept of clustering algorithms and their significance. Before we discuss the top 5 clustering algorithms that data scientists should know, we shall briefly see what clustering is and how it helps in Data Science.

What is Clustering?

In simple words, Clustering is a Machine Learning technique involving the grouping of data points. It is an unsupervised learning method and a popular technique for statistical data analysis. For a given set of data points, you can use clustering algorithms to classify these individual data points into specific groups. As a result, the data points in a particular group exhibit similar properties. At the same time, data points in different groups have dissimilar features.

The Significance of Clustering in Data Science

Clustering algorithms help you make sense of, and extract value from, large quantities of structured as well as unstructured data. They allow you to segregate data based on properties or features and group the data points into different clusters depending on their similarities.

Clustering algorithms have a variety of uses in different sectors. For example, you need clustering algorithms for classifying diseases in medical science. At the same time, clustering helps you to classify your customers in the field of market research.

We shall look at 5 popular clustering algorithms that every data scientist should be aware of.

1.    K-means Clustering

K-means Clustering is the most common clustering algorithm in Data Science. As it is easy to understand and implement, this algorithm forms a critical aspect of introductory data science and machine learning. The following graphic will help us understand the concept better.

[Graphic: K-means clustering iterations]

What do we infer from the graphic?

  1. Select the number of classes or groups and randomly initialize their center points. It is crucial to determine the number of classes you use, so take a good look at the available data and identify distinct characteristics. The center points, denoted as X in the graphic, are vectors having the same length as each data point vector.
  2. Classify each data point by calculating the distance between that point and each group centre, and assign the point to the group whose centre is nearest to it.
  3. Based on this assignment, compute the mean of all the vectors in each group and recalculate the group centre.
  4. Repeat the procedure for a number of iterations, or until the group centres do not vary much between iterations, as in the sketch below.
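To make the steps concrete, here is a minimal sketch using scikit-learn's KMeans on synthetic data; the four-cluster blob dataset, the parameter values, and the variable names are illustrative assumptions rather than part of the original example.

```python
# Minimal K-means sketch on synthetic data (illustrative assumptions throughout).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # synthetic 2-D points
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)     # number of groups chosen up front
labels = kmeans.fit_predict(X)       # iterate: assign points, then recompute centres
centres = kmeans.cluster_centers_    # final group centres after convergence
```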

Pros

  • K-means is a fast method because it involves relatively few computations; its cost grows roughly linearly with the number of data points.

Cons

  • Identifying and classifying the groups can be a challenging aspect.
  • As it starts with a random choice of cluster centres, the results can lack consistency.

2.    Mean-Shift Clustering

The Mean-Shift clustering algorithm is a sliding-window algorithm that helps you find dense areas of data points. It is a centroid-based algorithm whose objective is to locate the centre point of each group, and it works by updating candidate centre points to the mean of the points within the sliding window. A post-processing stage filters the candidate windows to eliminate near-duplicates, resulting in a final set of centre points and their corresponding groups.

[Graphic: Mean-Shift sliding window moving towards a dense region]

Explanation

  1. Assume that you have a set of points in two-dimensional space. Begin with a circular sliding window centred at a randomly selected point C, with radius r, as the kernel. This hill-climbing algorithm shifts the kernel to an area of higher density at each step until convergence.
  2. At every iteration, the window shifts towards denser regions by moving the centre point to the mean of the points within the window. The higher the number of points inside the window, the higher the density within the sliding window. As a result, shifting the window to the mean of the points inside it gradually moves it towards denser regions.
  3. Continue shifting the window according to the mean until you reach a position that accommodates the maximum number of points within it.
  4. Repeat this process with multiple sliding windows until every point lies within a window. When windows overlap, the window containing the higher number of points prevails. Finally, you cluster the data points according to the sliding window in which they reside.

The final result will look like this graphic.

[Graphic: final Mean-Shift clusters and their centre points]
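As a minimal sketch of the idea, the following code runs Mean-Shift in scikit-learn; the synthetic data and the estimated bandwidth, which plays the role of the window radius r, are assumptions made for illustration.

```python
# Minimal Mean-Shift sketch; the bandwidth stands in for the sliding-window radius r.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
bandwidth = estimate_bandwidth(X, quantile=0.2)    # data-driven choice of window radius
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)                         # cluster assignment per point
print("clusters found:", len(np.unique(labels)))   # the number of clusters is not pre-set
```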

Pros

  • Unlike the K-means clustering algorithm, you need not select the number of clusters.
  • The cluster centres converge towards the points of maximum density, which is desirable because it fits well with a data-driven interpretation of the clusters.

Cons

  • The selection of the window size, or radius r, is a non-trivial issue.

3.    DBSCAN – Density-Based Spatial Clustering of Applications with Noise

DBSCAN, a density-based clustering algorithm, is an improvement over Mean-Shift clustering because it offers specific advantages. The following graphic can clear up the matter for you.

[Graphic: DBSCAN growing clusters from dense neighbourhoods and marking noise]

  1. It starts with a random, unvisited data point. All points within a distance epsilon (Ɛ) of it classify as its neighbourhood points.
  2. You need a minimum number of points within the neighbourhood to start the clustering process. If that minimum is met, the current data point becomes the first point in a new cluster; otherwise, the point gets labelled as ‘Noise.’ In either case, the current point becomes a visited point.
  3. All points within the distance Ɛ of a cluster point become part of the same cluster. Repeat the procedure for all the new points added to the cluster.
  4. Continue with the process until you visit and label every point within the Ɛ neighbourhood of the cluster.
  5. On completion of one cluster, start again with a new unvisited point, leading to the discovery of further clusters or noise. At the end of the process, every point is marked as belonging to a cluster or as noise, as in the sketch below.
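A minimal sketch of DBSCAN in scikit-learn appears below; the eps and min_samples values and the two-moons dataset are illustrative assumptions, not prescriptions.

```python
# Minimal DBSCAN sketch; eps stands in for the distance threshold Ɛ and
# min_samples for the minimum neighbourhood size (both chosen for this toy data).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)                     # label -1 marks points treated as noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))
```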

Pros

  • DBSCAN is better than many other clustering algorithms because it does not require a pre-set number of clusters.
  • It identifies outliers as noise, unlike the Mean-Shift method that forces such points into the cluster in spite of having different characteristics.
  • It finds arbitrarily shaped and sized clusters quite well.

Cons

  • It is not very effective when you have clusters of varying densities. The distance threshold Ɛ and the minimum number of points that define a neighbourhood vary from cluster to cluster when the density levels change.
  • If you have high-dimensional data, determining the distance threshold Ɛ becomes a challenging task.

4.    Expectation-Maximization (EM) Clustering Using Gaussian Mixture Models (GMM)

GMMs are more flexible than K-means clustering. We begin with the assumption that the data points are Gaussian distributed. Two parameters describe the shape of each cluster: the mean and the standard deviation. As we have a standard deviation in both the X and Y directions, a cluster can take any elliptical shape; it is not mandatory for it to be circular. Thus, every single cluster is assigned a Gaussian distribution.

We use an optimization algorithm known as Expectation – Maximization (EM) to find out the parameters of the Gaussian for each cluster. The following graphic will explain things better.

[Graphic: EM iteratively fitting Gaussian distributions to the clusters]

  1. Similar to K-means clustering, we select the number of clusters and randomly initialize the Gaussian distribution parameters for each one of them.
  2. With this background, calculate the probability of each data point belonging to each cluster. The closer a point is to a Gaussian’s center, the better the chances of it belonging to that cluster.
  3. Based on these probabilities, we determine a new set of parameters for the Gaussian distributions that maximizes the probabilities of data points within the clusters. We compute these new parameters using a weighted sum of the data point positions, where the weight for each point is the probability of it belonging to the particular cluster.
  4. Repeat steps 2 and 3 until convergence, where the distributions change very little between iterations; a short code sketch follows this list.
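Here is a minimal sketch of EM clustering with a Gaussian Mixture Model in scikit-learn; the three-component assumption and the synthetic data are illustrative. The predict_proba output shows the soft, probability-based assignments discussed above.

```python
# Minimal GMM sketch; fit() runs the EM algorithm under the hood.
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=0)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                          # EM alternates the expectation and maximization steps
hard_labels = gmm.predict(X)        # most likely cluster for each point
soft_labels = gmm.predict_proba(X)  # probability of each point belonging to each cluster
```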

Pros

  • There is a higher level of flexibility regarding cluster covariance in the GMMs as compared to the K-means clustering because of the concept of standard deviation.
  • As this concept uses probabilities, a data point can have membership in multiple clusters. Therefore, if a particular data point lies in two overlapping clusters, we can further describe it by saying it belongs A% to Class 1 and B% to Class 2.

5.    Agglomerative Hierarchical Clustering

You have two categories of hierarchical clustering algorithms, top-down and bottom-up. The bottom-up approach treats each data point as an individual cluster at the initial stage and merges pairs of clusters until you have a single group containing all data points. Hence, it is also known as Hierarchical Agglomerative Clustering (HAC). Compare it to a tree where the root is the unique cluster that gathers all samples, and the leaves are clusters with a single sample. The following steps explain this concept better.

  1. Consider each data point as an individual cluster. Next, select a distance metric to measure the distance between two clusters. Use the average linkage method, where the distance between two clusters is the average distance between the data points in one cluster and the data points in the other.
  2. At each iteration, we merge the two clusters with the smallest average linkage into one.
  3. Repeat the above step until we have one large cluster containing all the data points, as in the sketch below.
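The sketch below runs bottom-up agglomerative clustering with average linkage in scikit-learn; cutting the tree at three clusters and the synthetic data are illustrative choices, since in practice you can cut the resulting tree wherever the clusters look best.

```python
# Minimal agglomerative (bottom-up) clustering sketch using average linkage.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=7)
hac = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = hac.fit_predict(X)   # clusters are merged pairwise until three groups remain
```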

Pros

  • There is no need to specify the number of clusters. You have the option of choosing the best-looking clusters.
  • This algorithm is not sensitive to the choice of distance metric.

Best Clustering Algorithms in Data Mining

There are various types of clustering algorithms in data mining. Here are some clustering algorithms in data mining with examples.

  1. Data Mining Connectivity Models – Hierarchical Clustering
  2. Data Mining Centroid Models – K-means Clustering algorithm
  3. Data Mining Distribution Models – EM algorithm
  4. Data Mining Density Models – DBSCAN

Inference

Many people confuse data mining with data science. You can refer to data mining as a close relative of data science because it works on similar principles. Clustering algorithms are a critical part of data science and hence have significance in data mining as well. Every data scientist should be aware of the clustering algorithms discussed above, as they are the five most commonly discussed algorithms in data science. Visit the Digital Vidya site for more information regarding clustering algorithms and their significance in data mining.

