R/stats : Outlier Detection Using Hierarchical Clustering
Monday, February 29, 2016
K-means is the most popular of the clustering techniques, but it comes with the overhead of choosing an optimal number of clusters to get effective output. Hierarchical clustering is a good option in such scenarios, since the number of clusters does not have to be supplied to the algorithm.
Moreover, the single-linkage form of hierarchical clustering, which uses the minimum distance between clusters, exhibits a chaining effect that can be used to detect outliers: when chaining occurs, the outlier is the last point to converge. The dendrogram is also a good visualization tool available with this clustering technique. It shows the height at which the data points converge into a single cluster, and an outlier can be clearly identified from the height at which it joins the cluster.
As an example, let us take a dataset dc1 containing attack events with the attributes volume, distribution and weightage of the events, and run hierarchical clustering on it as below:
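To make the chaining effect concrete, here is a minimal sketch on made-up one-dimensional data (the values and the name fit_demo are illustrative only and are not part of the dc1 dataset). Four evenly spaced points chain together at a low height, while the distant point joins last and at a much greater height.

```r
# Five 1-D points: four evenly spaced values and one far-away value.
# These numbers are invented purely for illustration.
x <- matrix(c(1, 2, 3, 4, 20), ncol = 1,
            dimnames = list(c("a", "b", "c", "d", "far"), "value"))

# Single linkage: distance between clusters = distance of their closest members.
fit_demo <- hclust(dist(x), method = "single")

# Merge heights, in the order the merges happen.
print(fit_demo$height)
# [1]  1  1  1 16
```

Points a and d are 3 units apart, yet they end up in the same cluster at height 1 because each neighbour is only 1 unit from the next; that is the chaining. The point "far" joins only at height 16, its distance to the nearest member of the chain.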
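# Drop the first column and use it as the row labels instead
attack <- dc1[,-1]
rownames(attack) <- dc1[,1]
# Euclidean distance matrix, then single-linkage clustering
d <- dist(attack, method="euclidean")
fit <- hclust(d, method="single")
# Plot the dendrogram and outline 5 clusters on it
plot(fit)
rect.hclust(fit, k = 5, border = 2:6)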
The output of the algorithm can be represented as a dendrogram, as below:
In the above case, the IP 184.108.40.206 clearly appears to be an outlier, as it converges at a much higher level than the rest.
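The same reading of the dendrogram can also be done programmatically. The sketch below is one possible approach, not necessarily how the original analysis was done. Since dc1 is not available here, it uses a synthetic stand-in with the same three attributes; the values and the host1 to host10 labels are placeholders. It looks at the gap before the final merge and then cuts the tree into two groups to see which observation ends up alone.

```r
# Hypothetical stand-in for dc1: nine similar events plus one extreme event.
set.seed(42)
attack <- data.frame(
  volume       = c(rnorm(9, mean = 100, sd = 5), 400),
  distribution = c(rnorm(9, mean = 10,  sd = 1), 60),
  weightage    = c(rnorm(9, mean = 1,   sd = 0.1), 5)
)
rownames(attack) <- paste0("host", 1:10)  # placeholder labels, not real IPs

d   <- dist(attack, method = "euclidean")
fit <- hclust(d, method = "single")

# Merge heights are non-decreasing for single linkage; a large jump before
# the final merge suggests that the last observation to join is isolated.
heights   <- fit$height
final_gap <- heights[length(heights)] - heights[length(heights) - 1]

# Cut the tree into two groups and report any group with a single member.
groups     <- cutree(fit, k = 2)
sizes      <- table(groups)
singletons <- as.integer(names(sizes)[as.vector(sizes) == 1])
outliers   <- names(groups)[groups %in% singletons]

print(final_gap)
print(outliers)
```

How large the gap must be before a point counts as an outlier is a judgment call that depends on the data. If several points are anomalous, cutting at k = 2 may not isolate them individually, so it is worth checking the dendrogram alongside any automated rule.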