At the very beginning, no one knows what sizes would accommodate the market. However, with some information about customers' height, girth, hip, weight, and so on, customers can be grouped into sub-groups. Shirts can then be designed around the average of each sub-group, and labels such as XS, S, and L can be assigned to those groups. In other words, this problem is equivalent to finding clusters based on the available information such that the variation within clusters is small and the variation between clusters is large.
However, we do not have such data, so we will start with the seeds data instead.
This tutorial illustrates implementations of clustering and association rule mining in R.
The examined group comprises kernels belonging to three different varieties of wheat: Kama, Rosa, and Canadian, with 70 elements each. A description of the dataset can be viewed at https://archive.ics.uci.edu/ml/datasets/seeds.
Our goal is to cluster these observations (kernels) into groups so that we can study them further.
seed = read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt', header=F)
seed = seed[,1:7]
colnames(seed) = c("area", "perimeter","campactness", "length", "width", "asymmetry", "groovelength")
Scale the data so that each column has zero mean and unit variance:
seed <- scale(seed)
head(seed)
## area perimeter campactness length width asymmetry
## [1,] 0.14175904 0.214948819 6.045733e-05 0.30349301 0.1413640 -0.9838010
## [2,] 0.01116136 0.008204153 4.274938e-01 -0.16822270 0.1969616 -1.7839036
## [3,] -0.19160873 -0.359341919 1.438945e+00 -0.76181710 0.2075516 -0.6658882
## [4,] -0.34626388 -0.474200066 1.036904e+00 -0.68733567 0.3187467 -0.9585276
## [5,] 0.44419577 0.329806966 1.371233e+00 0.06650665 0.8032397 -1.5597684
## [6,] -0.16067770 -0.267455401 1.019976e+00 -0.54740087 0.1413640 -0.8235144
## groovelength
## [1,] -0.3826631
## [2,] -0.9198156
## [3,] -1.1863572
## [4,] -1.2270506
## [5,] -0.4742231
## [6,] -0.9198156
The basic idea of k-means clustering is to define clusters such that the total intra-cluster variation (known as the total within-cluster variation) is minimized. The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid: \[W(C_k) = \sum_{x_i \in C_k}(x_i - \mu_k)^2,\] where \(x_i\) is a data point assigned to cluster \(C_k\) and \(\mu_k\) is the mean of the points in \(C_k\).
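To make this objective concrete, here is a minimal sketch using the scaled seed data from above and an arbitrary, randomly generated assignment of rows to k = 3 clusters (the assignment is purely illustrative); it computes \(W(C_k)\) for each cluster and the total that k-means tries to minimize over all possible assignments.
set.seed(1)
assignment <- sample(1:3, nrow(seed), replace = TRUE)  #hypothetical cluster labels
withinss <- sapply(1:3, function(k) {
  pts <- seed[assignment == k, , drop = FALSE]
  centroid <- colMeans(pts)                            #mu_k
  sum(sweep(pts, 2, centroid)^2)                       #W(C_k): squared distances to the centroid
})
sum(withinss)                                          #total within-cluster variation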
For clustering, one can rely on many kinds of distance measures, and the choice is a critical point: the distance measure determines how the similarity of two elements (x, y) is calculated and will strongly influence the clustering results. The classical distance measures are the Euclidean and Manhattan distances; a correlation-based distance is also commonly used. They are defined as follows:
Euclidean distance:
\(d_{euc}(x,y) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2} \tag{1}\)
Manhattan distance:
\(d_{man}(x,y) = \sum^n_{i=1}|(x_i - y_i)| \tag{2}\)
Pearson correlation distance:
\(d_{cor}(x, y) = 1 - \frac{\sum^n_{i=1}(x_i-\bar x)(y_i - \bar y)}{\sqrt{\sum^n_{i=1}(x_i-\bar x)^2\sum^n_{i=1}(y_i - \bar y)^2}} \tag{3}\)
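Before handing the data to dist() or get_dist(), it can help to compute these measures by hand for one pair of observations. A minimal sketch using the first two rows of the scaled seed data:
x <- seed[1, ]
y <- seed[2, ]
sqrt(sum((x - y)^2))   #Euclidean distance, equation (1)
sum(abs(x - y))        #Manhattan distance, equation (2)
1 - cor(x, y)          #Pearson correlation distance, equation (3)
#dist() reproduces the first two:
dist(rbind(x, y), method = "euclidean")
dist(rbind(x, y), method = "manhattan")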
To apply the k-means method and plot the results, we can use the kmeans function in R. It groups the data into a specified number of clusters (the centers argument). As mentioned before, the algorithm randomly selects k objects as the initial cluster centers to start the iteration, so the final result may vary with different initial centers. The nstart option allows the algorithm to attempt multiple initial configurations and report the best one; setting nstart (for example, nstart = 25) is recommended because it gives a more stable result.
library(factoextra)
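#Compute the pairwise distance matrix and visualize it as a heat map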
distance <- get_dist(seed)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
# K-Means Cluster Analysis
fit <- kmeans(seed, 2) #2 cluster solution
#Display the number of observations in each cluster
table(fit$cluster)
##
## 1 2
## 133 77
#Print the full kmeans output
fit
## K-means clustering with 2 clusters of sizes 133, 77
##
## Cluster means:
## area perimeter campactness length width asymmetry
## 1 -0.6588042 -0.662982 -0.3140631 -0.6519023 -0.6160407 0.08462825
## 2 1.1379346 1.145151 0.5424726 1.1260130 1.0640703 -0.14617607
## groovelength
## 1 -0.6639828
## 2 1.1468793
##
## Clustering vector:
## [1] 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2
## [38] 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 459.7972 196.2357
## (between_SS / total_SS = 55.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
size: The number of points in each cluster.
iter: The number of (outer) iterations.
ifault: integer: indicator of a possible algorithm problem – for experts.
To view the result, we can use fviz_cluster, which provides a nice graph of the clusters. Since we usually have more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.
fviz_cluster(fit, data = seed)
k3 <- kmeans(seed, centers = 3, nstart = 25)
k4 <- kmeans(seed, centers = 4, nstart = 25)
k5 <- kmeans(seed, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(fit, geom = "point", data = seed) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = seed) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = seed) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = seed) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
#See exactly which items are in the 1st group
seed[fit$cluster==1,]
#get cluster means for scaled data
aggregate(seed,by=list(fit$cluster),FUN=mean)
## Group.1 area perimeter campactness length width asymmetry
## 1 1 -0.6588042 -0.662982 -0.3140631 -0.6519023 -0.6160407 0.08462825
## 2 2 1.1379346 1.145151 0.5424726 1.1260130 1.0640703 -0.14617607
## groovelength
## 1 -0.6639828
## 2 1.1468793
#or alternatively, use the output of kmeans
fit$centers
## area perimeter campactness length width asymmetry
## 1 -0.6588042 -0.662982 -0.3140631 -0.6519023 -0.6160407 0.08462825
## 2 1.1379346 1.145151 0.5424726 1.1260130 1.0640703 -0.14617607
## groovelength
## 1 -0.6639828
## 2 1.1468793
We can determine the number of clusters by looking at the within-group sum of squares, as follows:
# Determine number of clusters
wss <- (nrow(seed)-1)*sum(apply(seed,2,var))
for (i in 2:12) wss[i] <- sum(kmeans(seed, centers = i)$withinss)
plot(1:12, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
Three clusters look reasonable, as adding more clusters does not decrease the within-group sum of squares significantly.
We can also use prediction strength to estimate the number of clusters. The prediction strength is defined according to Tibshirani and Walther (2005), who recommend choosing as the optimal number of clusters the largest number of clusters that leads to a prediction strength above 0.8 or 0.9. We will use the fpc package here.
This package also has the cluster.stats() function, which can calculate other cluster validity measures such as the average silhouette coefficient (between -1 and 1, the higher the better) or the Dunn index (between 0 and infinity, the higher the better):
install.packages("fpc")
library(fpc)
prediction.strength(seed, Gmin=2, Gmax=15, M=10,cutoff=0.8)
## Prediction strength
## Clustering method: kmeans
## Maximum number of clusters: 15
## Resampled data sets: 10
## Mean pred.str. for numbers of clusters: 1 0.8463348 0.7633325 0.5050631 0.4261617 0.3849686 0.3290805 0.3273578 0.248726 0.2319 0.190414 0.1412374 0.1282486 0.1226263 0.1192309
## Cutoff value: 0.8
## Largest number of clusters better than cutoff: 2
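#Compute the average silhouette width and Dunn index for k = 2, ..., 15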
d = dist(seed, method = "euclidean")
result = matrix(nrow = 14, ncol = 3)
for (i in 2:15){
cluster_result = kmeans(seed, i)
clusterstat=cluster.stats(d, cluster_result$cluster)
result[i-1,1]=i
result[i-1,2]=clusterstat$avg.silwidth
result[i-1,3]=clusterstat$dunn
}
plot(result[,c(1,2)], type="l", ylab = 'silhouette width', xlab = 'number of clusters')
plot(result[,c(1,3)], type="l", ylab = 'dunn index', xlab = 'number of clusters')
Obviously, different indices may suggest different numbers of clusters.
If you want more options, the NbClust package provides 30 indices for determining the optimal number of clusters in a data set.
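A minimal sketch of how NbClust could be applied to the scaled seed data (assuming the package is installed; the arguments shown are its standard ones):
#install.packages("NbClust")
library(NbClust)
nb <- NbClust(seed, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")
#Tally how many indices voted for each number of clusters
table(nb$Best.nc[1, ])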
For more sophisticated methods, see for example this blog, or these course notes.
This article on Cross Validated provides a great illustration of the situations when k-means would fail.
You can also do hierarchical clustering in R easily:
#Ward's method for hierarchical clustering
#Calculate the distance matrix
seed.dist=dist(seed)
#Obtain clusters using the Wards method
seed.hclust=hclust(seed.dist, method="ward.D")  # "ward" has been renamed "ward.D" in newer versions of R
plot(seed.hclust)
#Cut dendrogram at the 3 clusters level and obtain cluster membership
seed.3clust = cutree(seed.hclust,k=3)
#See exactly which items are in the third group
seed[seed.3clust==3,]
#get cluster means for raw data
#Centroid Plot against 1st 2 discriminant functions
#Load the fpc library needed for plotcluster function
library(fpc)
plotcluster(seed, seed.3clust)
A newer clustering approach, model-based clustering, treats the clustering problem as maximizing a normal mixture model. Generating an observation under this model consists of first picking a centroid (the mean of a multivariate normal distribution) at random and then adding some noise (variance). If the noise is normally distributed with the same variance in every direction, this procedure results in clusters of spherical shape. Model-based clustering assumes that the data were generated by such a model and tries to recover the original model from the data. The recovered model then defines the clusters and the assignment of observations to clusters. It can be thought of as a generalization of \(K\)-means.
The model “recovering” process is done via the Expectation-Maximization (EM) algorithm, an iterative approach to maximizing the likelihood of a statistical model that contains unobserved (latent) variables.
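Concretely, under the standard Gaussian-mixture formulation (stated here in general form, not in mclust's specific parameterization), EM maximizes the likelihood \[L(\theta) = \prod_{i=1}^{n}\sum_{k=1}^{K}\pi_k\,\phi(x_i \mid \mu_k, \Sigma_k),\] where \(\pi_k\) are the mixing proportions, \(\phi(\cdot \mid \mu_k, \Sigma_k)\) is the multivariate normal density with mean \(\mu_k\) and covariance \(\Sigma_k\), and the unobserved variables are the cluster memberships of the observations.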
One obvious advantage of the approach is that we can treat the question “How Many Clusters?” as a model selection problem.
For a detailed description of the method and the package, see 1 and 2.
install.packages('mclust')
library(mclust)
mclust_result = Mclust(seed)
summary(mclust_result)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEV (ellipsoidal, equal volume and shape) model with 4 components:
##
## log-likelihood n df BIC ICL
## 416.1655 210 122 179.9839 172.5306
##
## Clustering table:
## 1 2 3 4
## 50 49 45 66
The BIC reported by the package is the negative of the 'usual' BIC that we discussed for regression models, so here we are trying to maximize the BIC.
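For example, the BIC values for all candidate models and numbers of components are stored in the fitted object; mclust_result$BIC and the what argument of plot() are standard parts of the mclust interface:
#Inspect BIC across models and numbers of components
mclust_result$BIC
plot(mclust_result, what = "BIC")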
plot(mclust_result)
Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. The arules package in R provides the basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules.
For an introduction to arules and additional case studies see Introduction to arules
For the reference manual for the package, see the arules package manual.
The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.
library(arules)
data("Groceries")
#run summary report
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
summary() displays the most frequent items in the data set, information about the transaction length distribution, and shows that the data set includes extended item information (item labels together with two levels of category). This additional information can be used for analyzing the data set.
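For example, the extended item information summarized above can be retrieved with itemInfo():
#Look at the extended item information (labels and category levels)
head(itemInfo(Groceries))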
To find the very long transactions, we can use size() and select transactions containing more than 30 items.
#
x = Groceries[size(Groceries) > 30]
inspect(x)
## items
## [1] {frankfurter,
## sausage,
## liver loaf,
## ham,
## chicken,
## beef,
## citrus fruit,
## tropical fruit,
## root vegetables,
## other vegetables,
## whole milk,
## butter,
## curd,
## yogurt,
## whipped/sour cream,
## beverages,
## soft cheese,
## hard cheese,
## cream cheese ,
## mayonnaise,
## domestic eggs,
## rolls/buns,
## roll products ,
## flour,
## pasta,
## margarine,
## specialty fat,
## sugar,
## soups,
## skin care,
## hygiene articles,
## candles}
To see which items are important in the data set, we can use itemFrequencyPlot(). To reduce the number of items, we only plot the item frequency for items with a support greater than 10%. The label size is reduced with the parameter cex.names.
#
itemFrequencyPlot(Groceries, support = 0.1, cex.names=0.8)
Use the apriori() algorithm to find all rules (the default association type for apriori()) with a minimum support of 0.3% and a minimum confidence of 0.5.
# Run the apriori algorithm
basket_rules <- apriori(Groceries,parameter = list(sup = 0.003, conf = 0.5,target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 29
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [136 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [421 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(basket_rules)
## set of 421 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 5 281 128 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.325 4.000 5.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.003050 Min. :0.5000 Min. :1.957 Min. : 30.00
## 1st Qu.:0.003355 1st Qu.:0.5238 1st Qu.:2.135 1st Qu.: 33.00
## Median :0.003965 Median :0.5556 Median :2.426 Median : 39.00
## Mean :0.004754 Mean :0.5715 Mean :2.522 Mean : 46.75
## 3rd Qu.:0.005186 3rd Qu.:0.6094 3rd Qu.:2.766 3rd Qu.: 51.00
## Max. :0.022267 Max. :0.8857 Max. :5.804 Max. :219.00
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.003 0.5
Recall from class:
Support: The support of an itemset X, written \(supp(X)\), is defined as the proportion of transactions in the data set that contain the itemset. In the Groceries data, the support for these rules is relatively low, with a maximum support of no more than 3%.
Confidence: The confidence of a rule is defined as \(conf(X \Rightarrow Y) = supp(X \cup Y)/supp(X)\). For example, the rule {milk, bread} \(\Rightarrow\) {butter} has a confidence of 0.5, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the conditional probability \(P(Y|X)\), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS. Association rules are required to satisfy both a minimum support and a minimum confidence constraint at the same time.
Lift: Lift is a popular measure for filtering or ranking the rules found. The lift of a rule is defined as \(lift(X \Rightarrow Y) = supp(X \cup Y)/(supp(X)supp(Y))\). Lift can be interpreted as the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS and the RHS. Greater lift values indicate stronger associations.
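These measures are stored alongside the mined rules and can be retrieved directly. A quick sketch using quality() and interestMeasure(); the conviction measure below is just one illustrative choice:
#Quality measures (support, confidence, lift, count) attached to each rule
head(quality(basket_rules))
#Additional measures can be computed on demand, e.g. conviction
head(interestMeasure(basket_rules, measure = "conviction", transactions = Groceries))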
# Check the generated rules using inspect
inspect(head(basket_rules))
## lhs rhs support confidence
## [1] {cereals} => {whole milk} 0.003660397 0.6428571
## [2] {specialty cheese} => {other vegetables} 0.004270463 0.5000000
## [3] {rice} => {other vegetables} 0.003965430 0.5200000
## [4] {rice} => {whole milk} 0.004677173 0.6133333
## [5] {baking powder} => {whole milk} 0.009252669 0.5229885
## [6] {root vegetables,herbs} => {other vegetables} 0.003863752 0.5507246
## lift count
## [1] 2.515917 36
## [2] 2.584078 42
## [3] 2.687441 39
## [4] 2.400371 46
## [5] 2.046793 91
## [6] 2.846231 38
As is typical for association rule mining, the number of rules found is huge. To analyze these rules, subset() can, for example, be used to produce separate subsets of rules. Now find the subset of rules whose length (lhs + rhs) is greater than 4.
#Basket rules of size greater than 4
inspect(subset(basket_rules, size(basket_rules)>4))
## lhs rhs support confidence lift count
## [1] {citrus fruit,
## tropical fruit,
## root vegetables,
## other vegetables} => {whole milk} 0.003152008 0.7045455 2.757344 31
## [2] {citrus fruit,
## tropical fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.003152008 0.8857143 4.577509 31
## [3] {citrus fruit,
## tropical fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.003152008 0.6326531 5.804238 31
## [4] {citrus fruit,
## root vegetables,
## other vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 5.183004 31
## [5] {tropical fruit,
## root vegetables,
## other vegetables,
## yogurt} => {whole milk} 0.003558719 0.7142857 2.795464 35
## [6] {tropical fruit,
## root vegetables,
## whole milk,
## yogurt} => {other vegetables} 0.003558719 0.6250000 3.230097 35
## [7] {tropical fruit,
## root vegetables,
## other vegetables,
## whole milk} => {yogurt} 0.003558719 0.5072464 3.636128 35
Find the subset of rules with lift greater than 5:
inspect(subset(basket_rules, lift>5))
## lhs rhs support confidence lift count
## [1] {citrus fruit,
## tropical fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.003152008 0.6326531 5.804238 31
## [2] {citrus fruit,
## root vegetables,
## other vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 5.183004 31
Now find the subset of rules that have yogurt on the right-hand side. Here we require the lift measure to exceed 3.5.
yogurt.rhs <- subset(basket_rules, subset = rhs %in% "yogurt" & lift>3.5)
Now inspect the subset of rules:
inspect(yogurt.rhs)
## lhs rhs support confidence lift count
## [1] {whipped/sour cream,
## cream cheese } => {yogurt} 0.003355363 0.5238095 3.754859 33
## [2] {root vegetables,
## cream cheese } => {yogurt} 0.003762074 0.5000000 3.584184 37
## [3] {tropical fruit,
## curd} => {yogurt} 0.005287239 0.5148515 3.690645 52
## [4] {other vegetables,
## whole milk,
## cream cheese } => {yogurt} 0.003457041 0.5151515 3.692795 34
## [5] {tropical fruit,
## whole milk,
## curd} => {yogurt} 0.003965430 0.6093750 4.368224 39
## [6] {tropical fruit,
## other vegetables,
## butter} => {yogurt} 0.003050330 0.5555556 3.982426 30
## [7] {tropical fruit,
## whole milk,
## butter} => {yogurt} 0.003355363 0.5409836 3.877969 33
## [8] {tropical fruit,
## whole milk,
## whipped/sour cream} => {yogurt} 0.004372140 0.5512821 3.951792 43
## [9] {tropical fruit,
## root vegetables,
## other vegetables,
## whole milk} => {yogurt} 0.003558719 0.5072464 3.636128 35
Now find the subset of rules that have meat on the left-hand side, here requiring the lift to exceed 1.5.
meat.lhs <- subset(basket_rules, subset = lhs %in% "meat" & lift>1.5)
Now inspect the subset of rules:
inspect(meat.lhs)
## lhs rhs support confidence lift
## [1] {meat,root vegetables} => {whole milk} 0.003152008 0.62 2.426462
## count
## [1] 31
We can use the arulesViz package to visualize the rules. For a more complete introduction, see http://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf.
install.packages('arulesViz')
library('arulesViz')
plot(basket_rules)
The plot function has an interactive mode for you to inspect individual rules:
plot(basket_rules, interactive=TRUE)
Graph-based visualization can be used for very small sets of rules. Here the vertices represent items, shown for the 10 rules with the highest lift:
plot(head(sort(basket_rules, by="lift"), 10), method = "graph")
The package comes with an approach to cluster association rules and itemsets:
plot(basket_rules, method="grouped")
For the iris data, simply use the iris dataset in R. Please do a clustering analysis with this data, ignoring the Species variable.
data(iris)
For the Cincinnati Zoo data, use the following code to load the transaction data for association rule mining. The as() function coerces the dataset into the transactions data type required for association rule mining.
TransFood <- read.csv('https://xiaoruizhu.github.io/Data-Mining-R/data/food_4_association.csv')
TransFood <- TransFood[, -1]
# Find out elements that are not equal to 0 or 1 and change them to 1.
Others <- which(!(as.matrix(TransFood) ==1 | as.matrix(TransFood) ==0), arr.ind=T )
TransFood[Others] <- 1
TransFood <- as(as.matrix(TransFood), "transactions")
Load the data for clustering:
Food_by_month <- read.csv('https://xiaoruizhu.github.io/Data-Mining-R/data/qry_Food_by_Month.csv')