At the very beginning, no one knows what sizes would accommodate the market. However, with some information about customers' height, girth, hip, weight, and so on, customers can be grouped into sub-groups. Shirts can then be designed around the average of each sub-group, and labels such as XS, S, and L can be assigned to those groups. In other words, this problem is equivalent to finding clusters based on the available information such that the variation within clusters is small and the variation between clusters is large.
However, we do not have such data, so we will start with the seeds data instead.
This tutorial illustrates implementations of clustering and association rule mining in R.
The examined group comprises kernels belonging to three different varieties of wheat: Kama, Rosa, and Canadian, with 70 elements each. A description of the dataset can be viewed at https://archive.ics.uci.edu/ml/datasets/seeds.
Our goal is to cluster these observations (kernels) into groups so that we can study them further.
seed = read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt', header=F)
seed = seed[,1:7]
colnames(seed) = c("area", "perimeter","campactness", "length", "width", "asymmetry", "groovelength")
Scale the data so that each column has zero mean and unit variance:
seed <- scale(seed)
head(seed)
## area perimeter campactness length width asymmetry
## [1,] 0.14175904 0.214948819 6.045733e-05 0.30349301 0.1413640 -0.9838010
## [2,] 0.01116136 0.008204153 4.274938e-01 -0.16822270 0.1969616 -1.7839036
## [3,] -0.19160873 -0.359341919 1.438945e+00 -0.76181710 0.2075516 -0.6658882
## [4,] -0.34626388 -0.474200066 1.036904e+00 -0.68733567 0.3187467 -0.9585276
## [5,] 0.44419577 0.329806966 1.371233e+00 0.06650665 0.8032397 -1.5597684
## [6,] -0.16067770 -0.267455401 1.019976e+00 -0.54740087 0.1413640 -0.8235144
## groovelength
## [1,] -0.3826631
## [2,] -0.9198156
## [3,] -1.1863572
## [4,] -1.2270506
## [5,] -0.4742231
## [6,] -0.9198156
The basic idea of k-means clustering is to define clusters such that the total intra-cluster variation (known as the total within-cluster variation) is minimized. The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid: \[W(C_k) = \sum_{x_i \in C_k}(x_i - \mu_k)^2,\] where \(x_i\) is a data point assigned to cluster \(C_k\) and \(\mu_k\) is the mean of the points in \(C_k\).
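To make this objective concrete, here is a minimal sketch using the scaled seed data from above and an arbitrary, randomly generated assignment of rows to k = 3 clusters (the assignment is purely illustrative); it computes \(W(C_k)\) for each cluster and the total that k-means tries to minimize over all possible assignments.
set.seed(1)
assignment <- sample(1:3, nrow(seed), replace = TRUE)  #hypothetical cluster labels
withinss <- sapply(1:3, function(k) {
  pts <- seed[assignment == k, , drop = FALSE]
  centroid <- colMeans(pts)                            #mu_k
  sum(sweep(pts, 2, centroid)^2)                       #W(C_k): squared distances to the centroid
})
sum(withinss)                                          #total within-cluster variation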
For clustering, one can rely on many kinds of distance measures, and the choice is a critical point: the distance measure determines how the similarity of two elements (x, y) is calculated and will strongly influence the clustering results. The classical distance measures are the Euclidean and Manhattan distances; a correlation-based distance is also commonly used. They are defined as follows:
Euclidean distance:
\(d_{euc}(x,y) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2} \tag{1}\)
Manhattan distance:
\(d_{man}(x,y) = \sum^n_{i=1}|(x_i - y_i)| \tag{2}\)
Pearson correlation distance:
\(d_{cor}(x, y) = 1 - \frac{\sum^n_{i=1}(x_i-\bar x)(y_i - \bar y)}{\sqrt{\sum^n_{i=1}(x_i-\bar x)^2\sum^n_{i=1}(y_i - \bar y)^2}} \tag{3}\)
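Before handing the data to dist() or get_dist(), it can help to compute these measures by hand for one pair of observations. A minimal sketch using the first two rows of the scaled seed data:
x <- seed[1, ]
y <- seed[2, ]
sqrt(sum((x - y)^2))   #Euclidean distance, equation (1)
sum(abs(x - y))        #Manhattan distance, equation (2)
1 - cor(x, y)          #Pearson correlation distance, equation (3)
#dist() reproduces the first two:
dist(rbind(x, y), method = "euclidean")
dist(rbind(x, y), method = "manhattan")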
To apply the k-means method and plot the results, we can use the kmeans function in R. It groups the data into a specified number of clusters (the centers argument). As mentioned before, the algorithm randomly selects k objects as the initial cluster centers to start the iteration, so the final result may vary with different initial centers. The nstart option allows the algorithm to attempt multiple initial configurations and report the best one; setting nstart (for example, nstart = 25) is recommended because it gives a more stable result.
library(factoextra)
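#Compute the pairwise distance matrix and visualize it as a heat map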
distance <- get_dist(seed)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
# K-Means Cluster Analysis
fit <- kmeans(seed, 2) #2 cluster solution
#Display the number of observations in each cluster
table(fit$cluster)
##
## 1 2
## 133 77
#Print the full kmeans output
fit
## K-means clustering with 2 clusters of sizes 133, 77
##
## Cluster means:
## area perimeter campactness length width asymmetry
## 1 -0.6588042 -0.662982 -0.3140631 -0.6519023 -0.6160407 0.08462825
## 2 1.1379346 1.145151 0.5424726 1.1260130 1.0640703 -0.14617607
## groovelength
## 1 -0.6639828
## 2 1.1468793
##
## Clustering vector:
## [1] 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2
## [38] 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 459.7972 196.2357
## (between_SS / total_SS = 55.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
size: The number of points in each cluster.
iter: The number of (outer) iterations.
ifault: integer: indicator of a possible algorithm problem – for experts.
To view the result, we can use fviz_cluster, which provides a nice graph of the clusters. Since we usually have more than two dimensions (variables), fviz_cluster performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.
fviz_cluster(fit, data = seed)
k3 <- kmeans(seed, centers = 3, nstart = 25)
k4 <- kmeans(seed, centers = 4, nstart = 25)
k5 <- kmeans(seed, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(fit, geom = "point", data = seed) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = seed) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = seed) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = seed) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
#See exactly which items are in the 1st group
seed[fit$cluster==1,]
#get cluster means for scaled data
aggregate(seed,by=list(fit$cluster),FUN=mean)
## Group.1 area perimeter campactness length width asymmetry
## 1 1 -0.6588042 -0.662982 -0.3140631 -0.6519023 -0.6160407 0.08462825
## 2 2 1.1379346 1.145151 0.5424726 1.1260130 1.0640703 -0.14617607
## groovelength
## 1 -0.6639828
## 2 1.1468793
#or alternatively, use the output of kmeans
fit$centers
## area perimeter campactness length width asymmetry
## 1 -0.6588042 -0.662982 -0.3140631 -0.6519023 -0.6160407 0.08462825
## 2 1.1379346 1.145151 0.5424726 1.1260130 1.0640703 -0.14617607
## groovelength
## 1 -0.6639828
## 2 1.1468793
We can determine the number of clusters by looking at the within-group sum of squares, as follows:
# Determine number of clusters
wss <- (nrow(seed)-1)*sum(apply(seed,2,var))
for (i in 2:12) wss[i] <- sum(kmeans(seed, centers = i)$withinss)
plot(1:12, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
Three clusters look reasonable, as adding more clusters does not decrease the within-group sum of squares significantly.
We can also use prediction strength to estimate the number of clusters. The prediction strength is defined according to Tibshirani and Walther (2005), who recommend choosing as the optimal number of clusters the largest number of clusters that leads to a prediction strength above 0.8 or 0.9. We will use the fpc package here.
This package also has the cluster.stats() function, which can calculate other cluster validity measures such as the average silhouette coefficient (between -1 and 1, the higher the better) or the Dunn index (between 0 and infinity, the higher the better):
install.packages("fpc")
library(fpc)
prediction.strength(seed, Gmin=2, Gmax=15, M=10,cutoff=0.8)
## Prediction strength
## Clustering method: kmeans
## Maximum number of clusters: 15
## Resampled data sets: 10
## Mean pred.str. for numbers of clusters: 1 0.8463348 0.7633325 0.5050631 0.4261617 0.3849686 0.3290805 0.3273578 0.248726 0.2319 0.190414 0.1412374 0.1282486 0.1226263 0.1192309
## Cutoff value: 0.8
## Largest number of clusters better than cutoff: 2
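#Compute the average silhouette width and Dunn index for k = 2, ..., 15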
d = dist(seed, method = "euclidean")
result = matrix(nrow = 14, ncol = 3)
for (i in 2:15){
cluster_result = kmeans(seed, i)
clusterstat=cluster.stats(d, cluster_result$cluster)
result[i-1,1]=i
result[i-1,2]=clusterstat$avg.silwidth
result[i-1,3]=clusterstat$dunn
}
plot(result[,c(1,2)], type="l", ylab = 'silhouette width', xlab = 'number of clusters')
plot(result[,c(1,3)], type="l", ylab = 'dunn index', xlab = 'number of clusters')
Obviously, different indices may suggest different numbers of clusters.
If you want more options, the NbClust package provides 30 indices for determining the optimal number of clusters in a data set.
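A minimal sketch of how NbClust could be applied to the scaled seed data (assuming the package is installed; the arguments shown are its standard ones):
#install.packages("NbClust")
library(NbClust)
nb <- NbClust(seed, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")
#Tally how many indices voted for each number of clusters
table(nb$Best.nc[1, ])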
For more sophisticated methods, see for example this blog, or these course notes.
This article on Cross Validated provides a great illustration of the situations when k-means would fail.
You can also do hierarchical clustering in R easily:
#Ward's method for hierarchical clustering
#Calculate the distance matrix
seed.dist=dist(seed)
#Obtain clusters using the Wards method
seed.hclust=hclust(seed.dist, method="ward.D")  # "ward" has been renamed "ward.D" in newer versions of R
plot(seed.hclust)
#Cut dendrogram at the 3 clusters level and obtain cluster membership
seed.3clust = cutree(seed.hclust,k=3)
#See exactly which items are in the third group
seed[seed.3clust==3,]
#get cluster means for raw data
#Centroid Plot against 1st 2 discriminant functions
#Load the fpc library needed for plotcluster function
library(fpc)
plotcluster(seed, seed.3clust)
A newer clustering approach, model-based clustering, treats the clustering problem as maximizing a normal mixture model. Generating an observation under this model consists of first picking a centroid (the mean of a multivariate normal distribution) at random and then adding some noise (variance). If the noise is normally distributed with the same variance in every direction, this procedure results in clusters of spherical shape. Model-based clustering assumes that the data were generated by such a model and tries to recover the original model from the data. The recovered model then defines the clusters and the assignment of observations to clusters. It can be thought of as a generalization of \(K\)-means.
The model “recovering” process is done via the Expectation-Maximization (EM) algorithm, an iterative approach to maximizing the likelihood of a statistical model that contains unobserved (latent) variables.
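Concretely, under the standard Gaussian-mixture formulation (stated here in general form, not in mclust's specific parameterization), EM maximizes the likelihood \[L(\theta) = \prod_{i=1}^{n}\sum_{k=1}^{K}\pi_k\,\phi(x_i \mid \mu_k, \Sigma_k),\] where \(\pi_k\) are the mixing proportions, \(\phi(\cdot \mid \mu_k, \Sigma_k)\) is the multivariate normal density with mean \(\mu_k\) and covariance \(\Sigma_k\), and the unobserved variables are the cluster memberships of the observations.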
One obvious advantage of the approach is that we can treat the question “How Many Clusters?” as a model selection problem.
For a detailed description of the method and the package, see 1 and 2.
install.packages('mclust')
library(mclust)
mclust_result = Mclust(seed)
summary(mclust_result)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEV (ellipsoidal, equal volume and shape) model with 4 components:
##
## log-likelihood n df BIC ICL
## 416.1655 210 122 179.9839 172.5306
##
## Clustering table:
## 1 2 3 4
## 50 49 45 66
The BIC reported by the package is the negative of the 'usual' BIC that we discussed for regression models, so here we are trying to maximize the BIC.
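For example, the BIC values for all candidate models and numbers of components are stored in the fitted object; mclust_result$BIC and the what argument of plot() are standard parts of the mclust interface:
#Inspect BIC across models and numbers of components
mclust_result$BIC
plot(mclust_result, what = "BIC")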
plot(mclust_result)
Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. The arules package in R provides the basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules.
For an introduction to arules and additional case studies see Introduction to arules
For the reference manual for the package, see the arules package manual.
The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.
library(arules)
data("Groceries")
#run summary report
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
summary() displays the most frequent items in the data set, information about the transaction length distribution, and shows that the data set includes extended item information (item labels together with two levels of category). This additional information can be used for analyzing the data set.
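For example, the extended item information summarized above can be retrieved with itemInfo():
#Look at the extended item information (labels and category levels)
head(itemInfo(Groceries))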
To find the very long transactions, we can use size() and select transactions containing more than 30 items.
#
x = Groceries[size(Groceries) > 30]
inspect(x)
## items
## [1] {frankfurter,
## sausage,
## liver loaf,
## ham,
## chicken,
## beef,
## citrus fruit,
## tropical fruit,
## root vegetables,
## other vegetables,
## whole milk,
## butter,
## curd,
## yogurt,
## whipped/sour cream,
## beverages,
## soft cheese,
## hard cheese,
## cream cheese ,
## mayonnaise,
## domestic eggs,
## rolls/buns,
## roll products ,
## flour,
## pasta,
## margarine,
## specialty fat,
## sugar,
## soups,
## skin care,
## hygiene articles,
## candles}
To see which items are important in the data set, we can use itemFrequencyPlot(). To reduce the number of items, we only plot the item frequency for items with a support greater than 10%. The label size is reduced with the parameter cex.names.
#
itemFrequencyPlot(Groceries, support = 0.1, cex.names=0.8)
Use the apriori() algorithm to find all rules (the default association type for apriori()) with a minimum support of 0.3% and a minimum confidence of 0.5.
# Run the apriori algorithm
basket_rules <- apriori(Groceries,parameter = list(sup = 0.003, conf = 0.5,target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 29
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [136 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [421 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(basket_rules)
## set of 421 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 5 281 128 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.325 4.000 5.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.003050 Min. :0.5000 Min. :1.957 Min. : 30.00
## 1st Qu.:0.003355 1st Qu.:0.5238 1st Qu.:2.135 1st Qu.: 33.00
## Median :0.003965 Median :0.5556 Median :2.426 Median : 39.00
## Mean :0.004754 Mean :0.5715 Mean :2.522 Mean : 46.75
## 3rd Qu.:0.005186 3rd Qu.:0.6094 3rd Qu.:2.766 3rd Qu.: 51.00
## Max. :0.022267 Max. :0.8857 Max. :5.804 Max. :219.00
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.003 0.5
Recall from class:
Support: The support of an itemset X, written \(supp(X)\), is defined as the proportion of transactions in the data set that contain the itemset. In the Groceries data, the support for these rules is relatively low, with a maximum support of no more than 3%.
Confidence: The confidence of a rule is defined as \(conf(X \Rightarrow Y) = supp(X \cup Y)/supp(X)\). For example, the rule {milk, bread} \(\Rightarrow\) {butter} has a confidence of 0.5, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the conditional probability \(P(Y|X)\), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS. Association rules are required to satisfy both a minimum support and a minimum confidence constraint at the same time.
Lift: Lift is a popular measure for filtering or ranking the rules found. The lift of a rule is defined as \(lift(X \Rightarrow Y) = supp(X \cup Y)/(supp(X)supp(Y))\). Lift can be interpreted as the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS and the RHS. Greater lift values indicate stronger associations.
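These measures are stored alongside the mined rules and can be retrieved directly. A quick sketch using quality() and interestMeasure(); the conviction measure below is just one illustrative choice:
#Quality measures (support, confidence, lift, count) attached to each rule
head(quality(basket_rules))
#Additional measures can be computed on demand, e.g. conviction
head(interestMeasure(basket_rules, measure = "conviction", transactions = Groceries))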
# Check the generated rules using inspect
inspect(head(basket_rules))
## lhs rhs support confidence
## [1] {cereals} => {whole milk} 0.003660397 0.6428571
## [2] {specialty cheese} => {other vegetables} 0.004270463 0.5000000
## [3] {rice} => {other vegetables} 0.003965430 0.5200000
## [4] {rice} => {whole milk} 0.004677173 0.6133333
## [5] {baking powder} => {whole milk} 0.009252669 0.5229885
## [6] {root vegetables,herbs} => {other vegetables} 0.003863752 0.5507246
## lift count
## [1] 2.515917 36
## [2] 2.584078 42
## [3] 2.687441 39
## [4] 2.400371 46
## [5] 2.046793 91
## [6] 2.846231 38
As is typical for association rule mining, the number of rules found is huge. To analyze these rules, subset() can, for example, be used to produce separate subsets of rules. Now find the subset of rules whose length (lhs + rhs) is greater than 4.
#Basket rules of size greater than 4
inspect(subset(basket_rules, size(basket_rules)>4))
## lhs rhs support confidence lift count
## [1] {citrus fruit,
## tropical fruit,
## root vegetables,
## other vegetables} => {whole milk} 0.003152008 0.7045455 2.757344 31
## [2] {citrus fruit,
## tropical fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.003152008 0.8857143 4.577509 31
## [3] {citrus fruit,
## tropical fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.003152008 0.6326531 5.804238 31
## [4] {citrus fruit,
## root vegetables,
## other vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 5.183004 31
## [5] {tropical fruit,
## root vegetables,
## other vegetables,
## yogurt} => {whole milk} 0.003558719 0.7142857 2.795464 35
## [6] {tropical fruit,
## root vegetables,
## whole milk,
## yogurt} => {other vegetables} 0.003558719 0.6250000 3.230097 35
## [7] {tropical fruit,
## root vegetables,
## other vegetables,
## whole milk} => {yogurt} 0.003558719 0.5072464 3.636128 35
Find the subset of rules with lift greater than 5:
inspect(subset(basket_rules, lift>5))
## lhs rhs support confidence lift count
## [1] {citrus fruit,
## tropical fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.003152008 0.6326531 5.804238 31
## [2] {citrus fruit,
## root vegetables,
## other vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 5.183004 31
Now find the subset of rules that have yogurt on the right-hand side. Here we require the lift measure to exceed 3.5.
yogurt.rhs <- subset(basket_rules, subset = rhs %in% "yogurt" & lift>3.5)
Now inspect the subset of rules:
inspect(yogurt.rhs)
## lhs rhs support confidence lift count
## [1] {whipped/sour cream,
## cream cheese } => {yogurt} 0.003355363 0.5238095 3.754859 33
## [2] {root vegetables,
## cream cheese } => {yogurt} 0.003762074 0.5000000 3.584184 37
## [3] {tropical fruit,
## curd} => {yogurt} 0.005287239 0.5148515 3.690645 52
## [4] {other vegetables,
## whole milk,
## cream cheese } => {yogurt} 0.003457041 0.5151515 3.692795 34
## [5] {tropical fruit,
## whole milk,
## curd} => {yogurt} 0.003965430 0.6093750 4.368224 39
## [6] {tropical fruit,
## other vegetables,
## butter} => {yogurt} 0.003050330 0.5555556 3.982426 30
## [7] {tropical fruit,
## whole milk,
## butter} => {yogurt} 0.003355363 0.5409836 3.877969 33
## [8] {tropical fruit,
## whole milk,
## whipped/sour cream} => {yogurt} 0.004372140 0.5512821 3.951792 43
## [9] {tropical fruit,
## root vegetables,
## other vegetables,
## whole milk} => {yogurt} 0.003558719 0.5072464 3.636128 35
Now find the subset of rules that have meat on the left-hand side, here requiring the lift to exceed 1.5.
meat.lhs <- subset(basket_rules, subset = lhs %in% "meat" & lift>1.5)
Now inspect the subset of rules:
inspect(meat.lhs)
## lhs rhs support confidence lift
## [1] {meat,root vegetables} => {whole milk} 0.003152008 0.62 2.426462
## count
## [1] 31
We can use the arulesViz package to visualize the rules. For a more complete introduction, see http://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf.
install.packages('arulesViz')
library('arulesViz')
plot(basket_rules)
The plot function has an interactive mode for you to inspect individual rules:
plot(basket_rules, interactive=TRUE)
Graph-based visualization can be used for very small sets of rules. Here the vertices represent items, shown for the 10 rules with the highest lift:
plot(head(sort(basket_rules, by="lift"), 10), method = "graph")
The package comes with an approach to cluster association rules and itemsets:
plot(basket_rules, method="grouped")
For the iris data, simply use the iris dataset in R. Please do a clustering analysis with this data, ignoring the Species variable.
data(iris)
For the Cincinnati Zoo data, use the following code to load the transaction data for association rule mining. The as() function coerces the dataset into the transactions data type required for association rule mining.
TransFood <- read.csv('https://xiaoruizhu.github.io/Data-Mining-R/data/food_4_association.csv')
TransFood <- TransFood[, -1]
# Find out elements that are not equal to 0 or 1 and change them to 1.
Others <- which(!(as.matrix(TransFood) ==1 | as.matrix(TransFood) ==0), arr.ind=T )
TransFood[Others] <- 1
TransFood <- as(as.matrix(TransFood), "transactions")
Load the data for clustering:
Food_by_month <- read.csv('https://xiaoruizhu.github.io/Data-Mining-R/data/qry_Food_by_Month.csv')