- - After generating random data, we scale both the original dataset and the random dataset.
- - Then, we remove the last column, Species, from the iris dataset.
- - We can visualize both datasets with the `pairs` function or in the Principal Component space.
- - Correlation = linear dependence.
- - We generate the random data so that the variables are independent in every sense (linearly, quadratically, etc.), i.e. stochastically independent.
- - Apart from visual inspection, we want to define a statistical measure of clustering tendency.
- - Hopkins statistic
- - It's a measure of cluster tendency which takes values in $[0,1]$.
- - The smaller the measure, the more clustered our data are.
- - 1. X is our observed data matrix of dimension $n\times d$ .
- 2. We select randomly (without replacement) $m<n$ rows from our dataset X
- 3. We generate a new dataset $Y$ of dimension $m \times d$
- 4. We generate a data set $Z$ , of dimension $m\times d$ , by a d-variate uniform distribution on the hyper-rectangle with $d$ sides of length equal to the $d$ ranges of the original variables.
- 5. For each unit $i = 1, \ldots, m$ define two distances:
- * $o_i$ the distance of row $i$ in $Y$ from its nearest neighbor in $X$
- * $r_i$ the distance of row $i$ in $Z$ from its nearest neighbor in $X$ .
- 6. Finally, we calculate $H$ as:
- $$
- H = \frac{\sum_{i=1}^m o_i}{\sum_{i=1}^m r_i + \sum_{i=1}^m o_i }
- $$
- - The $H$ index is not stable, as it depends on the random draws (both the sampled rows and the uniformly distributed reference data).
- - We can stabilize it by computing it many times (e.g. 100) and averaging the results.
- - In R, the Hopkins statistic is included in the `clustertend` package (`hopkins()`)
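- - The steps above can be sketched in Python (a minimal, hypothetical implementation; the function name `hopkins` and its arguments are my own, not the `clustertend` API):

```python
import numpy as np

def hopkins(X, m, rng=None):
    """H = sum(o_i) / (sum(r_i) + sum(o_i)); small values suggest clustering."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Step 2: sample m < n rows of X without replacement -> Y
    Y = X[rng.choice(n, size=m, replace=False)]
    # Step 4: uniform data Z on the hyper-rectangle spanned by X's ranges
    Z = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))

    def nn_dist(P, exclude_self=False):
        # distance of each row of P to its nearest neighbour in X
        D = np.linalg.norm(P[:, None, :] - X[None, :, :], axis=2)
        if exclude_self:
            D[D == 0] = np.inf  # a sampled row of X coincides with itself
        return D.min(axis=1)

    o = nn_dist(Y, exclude_self=True)  # step 5: o_i
    r = nn_dist(Z)                     # step 5: r_i
    return o.sum() / (r.sum() + o.sum())  # step 6
```

As noted above, averaging many repetitions (e.g. 100) stabilizes the estimate.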
- -
- - *For discrete data, a bubble plot is more accurate than a standard scatter plot. The size of each point is related to the number of times that pair of values is repeated.*
- - There is another method which is still visual, but more specific:
- - #TODO: Generalize the idea of the hopkins statistic for discrete data
- -
- - Visual Assessment of cluster Tendency (VAT)
- - Three steps:
- 1. Compute the dissimilarity matrix (DM) between the units in the dataset.
- 2. Reorder the DM so that similar units are close to one another. This creates an Ordered Dissimilarity Matrix (ODM)
- 3. We visualize the dissimilarity matrix with different colors.
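- - A hedged Python sketch of the three steps (`vat_order` is my own name; the reordering here is the Prim-style nearest-neighbour ordering commonly used for VAT):

```python
import numpy as np

def vat_order(X):
    # Step 1: dissimilarity matrix (Euclidean distances between units)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = D.shape[0]
    # Step 2: reorder units so similar ones are adjacent,
    # starting from one end of the farthest-apart pair
    order = [int(np.unravel_index(D.argmax(), D.shape)[0])]
    remaining = set(range(n)) - {order[0]}
    while remaining:
        rem = sorted(remaining)
        sub = D[np.ix_(order, rem)]
        order.append(rem[sub.min(axis=0).argmin()])  # closest remaining unit
        remaining.discard(order[-1])
    odm = D[np.ix_(order, order)]  # Ordered Dissimilarity Matrix
    return odm, order
```

Step 3 is the plot, e.g. `plt.imshow(odm)`: dark blocks along the diagonal suggest clusters.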
- - If after assessing you realize that there is cluster tendency, the next step is determining the optimal number of clusters.
- - Direct methods:
- optimize a criterion, such as the WSS or the average silhouette.
- - Statistical testing methods:
- compare evidence against a null hypothesis of no clusters in our data (gap statistic).
- - Elbow method
- - Run the k-means algorithm for $K = 1, 2, \ldots, K_{max}$ and compute the WSS for each value of $K$.
- - Typically, the WSS decreases as the number of clusters increases.
- - You can decide the number of clusters by looking for the elbow in the plot. The elbow is subjective, not uniquely determined, and it is not always clearly visible in the graphical representation.
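- - An illustrative Python sketch of the elbow computation (all names assumed; a tiny Lloyd's k-means with a deterministic farthest-first initialization stands in for a library routine):

```python
import numpy as np

def kmeans_wss(X, k, n_iter=50):
    # farthest-first initialization (deterministic), then Lloyd's updates
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # within-cluster sum of squares for this K
    return float((dist.min(axis=1) ** 2).sum())

# elbow method: plot WSS against K and look for the bend, e.g.
# wss = [kmeans_wss(X, k) for k in range(1, 9)]
```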
- - Another algorithm is based on the silhouette width. Same approach as before: for each $K$, run the clustering, compute the silhouette width of each unit, and average them. Plot the average silhouette width against $K$ and choose the number of clusters that maximizes it.
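- - A hedged numpy sketch of the average silhouette width for a given labelling (my own helper; in R one would use e.g. `cluster::silhouette` or `fviz_nbclust`):

```python
import numpy as np

def avg_silhouette(X, labels):
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = len(X)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, same].mean()              # cohesion: mean distance within own cluster
        b = min(D[i, labels == c].mean()   # separation: nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```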
- - The gap statistic
- - Compare each WSS with the case of no clusters (typically a uniform distribution).
- - 1. Cluster the data for $K = 1, \ldots, K_{max}$ and compute the corresponding $WSS_K$
- 2. For each $b = 1, \ldots, B$ (e.g. $B = 500$), generate a reference data set from a d-variate uniform distribution, cluster it for $K = 1, \ldots, K_{max}$, and compute the corresponding $WSS_{Kb}$
(if $B = 500$ and $K_{max} = 10$, we have 5000 WSS values to compute.)
- - 3. For each value of $K = 1, \ldots, K_{max}$, compute the *estimated gap statistic*:
$$
\text{Gap}(K) = \frac{1}{B}\sum_{b=1}^{B} \log(WSS_{Kb}) - \log(WSS_K)
$$
i.e. we average the log within sum of squares over the $B$ repetitions for that $K$ and subtract the observed value (the log function maps positive values to $\mathbb{R}$).
- - We want the gap statistic to be high: the higher the gap statistic, the stronger the evidence of clusters in our data.
- - If it's close to zero, the WSS is the same as the one computed on the uniform distribution, so our data has no clusters.
- - We then select K as the smallest K such that:
$$GAP(K) \geq GAP(K+1) - s_{K+1}$$ where $s_{K+1}$ is the standard error of $GAP(K+1)$. Basically, we choose the first relative maximum (the first peak) in the graphical representation.
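- - A Python sketch of step 3 and the selection rule, assuming the per-$K$ WSS values have already been computed by some clustering routine (all names hypothetical; the $\sqrt{1 + 1/B}$ correction in the standard error follows Tibshirani et al.):

```python
import numpy as np

def gap_statistic(wss, wss_ref):
    """wss: shape (Kmax,), observed WSS per K; wss_ref: shape (B, Kmax),
    WSS per K on each of the B uniform reference datasets."""
    log_ref = np.log(wss_ref)
    gap = log_ref.mean(axis=0) - np.log(wss)
    # standard error of the gap, with the simulation-error correction
    s = log_ref.std(axis=0) * np.sqrt(1 + 1 / len(wss_ref))
    return gap, s

def choose_k(gap, s):
    # smallest K (1-based) with Gap(K) >= Gap(K+1) - s_{K+1}
    for k in range(len(gap) - 1):
        if gap[k] >= gap[k + 1] - s[k + 1]:
            return k + 1
    return len(gap)
```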
- - In R:
- - The `fviz_nbclust()` function in the `factoextra` package computes elbow, silhouette and gap statistic.
- - The `NbClust()` function in the `NbClust` package provides 26 different indices for deciding the optimal number of clusters.
- - $B$ is the number of bootstrap replications (a Monte Carlo method).
- - The last step is cluster validation statistics:
- - We define measures to evaluate the quality of our partitions.
- - Internal measures for cluster validation
- - They focus on different aspects:
- 1. Compactness (cohesion within the clusters)
- 2. Separation (between the clusters)
- 3. Connectivity (amount of observations allocated in the wrong clusters)
- -
- - Most indices used for internal clustering validation combine compactness and separation measures like this:
$$
\frac{\alpha \times \text{separation}}{\beta \times \text{compactness}}
$$ where $\alpha$ and $\beta$ are weights; we want this ratio to be as large as possible.
- - The Dunn index can be computed like this:
- $$
- D = \frac{\text{min.separation}}{\text{max.diameter}}
- $$ where:
- * *min.separation* is the minimum distance between units belonging to different clusters;
- * *max.diameter* is the maximal intra-cluster distance;
- * both $\alpha$ and $\beta$ are equal to $1$ .
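- - A minimal numpy sketch of the Dunn index as defined above (the helper name `dunn_index` is my own):

```python
import numpy as np

def dunn_index(X, labels):
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    # max.diameter: largest intra-cluster distance
    max_diam = max(D[np.ix_(labels == c, labels == c)].max() for c in clusters)
    # min.separation: smallest distance between units in different clusters
    min_sep = min(D[np.ix_(labels == a, labels == b)].min()
                  for a in clusters for b in clusters if a < b)
    return min_sep / max_diam
```

Larger values indicate compact, well-separated clusters.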
- - External cluster validation's aim is to compare the identified clusters to an external reference (classification).
- - We want to maximize the *corrected Rand index*, which varies from $-1$ (no agreement) to $1$ (perfect agreement).
- - We could also try to minimize the $VI$ index, which is a distance between partitions (*Meila variation index*).
- - #TODO: study the `classError()` function in the mclust library.
- - While the portion of misclassified elements can be interpreted alone, most other indices (like VI and Rand) only make sense when compared to other clustering methods!
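- - A hedged numpy sketch of the corrected (adjusted) Rand index computed from the contingency table of two partitions (function name my own; in R one would use e.g. `mclust::adjustedRandIndex`):

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(labels_a)
    # contingency table between the two partitions
    ca, cb = np.unique(labels_a), np.unique(labels_b)
    table = np.array([[np.sum((labels_a == a) & (labels_b == b)) for b in cb]
                      for a in ca])
    comb2 = lambda x: x * (x - 1) / 2  # "n choose 2", elementwise
    sum_ij = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()
    sum_b = comb2(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(n)   # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions (up to label names) score $1$; agreement at chance level scores about $0$.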