
4.1. Unsupervised Learning for Clustering and Dimensionality Reduction

4.1.1. Learning objectives

  1. Distinguish supervised from unsupervised learning

  2. Understand the necessity of reducing dimensionality for big datasets

    • Know at least two approaches for dimensionality reduction

    • Understand the steps of PCA

  3. Distinguish clustering from supervised classification

    • Know how to implement the Kmeans algorithm and select the number of clusters

    • Know how to implement Gaussian mixture models and how to detect anomalies

4.1.2. Unsupervised Learning (UL) vs. Supervised Learning (SL)

UL

  • Trains models without labeled output data.

  • Discovers patterns, groupings, or structures in data.

  • Includes techniques like clustering, dimensionality reduction, and density estimation.

  • Useful when specific output labels are unknown or unavailable.

  • Example: Land Cover type identification via clustering.

SL

  • Trains models with labeled examples for predictions.

  • Classifies data into predefined categories.

  • Common tasks: classification, regression, object detection.

  • Requires labeled data for model training.

  • Example: Predicting NDVI values from labeled training data with multiple predictor variables.

Semi-SL

  • Combines elements of both unsupervised and supervised learning.

  • Uses a small portion of labeled data and a larger amount of unlabeled data.

  • Aims to leverage labeled data for improved model performance.

  • Suited for scenarios with limited labeled data availability.

  • Example: Tracking animal species using both labeled and unlabeled camera trap images.

4.1.3. Dimension Reduction

Dimensionality reduction is the technique of reducing the number of features in a dataset while retaining essential information. It aids data visualization, analysis, and enhances machine learning model performance.

Dimension reduction figure

The figure above illustrates the mechanism of dimension reduction: the left panel shows the original Swiss roll dataset, the middle panel shows the data squashed by projecting it onto a plane, and the right panel shows the unrolled dataset.

The Curse of Dimensionality

  • The curse of dimensionality refers to the challenges and issues that arise as data dimensionality increases, leading to increased sparsity, computational complexity, and decreased efficiency in various machine learning and data analysis tasks.

  • High dimensional datasets are likely very sparse, with training instances far away from each other, which increases the risk of overfitting.

Main Methods for Dimension Reduction

  • Projection Methods: Linearly transform data to lower dimensions, e.g. Principal Component Analysis (PCA)

  • Manifold Learning: Captures underlying data structure for nonlinear relationships, e.g. t-Distributed Stochastic Neighbor Embedding (t-SNE), Locally Linear Embedding (LLE)
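Below is a minimal sketch (not part of the original lesson) contrasting the two approaches on the Swiss roll from the figure above, using scikit-learn's PCA and LocallyLinearEmbedding; the dataset size, neighbor count, and random seeds are illustrative choices.

```python
# Contrast a linear projection (PCA) with manifold learning (LLE) on the Swiss roll.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

# Projection: squash the 3D roll onto the plane of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Manifold learning: "unroll" the sheet by preserving local neighborhoods.
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                               random_state=42).fit_transform(X)

print(X_pca.shape, X_lle.shape)  # both (1000, 2), but with very different geometry
```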


Principal Component Analysis (PCA)

  • PCA is a dimensionality reduction method that identifies and preserves the most important information in a dataset while reducing its dimensionality. It employs the Singular Value Decomposition (SVD) technique to find the principal components.

  • PCA identifies orthogonal axes (principal components) that maximize variance in the data. It projects the data onto these components, effectively reducing dimensionality.

  • Other PCA techniques, such as Incremental PCA, Randomized PCA, and Kernel PCA, offer variations and optimizations for specific use cases.

  • It’s best to choose the number of dimensions that captures a significant percentage of the variance, e.g., 95%. The objective is to minimize information loss while reducing dimensionality (see the sketch after this list).

  • PCA, along with other dimension reduction techniques, is applied in environmental sciences to uncover patterns such as El Niño and other modes of variance, reduce collinearity among variables, identify pollution sources in ambient air and soil, compare water quality across watersheds, and quantify phenotypic variation among species based on multiple measurements, aiding a wide range of environmental analyses and modeling.
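A minimal sketch of the 95%-variance rule described above, assuming scikit-learn; the synthetic correlated data and the scaling step are placeholders for your own features and preprocessing.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data with correlated features (rank ~5), standing in for a real dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales

# A float in (0, 1) asks scikit-learn for the smallest number of components
# whose cumulative explained variance reaches that ratio (here 95%).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(pca.n_components_)                       # number of components kept
print(pca.explained_variance_ratio_.sum())     # >= 0.95
```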

4.1.4. Clustering

Clustering vs. Classification

  • Clustering groups data into clusters based on similarities, without predefined labels. Classification classifies data into predefined categories using labeled examples. Clustering discovers patterns and relationships, while classification predicts labels based on known outcomes.

  • Clustering assists in categorizing ecosystems based on characteristics, supporting climate change and stressor analysis. It also aids in urban planning by identifying built environment patterns.

K-means clustering

  • K-means clustering divides data into K clusters by minimizing the sum of squared distances between points and cluster centroids. It iteratively assigns points to the nearest centroid and updates centroids.

  • K-Means Clustering Steps:

    1. Initialization: Randomly select K initial centroids.

    2. Assignment: Assign each point to the nearest centroid.

    3. Update Centroids: Recalculate centroids based on cluster points.

    4. Reassignment: Repeat steps 2 and 3 until convergence.

    5. Convergence: Stop when centroids stabilize or after a set number of iterations.

    6. Clusters: Resulting centroids define distinct clusters in the data.

  • The K-means algorithm is fast and scalable but struggles with clusters of varying sizes, densities, and nonspherical shapes. Inertia quantifies cluster quality, while the silhouette score and the “elbow” method help determine the ideal number of clusters (k); see the sketch after this list.

  • Accelerated K-Means and Mini-batch K-Means are advanced variants of K-Means clustering designed to enhance the algorithm’s speed and efficiency, particularly for large datasets.
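A minimal sketch, assuming scikit-learn and synthetic blob data, of fitting K-Means for several values of k and comparing inertia (the elbow method) with the silhouette score:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated blobs as a stand-in for real features.
X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")

# The "elbow" in inertia and the peak silhouette score should both point toward k=4 here.
```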

Gaussian mixture models (GMM)

  • GMM is a probabilistic model for representing data as a mixture of several Gaussian distributions.

  • It employs the Expectation-Maximization (EM) algorithm to categorize instances into either hard clusters (clearly defined) or soft clusters (with estimated probabilities).

  • In GMM, the likelihood function measures how plausible a particular set of parameters is given the observed data. The parameter values that maximize the likelihood function are generally the most likely values for the model parameters.

  • Selecting the appropriate number of clusters involves minimizing information criteria like the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), considering factors such as the number of dimensions, instances, and clusters.

  • GMM supports anomaly detection by modeling the distribution of normal data: anomalies are identified as data points with low probability (density) under the fitted model, i.e., outliers (see the sketch below). Several other algorithms are also available for anomaly detection, such as Fast-MCD, Isolation Forest, and Local Outlier Factor.
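A minimal sketch, assuming scikit-learn and synthetic data, of choosing the number of GMM components with BIC/AIC and flagging low-likelihood points as anomalies; the 2% threshold is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

# Fit GMMs with different numbers of components and keep the one with the lowest BIC.
models = [GaussianMixture(n_components=k, random_state=42).fit(X)
          for k in range(1, 7)]
best = min(models, key=lambda m: m.bic(X))
print(f"components chosen by BIC: {best.n_components}, AIC: {best.aic(X):.1f}")

# Anomaly detection: flag points whose log-density falls in the lowest 2%.
log_density = best.score_samples(X)
threshold = np.percentile(log_density, 2)
anomalies = X[log_density < threshold]
print("flagged anomalies:", len(anomalies))
```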

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  • DBSCAN is a clustering algorithm that identifies clusters as continuous regions of high data density. It excels when clusters have varying densities and are separated by lower-density regions. DBSCAN categorizes data points into core instances (well within dense areas) and border instances (on cluster fringes) based on their proximity to other data points. Anomalies are data points that are neither core instances nor have nearby core instances. This method is robust to outliers, can handle clusters of different shapes, and is useful for a variety of applications.
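A minimal sketch, assuming scikit-learn and the two-moons toy dataset, showing how DBSCAN labels noise points (label -1) and handles nonspherical clusters; the eps and min_samples values are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: nonspherical clusters that K-Means handles poorly.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                      # -1 marks noise / anomalies

print("clusters found:", len(set(labels) - {-1}))
print("noise points:", (labels == -1).sum())
```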

Examples of Dimensionality Reduction and Unsupervised Learning in Environmental Science🌄:

Dimensionality reduction:

PCA helps identify pollution sources in ambient air and soil, compare water quality in different watersheds, and quantify phenotypic variations amongst species based on multiple measurements.

Clustering:

Clustering can be used to group different types of ecosystems together, based on their characteristics such as vegetation, wildlife, and climate. This information can be used to understand how different ecosystems respond to climate change and other environmental stressors.

Cluster analysis can be used to identify built environment patterns.

GMMs could be applied to stratified lake water samples to identify distinct water quality profiles based on their chemical composition.

DBSCAN can identify zones where air pollution exceeds certain thresholds, helping decision-makers take appropriate measures.

Tips and Tricks 💡

Exercise 1: Dimensionality Reduction

Does PCA always reduce model training time and increase model performance?

  1. Load the MNIST dataset and split it into a training set and a test set;

  2. Train a Random Forest classifier on the dataset,
    Time how long it takes,
    Evaluate the resulting model on the test set.

  3. Train a Logistic Regression classifier on the dataset,
    Time how long it takes,
    Evaluate the resulting model on the test set.

  4. Use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%.

  5. Train a new Random Forest classifier on the reduced dataset. Was training much faster? Was the performance better?

  6. Train a new Logistic Regression classifier on the reduced dataset. Was training much faster? Was the performance better?
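One possible setup for this exercise (a sketch, not the official solution); it assumes MNIST is fetched from OpenML via scikit-learn and uses default hyperparameters, so exact timings and scores will vary.

```python
import time
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# 1. Load MNIST and use the conventional 60,000/10,000 train/test split.
mnist = fetch_openml("mnist_784", as_frame=False)
X_train, X_test = mnist.data[:60000], mnist.data[60000:]
y_train, y_test = mnist.target[:60000], mnist.target[60000:]

def time_and_score(model, Xtr, Xte):
    """Fit the model, returning (training time in seconds, test accuracy)."""
    start = time.time()
    model.fit(Xtr, y_train)
    return time.time() - start, model.score(Xte, y_test)

# 2-3. Baselines on the full 784-dimensional data.
print("RF  :", time_and_score(RandomForestClassifier(random_state=42), X_train, X_test))
print("LogR:", time_and_score(LogisticRegression(max_iter=1000), X_train, X_test))

# 4-6. Reduce to the components explaining 95% of the variance, then retrain.
pca = PCA(n_components=0.95).fit(X_train)
X_train_r, X_test_r = pca.transform(X_train), pca.transform(X_test)
print("RF  +PCA:", time_and_score(RandomForestClassifier(random_state=42), X_train_r, X_test_r))
print("LogR+PCA:", time_and_score(LogisticRegression(max_iter=1000), X_train_r, X_test_r))
```

Comparing the four runs answers the exercise question: PCA does not always speed up training or improve accuracy; the effect depends on the model.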

Exercise 2: Clustering

How to choose the number of clusters when using K-means?

  1. Load the MNIST dataset;

  2. Time one K-Means training;

  3. Use PCA for dimension reduction;

  4. Train K-Means with multiple ks;

  5. Calculate the performance of different k - silhouette score;

  6. Visualize silhouette score & inertia against k;

  7. Visualize clusters.
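A possible sketch for this exercise, assuming MNIST from OpenML, PCA to 50 components, and a 10,000-image subsample to keep the silhouette computation manageable; these choices are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Load a subsample of MNIST and reduce its dimensionality before clustering.
X = fetch_openml("mnist_784", as_frame=False).data[:10000]
X_reduced = PCA(n_components=50, random_state=42).fit_transform(X)

ks = list(range(5, 16))
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_reduced)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_reduced, km.labels_))

# Visualize inertia (elbow method) and silhouette score against k.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(ks, inertias, "o-")
axes[0].set(xlabel="k", ylabel="inertia")
axes[1].plot(ks, silhouettes, "o-")
axes[1].set(xlabel="k", ylabel="silhouette score")
plt.show()
```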

Exercise 3: Application to Dynamical Regime Identification - Tracking the impact of global Heating on Ocean Regimes (THOR)

Reading: Tracking global Heating with Ocean Regimes (THOR), a transparent machine learning (ML) method that explains the governing mechanisms of the Atlantic Meridional Overturning Circulation (AMOC) in the North Atlantic.

  • Transparent ML

  • Dynamics contributing to AMOC changes under a global heating model

The paper demonstrates practical applications of machine learning techniques, including clustering, to analyze environmental data. Specifically, clustering is employed to categorize distinct dynamical regimes in the North Atlantic Circulation, as depicted in Figure 4. The authors utilize an Ensemble MLP trained with labeled data obtained through unsupervised machine learning, emphasizing six dynamical regimes related to oceanic transport and circulation patterns in the North Atlantic.

Exercise: Step 1 of THOR - Identify 2D dynamical regimes

  1. Data

    • Reduced to 5 dimensions: (1) curlA, (2) curlB, (3) curlTau, (4) curlCori, (5) BPT;

    • i.e., with shape (360, 720, 5): 5 layers of 360 × 720 images, where each pixel/cell has 5 features;

    • pixels/cells to be clustered into groups based on these features.

  2. Use Xarray to format data.

  3. Use K-Means to cluster the 5D training data;

  4. Visualize identified clusters.
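A hypothetical sketch of this workflow; the file name thor_terms.nc, the coordinate names lat/lon, and the choice of six clusters are assumptions for illustration, not taken from the original exercise.

```python
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

ds = xr.open_dataset("thor_terms.nc")            # hypothetical file holding the 5 terms
features = ["curlA", "curlB", "curlTau", "curlCori", "BPT"]

# Stack the (lat, lon) grid into one "cell" dimension -> array of shape (n_cells, 5).
stacked = (ds[features]
           .to_array("feature")
           .stack(cell=("lat", "lon"))           # coordinate names are assumptions
           .transpose("cell", "feature"))
X = stacked.values
valid = ~np.isnan(X).any(axis=1)                 # drop land / missing cells

# Cluster the valid cells into 6 dynamical regimes (an illustrative choice of k).
labels = np.full(X.shape[0], np.nan)
km = KMeans(n_clusters=6, n_init=10, random_state=42)
labels[valid] = km.fit_predict(StandardScaler().fit_transform(X[valid]))

# Unstack the labels back onto the lat/lon grid and plot the identified regimes.
regimes = stacked.isel(feature=0).copy(data=labels).unstack("cell")
regimes.plot()
plt.show()
```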


Estimating the Circulation and Climate of the Ocean (ECCO) dynamical regimes: geographical expanse, area-averaged term magnitudes, and learning contributions. Figure credit: (Sonnewald et al., 2019)