Statistical Methods for Data Science

Dimensionality reduction techniques go beyond PCA, offering diverse methods to simplify complex datasets. From non-linear approaches like t-SNE and UMAP to linear methods like LDA and ICA, these tools suit different data structures and analysis goals.

Neural network-based techniques like autoencoders provide powerful alternatives for reducing dimensions. These methods, along with matrix factorization and manifold learning, expand our toolkit for handling high-dimensional data effectively.

Manifold Learning Techniques

t-SNE and UMAP: Non-linear Dimensionality Reduction

  • t-SNE (t-Distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique
    • Preserves local structure of high-dimensional data in low-dimensional space
    • Converts high-dimensional Euclidean distances between data points into conditional probabilities representing similarities
    • Minimizes the Kullback-Leibler divergence between the joint probabilities in high-dimensional and low-dimensional space using gradient descent
  • UMAP (Uniform Manifold Approximation and Projection) is another non-linear dimensionality reduction method
    • Constructs a high-dimensional graph representation of the data and optimizes a low-dimensional graph to be as structurally similar as possible
    • Assumes the data is uniformly distributed on a Riemannian manifold and tries to learn the manifold's local metric
    • Typically faster than t-SNE and better at preserving global structure (e.g., the relative positions of clusters); see the usage sketch after this list
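
A minimal usage sketch of both methods, assuming scikit-learn is available for t-SNE and that the third-party umap-learn package is installed for UMAP; the dataset and parameter values are illustrative, not tuned settings.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional handwritten-digit images as example high-dimensional data
X, y = load_digits(return_X_y=True)

# t-SNE: perplexity balances attention to local vs. global neighborhoods
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X)

# UMAP (from the umap-learn package): n_neighbors plays a role similar to
# perplexity; min_dist controls how tightly points pack in the embedding
import umap
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0)
X_umap = reducer.fit_transform(X)

print(X_tsne.shape, X_umap.shape)  # both (1797, 2)
```

Both embeddings depend on the random seed and neighborhood parameters, so runs should only be compared with fixed settings.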

Multidimensional Scaling (MDS)

  • MDS is a technique used for visualizing the level of similarity of individual cases in a dataset
  • Aims to find a low-dimensional representation of the data where the distances between points are preserved as well as possible
    • Classical MDS: Uses an eigendecomposition of the double-centered distance matrix to preserve pairwise Euclidean distances as well as possible in the chosen dimension (equivalent to PCA for Euclidean inputs)
    • Non-metric MDS: Preserves the rank order of the pairwise distances (used for ordinal data)
  • Stress function measures the discrepancy between the distances in the low-dimensional space and the original dissimilarities, e.g., Kruskal's stress-1: $\sqrt{\sum_{i<j}(d_{ij} - \hat{d}_{ij})^2 / \sum_{i<j} d_{ij}^2}$
  • Applications include visualizing the relationships between objects (cities on a map) or individuals (based on survey responses); a short sketch follows this list
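
A short sketch of metric and non-metric MDS with scikit-learn; note that sklearn's MDS estimator is the iterative SMACOF algorithm rather than eigendecomposition-based classical MDS, and the dataset choice here is purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X, _ = load_iris(return_X_y=True)

# Metric MDS: tries to preserve the actual pairwise distances
metric_mds = MDS(n_components=2, metric=True, random_state=0)
X_metric = metric_mds.fit_transform(X)

# Non-metric MDS: preserves only the rank order of the dissimilarities
nonmetric_mds = MDS(n_components=2, metric=False, random_state=0)
X_nonmetric = nonmetric_mds.fit_transform(X)

# The fitted stress_ value quantifies the remaining distance distortion
print(metric_mds.stress_, nonmetric_mds.stress_)
```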

Linear Dimensionality Reduction Methods

Supervised and Unsupervised Linear Methods

  • Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction method
    • Finds a linear combination of features that best separates the classes
    • Projects the data onto a lower-dimensional space while maximizing the separation between classes
    • Assumes the data is normally distributed and the classes have equal covariance matrices
  • Independent Component Analysis (ICA) is an unsupervised method for separating a multivariate signal into additive subcomponents
    • Assumes the subcomponents are non-Gaussian and statistically independent
    • Finds a linear transformation that minimizes the statistical dependence between the components
    • Applications include blind source separation (the cocktail party problem) and feature extraction; see the combined sketch after this list
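
A combined sketch using scikit-learn: LDA projects labeled data onto class-separating directions, while FastICA unmixes a toy two-signal "cocktail party" mixture. The source signals are a made-up illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA (supervised): uses the class labels y to find separating directions;
# at most (n_classes - 1) = 2 components for the 3-class iris data
X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# ICA (unsupervised): unmix two toy source signals from their linear mixtures
t = np.linspace(0, 8, 2000)
sources = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
mixing = np.array([[1.0, 0.5],
                   [0.5, 1.0]])
mixed = sources @ mixing.T            # observed "microphone" signals

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mixed)  # estimated sources (up to sign/scale/order)

print(X_lda.shape, recovered.shape)   # (150, 2) (2000, 2)
```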

Matrix Factorization Techniques

  • Non-negative Matrix Factorization (NMF) is a dimensionality reduction technique that factorizes a non-negative matrix into two non-negative matrices
    • Finds a low-rank approximation of the original matrix: $V \approx WH$, where $V$, $W$, and $H$ are non-negative
    • Interpretable parts-based representation: Each column of $W$ represents a basis vector, and each column of $H$ represents the coefficients
    • Applications include image processing (e.g., learning parts of faces for facial recognition), text mining (topic modeling), and recommender systems; a short sketch follows this list
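
A minimal sketch with scikit-learn's NMF. Note that sklearn stores samples in the rows of $V$, so the basis vectors appear as the rows of `components_` ($H$) and $W$ holds the per-sample coefficients, i.e., the transpose of the column convention above. The random matrix stands in for real non-negative data such as term counts or pixel intensities.

```python
import numpy as np
from sklearn.decomposition import NMF

# A small non-negative "document-term" style matrix V (6 samples x 5 features)
rng = np.random.default_rng(0)
V = rng.random((6, 5))

# Factorize V ~= W H with rank 2; W and H are constrained to be non-negative
model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
W = model.fit_transform(V)   # per-sample coefficients (6 x 2)
H = model.components_        # basis vectors / "parts" (2 x 5)

# Reconstruction error of the low-rank approximation
print(np.linalg.norm(V - W @ H))
```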

Neural Network-based Dimensionality Reduction

Autoencoders

  • Autoencoders are neural networks trained to reconstruct their input data
    • Consist of an encoder that maps the input to a lower-dimensional latent space and a decoder that reconstructs the input from the latent representation
    • Bottleneck layer in the middle has a lower dimensionality than the input, forcing the network to learn a compressed representation
  • Types of autoencoders:
    • Undercomplete autoencoders: Latent space has lower dimensionality than the input, used for dimensionality reduction (sketched after this list)
    • Regularized autoencoders: Add regularization terms to the loss function to learn more robust representations (sparse, contractive, or denoising autoencoders)
    • Variational autoencoders (VAEs): Latent space is constrained to follow a prior distribution (usually Gaussian), enabling generation of new samples
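
A minimal undercomplete autoencoder sketch in PyTorch, assuming 64-dimensional inputs and an 8-dimensional bottleneck; the random training data is a placeholder for real inputs, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Undercomplete autoencoder: 64-d input -> 8-d bottleneck -> 64-d reconstruction
class Autoencoder(nn.Module):
    def __init__(self, input_dim=64, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 32), nn.ReLU(),
            nn.Linear(32, latent_dim),          # bottleneck layer
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # compressed latent representation
        return self.decoder(z)   # reconstruction of the input

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.rand(256, 64)  # placeholder data; substitute real inputs here
for epoch in range(100):
    recon = model(X)
    loss = loss_fn(recon, X)     # reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

latent = model.encoder(X)        # the 8-d reduced representation
```

After training, the encoder alone serves as the dimensionality reducer; a denoising variant would instead feed corrupted inputs and reconstruct the clean ones.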