Exploratory Data Analysis is a crucial step in understanding your dataset before diving into modeling. It involves visualizing data, calculating statistics, and identifying patterns to gain insights and guide feature engineering decisions.
In this part of Data Preparation and Feature Engineering, you'll learn techniques to uncover data distributions, relationships, and quality issues. These skills will help you make informed choices about data cleaning, transformation, and feature selection for your machine learning projects.
Data Visualization and Interpretation
Visualization Techniques for Data Distribution
- Histograms and kernel density plots visualize the distribution of continuous variables, revealing patterns such as skewness, modality, and outliers
- Box plots and violin plots provide insights into the spread, central tendency, and potential outliers of numerical variables, allowing easy comparison across categories
- Scatter plots and pair plots visualize relationships between two or more continuous variables, helping identify correlations and potential feature interactions
- Heat maps visualize correlation matrices and identify patterns in high-dimensional data, particularly in feature selection processes
- Time series plots analyze temporal data, revealing trends, seasonality, and potential anomalies in sequential observations
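Before reaching for a plotting library, the same distributional facts a histogram or box plot shows can be computed directly. A minimal NumPy sketch on a synthetic right-skewed feature (the variable name and distribution parameters are illustrative, not from the source):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical right-skewed feature, e.g. transaction amounts
amounts = rng.lognormal(mean=3.0, sigma=0.6, size=1_000)

# Histogram counts reveal shape (skewness, modality) numerically
counts, bin_edges = np.histogram(amounts, bins=20)

# Box-plot statistics: quartiles and whisker fences
q1, median, q3 = np.percentile(amounts, [25, 50, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]

print(f"median={median:.1f}, IQR={iqr:.1f}, n_outliers={outliers.size}")
```

Passing `amounts` to `matplotlib.pyplot.hist` or `boxplot` would render the same information graphically.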
Advanced Visualization and Dimensionality Reduction
- Dimensionality reduction techniques (PCA, t-SNE) create 2D or 3D visualizations of high-dimensional data, aiding cluster identification and feature importance analysis
- Interactive visualizations enable dynamic exploration of data relationships and patterns
- Parallel coordinates plots visualize high-dimensional data and identify clusters or outliers
- Treemaps display hierarchical data structures and relative proportions of categories
- Force-directed graphs visualize network data and complex relationships between entities
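The 2-D projections these visualizations rely on can be computed with plain NumPy. A sketch of PCA via SVD on synthetic high-dimensional data (shapes, seed, and noise level are illustrative; in practice `sklearn.decomposition.PCA` does the same job):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical high-dimensional data: 200 samples, 10 features,
# with most variance concentrated in two latent directions
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords_2d = Xc @ Vt[:2].T          # 2-D coordinates for plotting
explained = S**2 / np.sum(S**2)    # variance ratio per component

print(f"first two components explain {explained[:2].sum():.1%} of variance")
```

Scatter-plotting `coords_2d` is the usual way to look for clusters in a dataset too wide to visualize directly.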
Statistical Measures and Analysis
Measures of Central Tendency and Dispersion
- Measures of central tendency (mean, median, mode) provide insights into the typical or average values in a dataset, offering different perspectives on the data distribution
- Measures of dispersion quantify the spread of data points, which is crucial for understanding data variability and identifying potential outliers
- Variance: average squared deviation from the mean
- Standard deviation: square root of variance, in same units as original data
- Range: difference between maximum and minimum values
- Interquartile range (IQR): difference between 75th and 25th percentiles
- Skewness measures describe asymmetry of data distributions (positive skew: right tail, negative skew: left tail)
- Kurtosis measures indicate presence of heavy tails in data distributions (leptokurtic: heavy tails, platykurtic: light tails)
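All of the measures above can be computed with NumPy alone; skewness and excess kurtosis are written out as standardized moments here (sample data and seed are illustrative; `scipy.stats.skew` and `scipy.stats.kurtosis` provide the same quantities):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=10_000)  # right-skewed sample

mean, median = x.mean(), np.median(x)
var = x.var(ddof=1)                  # sample variance
std = np.sqrt(var)                   # same units as x
value_range = x.max() - x.min()
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

z = (x - mean) / std
skewness = np.mean(z**3)             # > 0 for a right tail
excess_kurtosis = np.mean(z**4) - 3  # > 0 for heavy tails

print(f"skew={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")
```

Note the classic signature of positive skew: the mean is pulled above the median by the right tail.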
Correlation and Hypothesis Testing
- Correlation coefficients quantify the strength and direction of relationships between variables, essential for feature selection and multicollinearity detection
- Pearson correlation: measures linear relationships between continuous variables
- Spearman correlation: assesses monotonic relationships, robust to outliers
- Kendall's tau: measures ordinal association between variables
- Covariance matrices provide insights into the joint variability of multiple variables, crucial for understanding feature interactions and for dimensionality reduction techniques
- Robust statistics offer alternatives when dealing with datasets containing outliers or non-normal distributions
- Median absolute deviation: robust measure of variability
- Huber's M-estimator: robust alternative to mean for location parameter estimation
- Statistical hypothesis tests assess significance of observed patterns and relationships in data
- t-tests: compare means of two groups
- ANOVA: analyze variance between multiple groups
- Chi-square tests: assess independence between categorical variables
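The difference between Pearson and Spearman can be demonstrated in a few lines: Spearman is simply Pearson applied to ranks, so it picks up a monotonic but non-linear relationship that depresses the Pearson coefficient. A NumPy-only sketch (synthetic data; the simple ranking helper ignores ties, which is fine for continuous values — `scipy.stats.spearmanr` handles ties properly):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = np.exp(x) + 0.01 * rng.normal(size=500)  # monotonic but non-linear

def rank(a):
    # Simple ranking (no tie handling; adequate for continuous data)
    r = np.empty_like(a)
    r[np.argsort(a)] = np.arange(len(a), dtype=a.dtype)
    return r

pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(rank(x), rank(y))[0, 1]

# Spearman captures the monotonic link better than Pearson here
print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

When the two coefficients disagree sharply like this, it usually signals a non-linear relationship or influential outliers worth inspecting.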
Data Quality and Bias
Missing Data and Outliers
- Missing data patterns and mechanisms must be identified and addressed to prevent biased model training and inaccurate predictions
- MCAR (Missing Completely at Random): missingness independent of observed and unobserved data
- MAR (Missing at Random): missingness depends only on observed data
- MNAR (Missing Not at Random): missingness depends on unobserved data
- Outliers and anomalies should be detected using statistical methods and domain knowledge to determine their impact on model performance
- Z-score method: identifies points beyond a certain number of standard deviations from the mean
- Interquartile Range (IQR) method: flags points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
- DBSCAN clustering: identifies outliers as points not belonging to any cluster
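The Z-score and IQR methods from the list are easy to apply side by side; a NumPy sketch on synthetic data with a few injected outliers (values and seed are illustrative). Note that the IQR fences are tighter than a 3-standard-deviation cutoff here, so the IQR method flags at least as many points:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(50, 5, size=995),
                    [120, 130, -40, 150, 140]])  # injected outliers

# Z-score method: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 3

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print(f"z-score flags: {z_outliers.sum()}, IQR flags: {iqr_outliers.sum()}")
```

A caveat worth remembering: extreme outliers inflate the mean and standard deviation themselves, which is why the rank-based IQR method is often the more robust first pass.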
Class Imbalance and Data Bias
- Class imbalance in classification problems can bias models toward the majority class, requiring dedicated techniques to address it
- Oversampling: increase instances of minority class (SMOTE, ADASYN)
- Undersampling: reduce instances of majority class (random undersampling, Tomek links)
- Synthetic data generation: create artificial samples to balance classes
- Multicollinearity among features can impact model interpretability and stability, necessitating feature selection or dimensionality reduction techniques
- Selection bias in data collection or sampling processes can lead to models that do not generalize well to the target population
- Sampling bias: certain groups are over- or under-represented in the data
- Volunteer bias: participants self-select into a study, potentially skewing results
- Survivorship bias: focusing only on entities that have "survived" a selection process
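The simplest of the resampling strategies above, random oversampling, fits in a few lines of NumPy (synthetic data; class counts are illustrative). SMOTE and ADASYN go further by interpolating synthetic minority points rather than duplicating existing ones — see the imbalanced-learn library for production-ready implementations:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical imbalanced dataset: 950 negatives, 50 positives
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)

# Random oversampling: resample the minority class with replacement
# until both classes have the same number of instances
minority = np.flatnonzero(y == 1)
n_extra = (y == 0).sum() - minority.size
extra = rng.choice(minority, size=n_extra, replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(np.bincount(y_bal))
```

One design caveat: resample only the training split, never the full dataset, or duplicated minority rows will leak between train and test.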
Temporal Effects and Data Leakage
- Temporal effects may impact model performance over time and should be identified through time series analysis and domain expertise
- Concept drift: gradual change in the statistical properties of the target variable
- Seasonality: regular and predictable patterns that repeat over fixed intervals
- Data leakage must be carefully avoided through proper data splitting and feature engineering practices
- Target leakage: training on information that would not be available at prediction time, often features derived from or influenced by the target itself
- Train-test contamination: information from test set influencing model training
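Train-test contamination often enters through preprocessing. The fix is ordering: split first, then fit any statistics (scaler means, encoders, imputers) on the training portion only. A minimal sketch with standardization (data and split point are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=10.0, scale=2.0, size=(1000, 4))

# Split FIRST, then fit preprocessing on the training portion only.
# Computing mean/std on the full dataset before splitting leaks
# test-set statistics into training (train-test contamination).
split = 800
X_train, X_test = X[:split], X[split:]

mu = X_train.mean(axis=0)   # fitted on train only
sigma = X_train.std(axis=0)

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma  # transform test with TRAIN stats

print(X_train_scaled.mean(axis=0).round(2))
```

For temporal data the split itself must also be chronological (train on the past, test on the future), or the model effectively peeks ahead.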
Data Insights and Hypothesis Generation
Feature Importance and Domain Knowledge
- Domain knowledge integration guides selection of relevant features and interpretation of observed patterns
- Feature importance analysis techniques help identify the key variables driving the target variable or outcome of interest
- Correlation analysis: measures linear relationships between features and target
- Mutual information: captures non-linear dependencies between variables
- Random forest feature importance: impurity-based measure of each feature's contribution; the closely related permutation importance measures the drop in model performance when a feature's values are randomly shuffled
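Permutation importance is model-agnostic and simple enough to sketch with a plain least-squares model standing in for the fitted estimator (synthetic data; `sklearn.inspection.permutation_importance` offers the full-featured version):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
informative = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])
y = 3.0 * informative + 0.5 * rng.normal(size=n)

# Fit a simple least-squares model as the stand-in estimator
w, *_ = np.linalg.lstsq(X, y, rcond=None)
baseline_mse = np.mean((X @ w - y) ** 2)

# Permutation importance: shuffle one column at a time and
# measure how much the model's error increases
importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importances.append(np.mean((Xp @ w - y) ** 2) - baseline_mse)

print([round(v, 2) for v in importances])
```

Shuffling the informative column destroys most of the model's predictive power, while shuffling the noise column barely moves the error.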
Clustering and Anomaly Detection
- Clustering algorithms discover natural groupings within data, potentially revealing hidden patterns or subpopulations
- K-means: partitions data into k clusters based on centroid proximity
- Hierarchical clustering: creates a tree-like structure of nested clusters
- DBSCAN: density-based clustering for identifying clusters of arbitrary shape
- Anomaly detection methods uncover unusual observations or patterns that warrant deeper analysis
- Isolation Forest: isolates anomalies by randomly partitioning the data
- One-class SVM: learns a decision boundary to classify new points as inliers or outliers
- Autoencoders: detect anomalies based on reconstruction error
- Exploratory factor analysis and principal component analysis reveal latent structures in data, leading to new feature engineering opportunities
- Factor analysis: identifies underlying latent variables explaining observed correlations
- PCA: reduces dimensionality while preserving maximum variance in the data
- Visual analytics techniques enable dynamic exploration of complex datasets, facilitating hypothesis generation
- Interactive dashboards: allow real-time filtering and exploration of data
- Linked views: connect multiple visualizations to provide different perspectives on the same data
- Formulating clear testable hypotheses based on exploratory findings guides subsequent modeling efforts and experimental design
- Null hypothesis: statement of no effect or relationship
- Alternative hypothesis: statement of expected effect or relationship
- p-value: probability of observing results as extreme as those obtained, assuming null hypothesis is true
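A t-test is the classical tool for the two-group comparison above; a permutation test is a distribution-free sketch of the same idea, and it makes the p-value definition concrete: shuffle the group labels many times and count how often a difference at least as extreme as the observed one arises under the null. The group names and parameters below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical groups, e.g. task times under two page designs
group_a = rng.normal(loc=5.0, scale=1.0, size=200)
group_b = rng.normal(loc=5.5, scale=1.0, size=200)

# Null hypothesis: no difference in means between the groups
observed = abs(group_a.mean() - group_b.mean())
pooled = np.concatenate([group_a, group_b])

count = 0
n_perm = 5000
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = abs(perm[:200].mean() - perm[200:].mean())
    if diff >= observed:
        count += 1
p_value = count / n_perm  # fraction of shuffles at least as extreme

print(f"observed diff={observed:.2f}, p-value={p_value:.4f}")
```

A small p-value here supports rejecting the null hypothesis in favor of the alternative that the two designs differ; `scipy.stats.ttest_ind` gives the parametric equivalent.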