Feature Engineering and Data Preprocessing

Feature engineering and data preprocessing are crucial steps in machine learning. They involve transforming raw data into meaningful features that improve model performance. These techniques simplify data transformations, enhance accuracy, and help models generalize better to unseen data.

From selecting relevant features to handling missing data and outliers, these processes are essential for building effective ML models. Proper feature engineering and preprocessing can significantly impact a model's ability to learn patterns and make accurate predictions, making them fundamental skills in AI and machine learning.

Feature engineering for ML

Importance and goals of feature engineering

  • Feature engineering is the process of selecting, manipulating, and transforming raw data into features that machine learning models can use effectively
  • Aims to simplify and speed up data transformations while also enhancing model accuracy
  • Properly engineered features can lead to improved model performance, generalization, and interpretability
    • Well-designed features capture relevant information and patterns in the data
    • Reduces noise and redundancy, making the learning process more efficient
    • Enhances the model's ability to generalize to unseen data
  • Requires a combination of domain knowledge, intuition, and experimentation to determine which features are most relevant to the problem at hand
    • Domain expertise helps identify meaningful features and relationships
    • Intuition guides the exploration and selection of potential features
    • Experimentation involves iteratively testing and refining feature sets
  • The quality and quantity of features heavily influence the performance of machine learning models
    • High-quality features provide discriminative information for the learning algorithm
    • Sufficient number of relevant features captures the complexity of the problem

Feature engineering techniques and considerations

  • Feature construction creates new features by combining or transforming existing features
    • Mathematical operations (addition, multiplication, logarithm) can capture relationships
    • Domain-specific formulas or equations can generate informative features
    • Interaction terms represent the combined effect of multiple features
  • Feature scaling ensures that features have similar ranges or distributions
    • Normalization scales features to a specific range (0-1)
    • Standardization transforms features to have zero mean and unit variance
    • Helps prevent features with larger values from dominating the learning process
  • Handling categorical variables converts them into numerical representations
    • One-hot encoding creates binary dummy variables for each category
    • Label encoding assigns unique numerical labels to each category
    • Target encoding replaces categories with their corresponding target variable statistics
  • Dimensionality reduction techniques reduce the number of features while preserving important information
    • Principal Component Analysis (PCA) performs linear transformation to capture maximum variance
    • t-SNE and UMAP are non-linear techniques for visualizing high-dimensional data
  • Feature selection identifies the most relevant and informative features for the model
    • Filter methods rank features based on statistical properties (correlation, chi-squared)
    • Wrapper methods evaluate feature subsets by training models on each subset
    • Embedded methods perform feature selection during the model training process
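
The following Python sketch pulls several of these techniques together using pandas and scikit-learn; the small housing-style table, its column names, and its values are hypothetical and chosen only for illustration. It constructs a ratio feature and a log-transformed feature, min-max scales the numeric columns, and one-hot encodes a nominal category.

```python
# A minimal sketch of common feature engineering steps on a hypothetical table;
# column names and values are illustrative only.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "area": [50.0, 120.0, 80.0, 200.0],
    "rooms": [2, 4, 3, 6],
    "price": [150_000, 420_000, 260_000, 900_000],
    "city": ["paris", "lyon", "paris", "nice"],
})

# Feature construction: combine or transform existing columns.
df["area_per_room"] = df["area"] / df["rooms"]   # interaction-style ratio
df["log_price"] = np.log(df["price"])            # logarithm to tame skew

# Feature scaling: normalize numeric columns to the 0-1 range.
num_cols = ["area", "rooms", "area_per_room"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Categorical handling: one-hot encode the nominal 'city' column.
df = pd.concat([df.drop(columns="city"),
                pd.get_dummies(df["city"], prefix="city")], axis=1)

print(df.head())
```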

Feature selection and extraction

Filter and wrapper methods for feature selection

  • Filter methods select features based on their statistical properties, without involving any learning algorithms
    • Correlation measures the linear relationship between features and the target variable
    • Chi-squared test assesses the independence between categorical features and the target
    • Information gain quantifies the reduction in entropy achieved by splitting on a feature
    • Advantages: computationally efficient, independent of the learning algorithm
    • Disadvantages: ignore feature interactions and dependencies
  • Wrapper methods evaluate subsets of features by training a model on each subset and selecting the one with the best performance
    • Recursive Feature Elimination (RFE) iteratively removes the least important features
    • Forward selection starts with an empty set and adds features one by one
    • Backward elimination starts with all features and removes them iteratively
    • Advantages: consider feature interactions and the specific learning algorithm
    • Disadvantages: computationally expensive, prone to overfitting
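
A brief scikit-learn sketch of the contrast between the two approaches: a chi-squared filter ranks features without training any model, while RFE wraps a logistic regression and evaluates features through it. The built-in breast cancer dataset and the choice of ten features are illustrative only.

```python
# Filter method (chi-squared ranking) vs. wrapper method (RFE) on a toy task.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features

# Filter: rank features by chi-squared statistic, keep the top 10.
filter_selector = SelectKBest(score_func=chi2, k=10).fit(X, y)

# Wrapper: recursively eliminate features using a logistic regression model.
wrapper_selector = RFE(
    estimator=LogisticRegression(max_iter=5000), n_features_to_select=10
).fit(X, y)

print("Filter keeps features:", filter_selector.get_support().nonzero()[0])
print("Wrapper keeps features:", wrapper_selector.get_support().nonzero()[0])
```

The two selectors typically agree on only part of their chosen subsets, which illustrates why filter rankings are a cheap first pass while wrapper methods tailor the selection to a specific model.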

Dimensionality reduction and domain-specific techniques

  • Principal Component Analysis (PCA) is an unsupervised linear transformation technique used for dimensionality reduction and feature extraction
    • Projects data onto a lower-dimensional space while maximizing variance
    • Eigenvectors of the covariance matrix become the new feature axes (principal components)
    • Helps identify latent factors and removes correlated features
  • Autoencoders are neural networks that learn a compressed representation of the input data, enabling non-linear feature extraction
    • Encoder maps input data to a lower-dimensional representation (bottleneck)
    • Decoder reconstructs the original data from the compressed representation
    • Bottleneck layer captures the most salient features of the data
  • Domain-specific techniques extract relevant features from specific data types
    • Mel-frequency cepstral coefficients (MFCCs) for audio data
      • Represent the short-term power spectrum of a sound
      • Capture the phonetic characteristics and timbre of speech or music
    • Histogram of Oriented Gradients (HOG) for image data
      • Counts occurrences of gradient orientation in localized portions of an image
      • Captures the local shape and appearance of objects
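
A minimal PCA example with scikit-learn, shown on the standardized iris data purely for illustration; keeping two components is an arbitrary choice for this sketch.

```python
# Project standardized data onto the directions of maximum variance and
# inspect how much variance each principal component explains.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)                    # keep the first two components
X_reduced = pca.fit_transform(X_std)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```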

Data preprocessing and cleaning

Handling missing data and outliers

  • Data preprocessing involves transforming raw data into a format suitable for analysis and modeling
  • Data cleaning identifies and corrects errors, inconsistencies, and missing values in the dataset
  • Handling missing data can be done through various techniques:
    • Deletion: removing instances with missing values (listwise or pairwise deletion)
    • Imputation: filling in missing values with estimated values
      • Mean, median, or mode imputation for numerical features
      • Most frequent category imputation for categorical features
      • Advanced methods like k-Nearest Neighbors (kNN) or Matrix Factorization
  • Outlier detection and removal can be performed using different approaches:
    • Statistical methods: z-score, Interquartile Range (IQR)
      • Z-score measures how many standard deviations an observation is from the mean
      • IQR flags observations that fall more than 1.5 × IQR below the first quartile or above the third quartile
    • Density-based techniques: DBSCAN, Local Outlier Factor (LOF)
      • DBSCAN groups together densely packed points and marks isolated points as outliers
      • LOF computes the local density deviation of a point with respect to its neighbors
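
The sketch below shows simple versions of these cleaning steps with pandas and scikit-learn: mean and most-frequent imputation followed by IQR-based outlier flagging. The tiny table and its values are made up for illustration.

```python
# Mean/mode imputation plus IQR-based outlier flagging on a toy DataFrame.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 95],
    "income": [30_000, 42_000, 39_000, np.nan, 41_000, 400_000],
    "segment": ["a", "b", "b", np.nan, "a", "b"],
})

# Imputation: mean for numeric columns, most frequent category otherwise.
df[["age", "income"]] = SimpleImputer(strategy="mean").fit_transform(df[["age", "income"]])
df[["segment"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["segment"]])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
print(df[outliers])
```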

Data normalization, standardization, and encoding

  • Data normalization scales the features to a specific range (e.g., 0-1)
    • Min-max scaling: $X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$
    • Prevents features with larger values from dominating the learning process
    • Useful when the distribution of the data is unknown or non-Gaussian, though it is sensitive to outliers
  • Standardization transforms the features to have zero mean and unit variance
    • Z-score standardization: $X_{std} = \frac{X - \mu}{\sigma}$
    • Makes features comparable and suitable for certain algorithms (e.g., SVM, neural networks)
    • Works best when the data is approximately Gaussian, although this is not a strict requirement
  • Encoding categorical variables into numerical representations is necessary for most machine learning algorithms
    • One-hot encoding creates binary dummy variables for each category
      • Each category is represented by a binary vector with a single 1 and rest 0s
      • Suitable for nominal categorical variables (no inherent order)
    • Label encoding assigns unique numerical labels to each category
      • Each category is mapped to an integer value
      • Suitable for ordinal categorical variables (inherent order)
    • Target encoding replaces categories with their corresponding target variable statistics
      • Captures the relationship between the categorical variable and the target
      • Helps handle high-cardinality categorical variables, but needs care (e.g., smoothing or out-of-fold estimates) to avoid target leakage
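
A compact scikit-learn illustration of the scaling and encoding options above; the arrays are toy values chosen only to make the outputs easy to read.

```python
# Min-max normalization, z-score standardization, and one-hot vs. label-style
# (ordinal) encoding on small illustrative arrays.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, OrdinalEncoder

X = np.array([[1.0], [5.0], [10.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # scaled to the 0-1 range
print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance

colors = np.array([["red"], ["green"], ["red"], ["blue"]])

# One-hot encoding: one binary column per category (nominal variables).
print(OneHotEncoder().fit_transform(colors).toarray())

# Label-style encoding: one integer per category (ordinal variables).
print(OrdinalEncoder().fit_transform(colors).ravel())
```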

Applying feature engineering and preprocessing

Exploratory Data Analysis (EDA) and domain knowledge

  • Exploratory Data Analysis (EDA) helps understand the dataset's characteristics, identify patterns, and guide feature engineering decisions
    • Univariate analysis examines individual features (distribution, central tendency, dispersion)
    • Bivariate analysis explores relationships between pairs of features (correlation, scatter plots)
    • Multivariate analysis investigates interactions among multiple features (heatmaps, pair plots)
  • Domain knowledge is crucial for creating meaningful and informative features specific to the problem domain
    • Understanding the underlying processes and factors influencing the data
    • Identifying relevant variables and their expected relationships
    • Incorporating expert insights and industry-specific knowledge
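
A short pandas-based EDA sketch on the iris dataset, illustrating the univariate and bivariate summaries described above; a real project would go further (plots, missing-value counts, class balance).

```python
# Univariate summaries, pairwise correlations, and per-class feature means.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # feature columns plus the 'target' column

print(df.describe())                 # univariate: central tendency, dispersion
print(df.corr())                     # bivariate: linear correlations
print(df.groupby("target").mean())   # how feature means differ across classes
```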

Iterative feature selection and preprocessing pipeline

  • Feature selection techniques should be applied iteratively, evaluating the impact of each feature on the model's performance
    • Start with a broad set of potential features
    • Use filter methods to rank and select top features
    • Evaluate the performance of models trained on different feature subsets
    • Refine the feature set based on model performance and domain knowledge
  • Data preprocessing pipeline should be designed to handle the specific characteristics and requirements of the dataset
    • Determine the appropriate handling of missing data (deletion, imputation)
    • Apply suitable scaling or normalization techniques based on the data distribution and algorithm requirements
    • Encode categorical variables using appropriate methods (one-hot, label, target encoding)
  • Preprocessing steps should be applied consistently to both training and testing data to avoid data leakage and ensure the model's generalization ability
    • Data leakage occurs when information from the test set is used during training
    • Fit preprocessing transformations (scaler, encoder) on the training data and apply them to the test data
  • Cross-validation techniques, such as k-fold or stratified k-fold, help assess the robustness of the feature engineering and preprocessing choices
    • Divide the data into k subsets (folds)
    • Train and evaluate the model k times, using each fold as the test set once
    • Provides a more reliable estimate of the model's performance and generalization
  • Monitoring and updating the feature engineering and preprocessing pipeline is necessary as new data becomes available or the underlying data distribution changes over time
    • Regularly assess the performance of the model on new data
    • Adapt the feature engineering and preprocessing steps to handle evolving data characteristics
    • Retrain the model with updated features and preprocessing techniques
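
One way to tie these points together is to wrap every preprocessing step in a scikit-learn Pipeline, so that imputers, scalers, and encoders are re-fit on the training folds only during cross-validation and never see the held-out data. The synthetic DataFrame below is a stand-in for real data, and its column names are hypothetical.

```python
# A leakage-free workflow: preprocessing lives inside the Pipeline, so it is
# re-fit on each training fold during stratified k-fold cross-validation.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 200),
    "income": rng.lognormal(10, 0.5, 200),
    "city": rng.choice(["paris", "lyon", "nice"], 200),
})
y = (df["age"] + rng.normal(0, 5, 200) > 40).astype(int)  # synthetic target

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# Each fold's test data never influences the fitted scalers or encoders.
scores = cross_val_score(model, df, y, cv=StratifiedKFold(n_splits=5))
print("Mean CV accuracy:", scores.mean())
```

Because the entire preprocessing chain is part of the estimator, the same fitted pipeline can later be applied unchanged to new data, which keeps training and serving transformations consistent.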