Support Vector Machines (SVMs) are powerful supervised learning models used for classification and regression. They find the hyperplane that best separates classes in feature space by maximizing the margin between the decision boundary and the nearest data points of each class.
SVMs use mathematical optimization to balance margin maximization and error minimization. They employ kernel functions to handle non-linear relationships and can be extended to multi-class problems. Hyperparameter tuning is crucial for optimal SVM performance in various applications.
Support Vector Machines
- Supervised learning models used for classification and regression tasks
- Primary goal is to find the optimal hyperplane that maximally separates classes in feature space
- Margin represents distance between decision boundary and nearest data points from each class (support vectors)
- Maximizing the margin while minimizing classification errors leads to better generalization on unseen data
- Solves constrained optimization problem using quadratic programming techniques
- Primal form includes weight vector w and bias term b
- Dual form introduces Lagrange multipliers
- Soft-margin SVMs incorporate slack variables allowing misclassifications
- Balances trade-off between maximizing margin and minimizing classification errors
- Kernel trick enables operation in high-dimensional feature spaces without explicit coordinate computation
- Allows for non-linear decision boundaries
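The soft-margin formulation and the kernel trick are both exposed directly by common SVM libraries. Below is a minimal sketch, assuming scikit-learn's SVC and a synthetic two-moons dataset (neither specified in these notes), that fits a soft-margin RBF-kernel SVM and reports its support vectors.

```python
# Minimal sketch: fitting a soft-margin SVM with an RBF kernel on synthetic data.
# The dataset and parameter values are illustrative assumptions, not prescriptions.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# C controls the margin/error trade-off; the RBF kernel supplies the kernel trick.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
```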
Mathematical Optimization
- Optimization problem formulation maximizes margin and minimizes errors
- Primal form objective function: $\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$
- Constraints: $y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
- Dual form derived using Lagrange multipliers
- Dual optimization problem: $\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j)$
- Subject to constraints: $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n}\alpha_i y_i = 0$
- Solution yields support vectors and decision boundary
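To make the dual quantities concrete, the sketch below (again assuming scikit-learn, not specified in these notes) fits a linear-kernel SVC and checks the dual constraints $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$ from the fitted model's dual coefficients.

```python
# Sketch: inspecting the dual solution of a fitted linear-kernel SVC.
# dual_coef_ stores alpha_i * y_i for the support vectors, so the dual
# constraints 0 <= alpha_i <= C and sum(alpha_i * y_i) = 0 can be verified.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

signed_alphas = clf.dual_coef_.ravel()   # alpha_i * y_i per support vector
alphas = np.abs(signed_alphas)           # alpha_i

print("all alphas within [0, C]:", np.all((alphas >= 0) & (alphas <= C + 1e-8)))
print("sum of alpha_i * y_i (should be ~0):", signed_alphas.sum())
print("w from dual:", clf.coef_.ravel(), " b:", clf.intercept_[0])
```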
SVM Classification
Binary Classification
- Finds decision boundary separating two classes
- Maximizes margin between classes
- Decision function: $f(x) = \operatorname{sign}(w^T x + b)$
- w represents weight vector
- x represents input feature vector
- b represents bias term
- Classifies new data points based on sign of decision function
- Geometric interpretation visualizes hyperplane in feature space (2D line, 3D plane)
- Margin width calculated as $\frac{2}{\|w\|}$
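The decision function can be evaluated by hand from the learned w and b. The sketch below, assuming a linearly separable toy dataset and scikit-learn, computes $\operatorname{sign}(w^T x + b)$ directly, compares it with the library's predictions, and reports the margin width $2/\|w\|$.

```python
# Sketch: the decision function f(x) = sign(w^T x + b) for a linear SVC,
# computed by hand and compared with the library's predictions.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_.ravel()
b = clf.intercept_[0]

scores = X @ w + b                      # w^T x + b for each sample
manual_pred = (scores > 0).astype(int)  # positive score maps to class 1, else class 0

print("matches SVC.predict:", np.array_equal(manual_pred, clf.predict(X)))
print("margin width 2/||w||:", 2.0 / np.linalg.norm(w))
```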
Multi-class Classification Strategies
- One-vs-Rest (OvR) approach
- Trains K binary classifiers for K classes
- Each classifier separates one class from all others
- Assigns class with highest confidence score
- Example: For 3-class problem (A, B, C), train classifiers A vs (B,C), B vs (A,C), C vs (A,B)
- One-vs-One (OvO) approach
- Trains K(K-1)/2 binary classifiers for all possible class pairs
- Uses voting or aggregation methods to determine final class
- Example: For 3-class problem, train classifiers A vs B, A vs C, B vs C
- Error-Correcting Output Codes (ECOC)
- Encodes classes into binary problems
- Uses error-correcting techniques to enhance performance
- Example: Binary code [1,-1,1] for class A, [-1,1,1] for B, [1,1,-1] for C
- Probabilistic outputs obtained using Platt scaling
- Fits logistic regression model to SVM scores
- Provides probability estimates for each class
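As a hedged illustration of the multi-class strategies and Platt scaling, assuming scikit-learn and the Iris dataset (not part of these notes): SVC trains one-vs-one classifiers internally, OneVsRestClassifier provides one-vs-rest, and probability=True fits Platt-scaled probabilities.

```python
# Sketch: multi-class SVM strategies on a 3-class toy dataset.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One-vs-one: K(K-1)/2 = 3 pairwise decision values per sample for 3 classes.
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
print("OvO decision values per sample:", ovo.decision_function(X[:1]).shape)

# One-vs-rest: K = 3 binary classifiers, one per class.
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
print("OvR uses", len(ovr.estimators_), "binary classifiers")

# Platt scaling: fits a sigmoid to SVM scores to produce class probabilities.
prob_svm = SVC(kernel="rbf", probability=True).fit(X, y)
print("class probabilities for first sample:", prob_svm.predict_proba(X[:1]))
```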
Kernel Functions for SVMs
Common Kernel Functions
- Enable operation in high-dimensional feature spaces without explicit coordinate computation
- Replace dot products in original feature space with kernel function evaluations
- Map data to higher-dimensional space where it becomes linearly separable
- Linear kernel: $K(x, y) = x^T y$
- Equivalent to no transformation
- Suitable for linearly separable data or high-dimensional spaces
- Polynomial kernel: $K(x, y) = (\gamma\, x^T y + r)^d$
- Allows for non-linear decision boundaries
- Captures feature interactions up to degree d
- Example: For d=2, creates quadratic decision boundary
- Radial Basis Function (RBF) kernel: $K(x, y) = \exp(-\gamma \|x - y\|^2)$
- Widely used for complex, non-linear relationships
- Creates circular or elliptical decision boundaries
- Sigmoid kernel: $K(x, y) = \tanh(\gamma\, x^T y + r)$
- Inspired by neural networks
- Suitable for certain non-linear problems
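The kernel formulas above can be verified numerically. The sketch below, assuming NumPy and scikit-learn's pairwise kernel helpers, evaluates the linear, polynomial, and RBF kernels directly and checks them against the library implementations.

```python
# Sketch: the kernel formulas evaluated with NumPy and checked against
# scikit-learn's pairwise kernel implementations.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
gamma, r, d = 0.5, 1.0, 2

K_lin = X @ X.T                                         # K(x, y) = x^T y
K_poly = (gamma * X @ X.T + r) ** d                     # K(x, y) = (gamma x^T y + r)^d
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-gamma * sq_dists)                       # K(x, y) = exp(-gamma ||x - y||^2)

print(np.allclose(K_lin, linear_kernel(X)))
print(np.allclose(K_poly, polynomial_kernel(X, gamma=gamma, coef0=r, degree=d)))
print(np.allclose(K_rbf, rbf_kernel(X, gamma=gamma)))
```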
Kernel Selection and Custom Kernels
- Kernel choice depends on data distribution and problem characteristics
- Linear kernel for high-dimensional, linearly separable data (text classification)
- RBF kernel for low-dimensional, non-linear data (image recognition)
- Polynomial kernel for problems with clear degree of interaction (physical systems)
- Custom kernel functions designed for specific problem domains
- Must satisfy Mercer's theorem to ensure valid kernel matrix
- Example: String kernel for text data, graph kernel for molecular structures
- Kernel parameter tuning crucial for optimal performance
- γ in RBF kernel controls influence of single training example
- d in polynomial kernel determines complexity of decision boundary
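As a hedged example of a custom kernel, the sketch below passes a hypothetical normalized linear (cosine-style) kernel to SVC as a callable; any such kernel must return a positive semi-definite Gram matrix to satisfy Mercer's theorem.

```python
# Sketch: a custom (hypothetical) kernel passed to SVC as a callable.
# A valid custom kernel must yield a positive semi-definite Gram matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def cosine_like_kernel(A, B):
    """Normalized linear kernel: K(x, y) = x^T y / (||x|| * ||y||)."""
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_norm @ B_norm.T

X, y = make_classification(n_samples=150, n_features=10, random_state=0)
clf = SVC(kernel=cosine_like_kernel, C=1.0).fit(X, y)
print("training accuracy with custom kernel:", clf.score(X, y))
```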
SVM Hyperparameter Optimization
Key Hyperparameters and Optimization Techniques
- Regularization parameter C
- Controls trade-off between margin maximization and error minimization
- Larger C leads to harder margin, potentially overfitting
- Smaller C allows more misclassifications, potentially underfitting
- Kernel-specific parameters
- γ for RBF kernel influences decision boundary smoothness
- d for polynomial kernel determines feature interaction complexity
- Kernel function choice itself considered a hyperparameter
- Grid search evaluates model performance across parameter combinations
- Example: C values [0.1, 1, 10, 100], γ values [0.01, 0.1, 1, 10]
- Random search samples parameter combinations from specified distributions
- More efficient for high-dimensional parameter spaces
- Cross-validation assesses generalization performance
- k-fold cross-validation (k typically 5 or 10) provides robust estimates
- Bayesian optimization efficiently searches hyperparameter space
- Builds probabilistic model of objective function
- Suitable for computationally expensive models
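A minimal grid-search sketch, assuming scikit-learn and the breast-cancer dataset (illustrative choices, not from these notes), tuning C and γ with 5-fold cross-validation over the example ranges listed above.

```python
# Sketch: tuning C and gamma with grid search and 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="f1")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated F1:", search.best_score_)
```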
Preprocessing and Model Selection
- Feature scaling and normalization crucial for SVM performance
- Standardization (zero mean, unit variance) commonly used
- Min-max scaling to [0,1] range alternative for bounded features
- Model selection criteria guide optimization process
- Accuracy for balanced datasets
- F1-score for imbalanced datasets
- Area under ROC curve (AUC-ROC) for ranking performance
- Learning curves help diagnose overfitting or underfitting
- Plot training and validation performance against training set size
- Guides decisions on collecting more data or adjusting model complexity
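A closing sketch, assuming scikit-learn, that standardizes features inside a pipeline (so scaling is refit within each fold), estimates generalization with 5-fold cross-validation, and computes a learning curve over increasing training-set sizes.

```python
# Sketch: standardization inside a pipeline, cross-validated scoring,
# and a learning curve for diagnosing over/underfitting.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print("cross-validated F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Learning curve: training vs. validation score as the training set grows.
sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5), scoring="f1"
)
print("train sizes:", sizes)
print("mean validation F1 per size:", val_scores.mean(axis=1))
```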