Cognitive Computing in Business Unit 3 – Machine Learning Essentials
Machine learning is a subset of AI that enables systems to learn and improve from experience without explicit programming. It uses algorithms to build models based on training data, making predictions or decisions in various applications like email filtering and computer vision.
Key concepts include datasets, features, labels, and models. The machine learning process involves defining problems, preparing data, choosing models, training, evaluating, and deploying. Popular algorithms range from linear regression to neural networks, with tools like Python and TensorFlow supporting implementation.
Machine learning (ML) is a subset of artificial intelligence that focuses on building systems that can learn and improve from experience without being explicitly programmed
ML algorithms build mathematical models based on sample data, known as training data, in order to make predictions or decisions
As the models are exposed to new data, they adapt and learn from previous computations to produce reliable, repeatable decisions and results
ML is closely related to computational statistics, which focuses on making predictions using computers, but not all ML is statistical learning
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning
ML algorithms are used in a wide variety of applications (email filtering, computer vision, recommendation engines) where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks
Spam filtering is a common example, where ML algorithms learn to flag spam based on words or phrases found in the email text
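A minimal sketch of this idea using scikit-learn, with invented toy emails; CountVectorizer turns text into word counts and MultinomialNB is one common classifier choice, not the only one:

```python
# Toy spam filter: learn word patterns from labeled emails, then flag new text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "claim your free reward",
          "meeting moved to 3pm", "quarterly report attached"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam (invented examples)

vectorizer = CountVectorizer()          # turn email text into word counts
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)  # learn which words signal spam

print(model.predict(vectorizer.transform(["free prize inside"])))  # -> [1]
```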
ML enables analysis of massive quantities of data, delivering faster, more accurate results to identify profitable opportunities or dangerous risks
Key Machine Learning Concepts
Dataset is a collection of data used to train and test ML models, typically split into training, validation, and test sets
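One common way to produce the three splits with scikit-learn; the 60/20/20 ratio and the synthetic data here are illustrative choices, not a fixed rule:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 4)        # synthetic features for illustration
y = np.random.randint(0, 2, 1000)  # synthetic binary labels

# Hold out 20% as the test set, then carve 25% of the rest for validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600, 200, 200 -> 60/20/20 split
```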
Features are the input variables used in making predictions, represented as columns in tabular datasets or certain abstract qualities (texture, color, etc.) in unstructured data like images
Labels are the output variable or target that the model is trying to predict, based on the features
Model is the mathematical representation of a real-world process, trained on historical data to make predictions on new data by capturing the relationship between features and label
Training is the process of feeding data to the model so it can learn the relationships between features and label
The goal is to minimize the difference between the model's predictions and the actual labels, known as the loss or cost function
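A minimal sketch of one widely used loss, mean squared error; the labels and predictions below are made-up numbers:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: the average squared gap between labels and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# A smaller loss means predictions sit closer to the actual labels.
print(mse_loss([3.0, 5.0, 7.0], [2.5, 5.5, 7.0]))  # 0.1666...
```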
Inference is when the trained model is used to make predictions on new, unseen data
Hyperparameters are the settings used to control the model training process (learning rate, number of hidden layers in a neural network, etc.)
These are set before training and are not learned from data like model parameters
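For example, in scikit-learn hyperparameters are passed to the model's constructor before training, while the weights are learned during fit(); the values below are illustrative, not recommendations:

```python
from sklearn.neural_network import MLPClassifier

# Hyperparameters are chosen up front; the weights are learned during fit().
model = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # architecture: two hidden layers
    learning_rate_init=0.001,     # step size for weight updates
    max_iter=300,                 # training budget
)
```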
Types of Machine Learning
Supervised learning uses labeled datasets to train algorithms to classify data or predict outcomes accurately
Input data is called training data and has a known label or result (historical stock prices, images of dogs labeled "dog", etc.)
Model is trained until it can detect the underlying patterns and relationships, enabling it to yield accurate labeling results when presented with never-before-seen data, as in the sketch below
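A minimal supervised-learning sketch, using scikit-learn's built-in iris dataset as the labeled training data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # features plus known species labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on never-before-seen examples
```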
Unsupervised learning is used on data with no historical labels, allowing the algorithm to act on that data without guidance
Unsupervised learning can discover hidden patterns or data groupings (customer segments) without the need for human intervention
Commonly used for transactional data (identifying segments of customers with similar attributes who can then be treated similarly in marketing campaigns)
Semi-supervised learning uses a mix of labeled and unlabeled data, usually a small amount of labeled data with a large amount of unlabeled data
Can address real-world problems where labeled data is scarce or expensive, but unlabeled data is abundant
Reinforcement learning trains models to make a sequence of decisions by exposing the model to an environment where it trains itself based on feedback
Learns by trial and error, using reward signals to converge on the actions that lead to the best outcomes (see the toy sketch below)
Often used in gaming, robotics, and navigation
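A toy sketch of the trial-and-error idea: an epsilon-greedy agent learning which of three slot machines pays best; the payout rates are invented for the demo:

```python
import random

# Toy trial-and-error learner: three slot machines with hidden payout rates.
true_rates = [0.2, 0.5, 0.8]  # unknown to the agent; invented for the demo
estimates = [0.0, 0.0, 0.0]   # the agent's running estimate per machine
counts = [0, 0, 0]
epsilon = 0.1                 # fraction of the time spent exploring

for step in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(3)              # explore a random action
    else:
        arm = estimates.index(max(estimates))  # exploit the best estimate so far
    reward = 1 if random.random() < true_rates[arm] else 0  # environment feedback
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # update running mean

print(estimates)  # values approach the true payout rates as trials accumulate
```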
The Machine Learning Process
Define problem and gather data from various sources (databases, sensors, APIs) and different formats (tables, images, text)
Prepare data by cleaning (handling missing values, removing duplicates), transforming (scaling, encoding categorical variables), and splitting into train, validation, and test sets (an end-to-end sketch follows this list)
Exploratory data analysis examines data to understand its main characteristics (mean, standard deviation, correlation, etc.)
Feature engineering creates new input features from existing ones to improve model performance
Choose a model based on the problem type (classification, regression, clustering), data size and type, and resource constraints
Train the model on the training data, tuning hyperparameters to improve performance
Validation data provides an unbiased evaluation of model performance while tuning hyperparameters
Evaluate model performance on the test set using appropriate metrics (accuracy, precision, recall, F1-score)
Deploy model into production environment to start generating predictions on real-world data
Monitor and maintain model performance over time, retraining or updating as needed based on new data or changing environment
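A compact sketch of the preparation, training, and evaluation steps above, using scikit-learn; a built-in dataset stands in for real gathered data, and the preprocessing choices are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Gather and split the data (a built-in dataset stands in for real sources).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Prepare the data and train: imputation and scaling are chained with the model
# so the same transformations are applied consistently at inference time.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # handle missing values
    ("scale", StandardScaler()),                  # put features on a common scale
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Evaluate on the held-out test set with the metrics named above.
y_pred = pipe.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
```

Bundling preprocessing and the model in one Pipeline also means the test set never influences the fitted transformations, which avoids data leakage during evaluation.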
Popular ML Algorithms
Linear Regression predicts continuous values (house prices) by fitting a linear equation to observed data
Logistic Regression predicts binary outcomes (spam or not spam) by fitting a logistic function to observed data
Decision Trees predict outcomes by learning simple decision rules inferred from the data features
Random Forest is an ensemble of decision trees, making predictions by aggregating the predictions of multiple trees
Support Vector Machines find a hyperplane in N-dimensional space (N = number of features) that distinctly classifies the data points
Naive Bayes classifiers are a family of probabilistic algorithms based on applying Bayes' theorem with strong independence assumptions between the features
K-Means Clustering groups unlabeled data into K clusters based on feature similarity (see the comparison sketch below)
Neural Networks are inspired by biological neural networks, consisting of input, hidden, and output layers of interconnected nodes
Deep Learning uses neural networks with many hidden layers to learn hierarchical representations of data
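A sketch comparing several of these algorithms on the same small dataset (scikit-learn's iris data); the models use mostly default settings, so the scores are illustrative only:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised algorithms, scored by 5-fold cross-validated accuracy.
models = [("decision tree", DecisionTreeClassifier(random_state=0)),
          ("random forest", RandomForestClassifier(random_state=0)),
          ("SVM", SVC()),
          ("naive Bayes", GaussianNB())]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:13s} mean accuracy: {scores.mean():.3f}")

# Unsupervised: K-Means ignores the labels and groups rows into K=3 clusters.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(clusters[:10])
```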
Tools and Frameworks
Python is the most popular programming language for ML due to its extensive libraries (NumPy for numerical computing, Pandas for data manipulation, Matplotlib for visualization)
Scikit-learn is a Python library that provides a wide range of supervised and unsupervised learning algorithms
TensorFlow is an open-source library for dataflow and differentiable programming, used for ML applications such as neural networks
Keras is a high-level neural networks API written in Python, capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML
PyTorch is an open-source ML library based on Torch, used for applications such as computer vision and natural language processing, primarily developed by Facebook's AI Research lab
Apache Spark is a fast and general-purpose cluster computing system that provides APIs to work with large datasets, including Spark MLlib for ML
Cloud platforms like Amazon Web Services, Google Cloud, and Microsoft Azure offer managed ML services for building, training, and deploying models at scale
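A minimal sketch of this Python stack in one pass: NumPy generates data, Pandas holds it as a table, scikit-learn fits the model; the house-price relationship is invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# NumPy generates synthetic data, Pandas holds it as a table,
# scikit-learn fits the model. The price relationship is invented.
rng = np.random.default_rng(0)
df = pd.DataFrame({"sqft": rng.uniform(500, 3000, 200)})
df["price"] = 100 * df["sqft"] + rng.normal(0, 10_000, 200)

model = LinearRegression().fit(df[["sqft"]], df["price"])
print(model.coef_, model.intercept_)  # slope should come out near 100
```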
Real-World Business Applications
Fraud Detection uses ML to identify suspicious patterns and prevent fraudulent transactions in industries like banking and insurance
JPMorgan Chase uses ML to detect fraud and money laundering, saving $150 million annually
Recommendation Systems suggest relevant products or content to users based on past behavior and similar users' preferences
Netflix uses ML to personalize movie and TV recommendations, saving $1 billion annually in customer retention
Customer Churn Prediction helps businesses identify customers at high risk of leaving, allowing proactive retention efforts
Verizon uses ML to predict customer churn and improve retention by 1-5%
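A minimal churn-prediction sketch on synthetic customer records; the features and the churn rule are invented purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic customer records: tenure (months), monthly spend, support calls.
rng = np.random.default_rng(7)
X = np.column_stack([rng.uniform(1, 72, 500),
                     rng.uniform(20, 120, 500),
                     rng.poisson(2, 500)])
# Invented rule for the demo: short-tenure, high-call customers churn more.
y = ((X[:, 0] < 12) & (X[:, 2] > 2)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
clf = RandomForestClassifier(random_state=7).fit(X_train, y_train)

# Rank customers by churn probability to target proactive retention offers.
churn_prob = clf.predict_proba(X_test)[:, 1]
print(churn_prob[:5])
```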
Dynamic Pricing optimizes product prices in real-time based on demand, competitor prices, and other market factors
Uber uses ML for surge pricing to balance supply and demand by raising prices when demand outstrips supply
Predictive Maintenance anticipates equipment failures to allow maintenance to be scheduled before the failure occurs, preventing unexpected downtime
General Electric uses ML to predict jet engine failures, reducing flight delays and cancellations
Image and Video Analysis extracts insights from visual data for applications like medical diagnosis, defect detection in manufacturing, and autonomous vehicles
IBM Watson uses ML to assist doctors in diagnosing diseases like cancer from medical images
Natural Language Processing enables computers to understand, interpret, and generate human language for applications like sentiment analysis, chatbots, and machine translation
Google Translate uses ML to provide real-time language translation for over 100 languages
Challenges and Limitations
Data quality issues like missing values, outliers, and noise can significantly impact model performance
Garbage in, garbage out: ML models are only as good as the data they are trained on
Model interpretability is a challenge, particularly for complex models like deep neural networks
Black box models make it difficult to understand how the model arrives at its predictions, which can be problematic in regulated industries
Bias in training data can lead to biased predictions, perpetuating or even amplifying societal biases
Amazon had to scrap an ML recruiting tool that showed bias against women due to historical hiring data
Overfitting occurs when a model learns the noise in the training data to the extent that it negatively impacts the performance on new data
Regularization techniques (L1/L2 regularization, dropout) can help mitigate overfitting
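A minimal sketch of overfitting and L2 regularization: the same deliberately over-flexible polynomial model is fit with and without a ridge penalty on synthetic data. Typically the unregularized version scores near-perfectly on training data but worse on the test set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Noisy data with a simple underlying curve; a degree-15 polynomial
# has far more flexibility than the data justifies, inviting overfitting.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for name, reg in [("no regularization", LinearRegression()),
                  ("L2 / ridge", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), reg)
    model.fit(X_train, y_train)
    print(name, "train R^2:", round(model.score(X_train, y_train), 2),
          " test R^2:", round(model.score(X_test, y_test), 2))
```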
Deployment and scaling ML models in production environments can be challenging due to computational requirements and the need to retrain models as new data becomes available
Techniques like model compression and quantization can help reduce model size and inference latency
Adversarial attacks involve malicious actors manipulating input data to fool ML models and cause misclassifications
Adversarial training incorporates adversarial examples in the training data to improve model robustness
Concept drift occurs when the statistical properties of the target variable change over time, leading to a degradation in model performance
Continuous monitoring and retraining of models are necessary to adapt to changing environments