Edge AI and Computing

Neural networks are the backbone of deep learning and are loosely inspired by the structure of the human brain. They consist of interconnected neurons organized in layers, with data flowing from the input layer to the output layer through hidden layers that learn complex patterns.

Understanding neural networks is crucial for grasping deep learning architectures. Key concepts include network structure, activation functions, forward propagation and backpropagation, and optimization algorithms. These elements work together to enable powerful machine learning models.

Neural Network Structure

Components and Layers

  • Neural networks are composed of interconnected nodes called neurons, organized into layers: an input layer, one or more hidden layers, and an output layer
  • Input layer neurons receive the initial data or features (pixel values, text embeddings), while output layer neurons produce the final predictions or classifications (image labels, sentiment scores)
  • Hidden layers are responsible for learning complex patterns and representations from the input data
    • They enable the network to capture non-linear relationships and hierarchical features
    • The number of hidden layers sets the depth of the network, and the number of neurons per layer (the width) contributes to its capacity
  • Each neuron in a layer is connected to neurons in the subsequent layer through weighted connections, where weights represent the strength and importance of the connections
    • Positive weights indicate an excitatory connection, while negative weights indicate an inhibitory connection
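
To make the layer structure concrete, the following minimal sketch counts the learnable parameters of a fully connected network. The function name `count_parameters` and the example sizes `[4, 8, 3]` are illustrative assumptions, and the sketch assumes every neuron in one layer connects to every neuron in the next and that each non-input neuron has one bias.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of neurons per layer, input layer first,
    e.g. [4, 8, 3] means 4 inputs, one hidden layer of 8, and 3 outputs.
    (Hypothetical helper for illustration only.)
    """
    weights = 0
    biases = 0
    # Pair each layer with the layer that follows it.
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        weights += fan_in * fan_out  # one weight per connection
        biases += fan_out            # one bias per non-input neuron
    return weights, biases


print(count_parameters([4, 8, 3]))  # (56, 11)
```

Adding a hidden layer or widening an existing one increases both counts, which is one rough way to compare the capacity of two architectures.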

Bias Terms and Flexibility

  • Bias terms are added to each neuron to introduce flexibility and shift the activation function, allowing the network to learn more complex patterns
    • Each bias can be viewed as the learnable weight on an additional input whose value is fixed at 1
    • Bias terms help the network adapt to different input distributions and decision boundaries
  • Weighted inputs plus a bias term give each neuron an affine transformation of its inputs (a linear map followed by a shift); applying an activation function to the result makes the overall transformation non-linear
    • The linear part is learned by adjusting the weights and biases, while the non-linearity is introduced by the activation functions
  • The flexibility provided by bias terms and non-linear activation functions enables neural networks to approximate complex functions and solve a wide range of tasks (image classification, language translation, speech recognition)
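
The shifting effect of a bias can be seen with a single one-input neuron. This is a small sketch that assumes a sigmoid activation; the function name `neuron` and the values of `w` and `b` are chosen only for illustration.

```python
import math


def neuron(x, w, b):
    """Sigmoid neuron with a single input: sigmoid(w * x + b)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


# With no bias, the output crosses 0.5 at x = 0.
print(neuron(0.0, w=1.0, b=0.0))   # 0.5

# With b = -2, the same crossing point moves to x = 2.
print(neuron(2.0, w=1.0, b=-2.0))  # 0.5

# At x = 0 the biased neuron now outputs less than 0.5.
print(neuron(0.0, w=1.0, b=-2.0) < 0.5)  # True
```

Changing `b` slides the curve left or right without changing its steepness, which is controlled by `w`.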

Activation Functions in Neural Networks

Role of Activation Functions

  • Activation functions are mathematical functions applied to the weighted sum of inputs in each neuron to introduce non-linearity into the network
    • They determine the output of a neuron based on its input
    • Non-linearity is crucial for neural networks to learn and model complex, non-linear relationships in the data
  • Without non-linear activation functions, any stack of layers collapses into a single linear transformation, severely restricting the network's representational power
  • Activation functions enable neural networks to learn and represent complex decision boundaries and mappings between inputs and outputs
    • They allow the network to capture intricate patterns and interactions in the data
    • Different activation functions have different properties and are suited for various tasks
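
The role of non-linearity can be checked numerically. The sketch below, with hypothetical 2x2 weight matrices `W1` and `W2` and no biases, suggests that two layers without an activation behave exactly like one layer whose weight matrix is the product `W2 W1`, and that inserting a ReLU between them breaks this equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Element-wise ReLU on a vector."""
    return [max(0.0, vi) for vi in v]


# Illustrative weights and input (not from the source text).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[3.0, 1.0], [2.0, 0.0]]
x = [1.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))        # layer 1 then layer 2, no activation
one_layer = matvec(matmul(W2, W1), x)         # single layer with merged weights
with_relu = matvec(W2, relu(matvec(W1, x)))   # ReLU between the two layers

print(two_layers == one_layer)  # True
print(with_relu)                # [2.0, 0.0]
```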

Common Activation Functions

  • Sigmoid: Squashes the input to a value between 0 and 1, often used in the output layer for binary classification
    • Defined as: $\sigma(x) = \frac{1}{1 + e^{-x}}$
    • Smooth and differentiable, but it saturates for large-magnitude inputs, where its gradient approaches zero (the vanishing gradient problem)
  • Hyperbolic Tangent (Tanh): Similar to sigmoid but outputs values between -1 and 1, providing a wider range
    • Defined as: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
    • Zero-centered, which can help with convergence during training
  • Rectified Linear Unit (ReLU): Outputs the input value if positive, and 0 otherwise, helping to alleviate the vanishing gradient problem
    • Defined as: $\text{ReLU}(x) = \max(0, x)$
    • Simple and computationally efficient, but can lead to "dead" neurons: if a neuron's weighted sum is consistently negative, its output and gradient are both zero, so its weights stop updating
  • The choice of activation function depends on the specific problem and the desired output range
    • Sigmoid is commonly used for binary classification in the output layer
    • ReLU is popular for hidden layers in deep networks due to its simplicity and effectiveness
    • Tanh can be used when negative outputs are desired or for zero-centering the activations
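
The three functions above can be written directly from their definitions. This is a minimal scalar sketch using only the standard library; the helper `sigmoid_grad` is an addition for illustration, and the plain `sigmoid` shown here can overflow for very large negative inputs, which production code would guard against.

```python
import math


def sigmoid(x):
    """Squash x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squash x into the range (-1, 1), centered at zero."""
    return math.tanh(x)


def relu(x):
    """Pass positive inputs through and clamp the rest to zero."""
    return max(0.0, x)


def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), tanh(x), relu(x), sigmoid_grad(x))
```

The last column illustrates saturation: the sigmoid gradient is 0.25 at `x = 0` and becomes very small at `x = -10` and `x = 10`.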

Forward and Backpropagation

Forward Propagation

  • Forward propagation is the process of passing input data through the neural network to obtain the output predictions
    • It involves computing the weighted sum of inputs for each neuron and applying the activation function
    • The outputs from one layer become the inputs for the neurons in the subsequent layer
  • In the input layer, the feature values of the input data are fed into the corresponding neurons
  • In each hidden layer, the weighted sum of inputs is calculated for each neuron: $z_j = \sum_{i} w_{ij} x_i + b_j$
    • $w_{ij}$ is the weight connecting neuron $i$ in the previous layer to neuron $j$ in the current layer
    • $x_i$ is the output of neuron $i$ in the previous layer
    • $b_j$ is the bias term for neuron $j$ in the current layer
  • The activation function is applied to the weighted sum to produce the neuron's output: $a_j = f(z_j)$
    • $f$ is the activation function (e.g., sigmoid, ReLU, tanh)
  • The forward propagation process continues until the output layer is reached, where the final predictions or class probabilities are obtained
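
The steps above can be sketched as a short forward pass. The network shape (2 inputs, 2 hidden ReLU neurons, 1 sigmoid output), the weight values, and the helper names `dense` and `forward` are all assumptions made for illustration; the indexing `W[i][j]` follows the $w_{ij}$ convention used above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def dense(x, W, b, f):
    """One layer: a_j = f(sum_i W[i][j] * x[i] + b[j])."""
    outputs = []
    for j in range(len(b)):
        z_j = sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
        outputs.append(f(z_j))
    return outputs


def forward(x, layers):
    """Feed x through each (W, b, f) layer in order."""
    a = x
    for W, b, f in layers:
        a = dense(a, W, b, f)  # outputs become the next layer's inputs
    return a


# Illustrative parameters (not from the source text).
layers = [
    ([[0.5, -1.0], [1.0, 2.0]], [0.0, 1.0], relu),  # 2 inputs -> 2 hidden
    ([[1.0], [-1.0]], [0.0], sigmoid),              # 2 hidden -> 1 output
]

prediction = forward([1.0, 2.0], layers)
print(prediction)  # a single value of about 0.18
```

Here the hidden layer produces `[2.5, 4.0]`, the output neuron's weighted sum is `-1.5`, and the sigmoid maps that to roughly 0.18.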

Backpropagation

  • Backpropagation is the algorithm used to train neural networks by updating the weights based on the error between the predicted and actual outputs
    • It enables the network to learn from its mistakes and improve its performance
  • During backpropagation, the error is calculated at the output layer using a loss function, such as mean squared error or cross-entropy
    • The loss function measures the discrepancy between the predicted outputs and the true labels
  • The error is then propagated backward through the network, using the chain rule to calculate the gradients of the loss with respect to each weight
    • The gradients indicate the direction and magnitude of the weight updates needed to minimize the error
  • The weights are updated in the opposite direction of the gradients, with the goal of minimizing the error and improving the network's performance
    • The update rule for a weight $w_{ij}$ is: $w_{ij} := w_{ij} - \eta \frac{\partial L}{\partial w_{ij}}$
    • $\eta$ is the learning rate, which determines the step size of the weight updates
    • $\frac{\partial L}{\partial w_{ij}}$ is the gradient of the loss function $L$ with respect to the weight $w_{ij}$
  • The backpropagation process is repeated iteratively, with the gradients and weight updates calculated for each training example or batch
    • The network learns to adjust its weights to minimize the error and improve its predictions over time
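
For a single sigmoid neuron, the chain rule and the update rule above can be traced by hand. The sketch below assumes a squared-error loss $L = \frac{1}{2}(a - y)^2$ and illustrative values for `w`, `b`, `x`, `y`, and `eta`; it compares the analytic gradient with a finite-difference estimate and then applies one update step.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error of a one-input sigmoid neuron."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2


def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw, and likewise for b."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = a - y            # derivative of the squared error
    da_dz = a * (1.0 - a)    # derivative of the sigmoid
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz  # dz/dw = x, dz/db = 1


# Illustrative values (not from the source text).
w, b, x, y, eta = 0.5, 0.1, 2.0, 1.0, 0.1

dw, db = gradients(w, b, x, y)

# Central finite difference as an independent check of dL/dw.
eps = 1e-6
dw_numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

# One gradient descent step: move against the gradient.
loss_before = loss(w, b, x, y)
w_new, b_new = w - eta * dw, b - eta * db
loss_after = loss(w_new, b_new, x, y)

print(dw, dw_numeric)
print(loss_before, loss_after)
```

The two gradient values should agree closely, and the loss after the step should be lower than before. In a multi-layer network the same chain-rule products are accumulated layer by layer from the output back to the input.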

Optimization Algorithms for Training

Gradient Descent Variants

  • Gradient Descent is a fundamental optimization algorithm that updates the weights in the direction of the negative gradient of the loss function
    • It aims to find the set of weights that minimize the loss and improve the network's performance
  • Batch Gradient Descent calculates the gradients for the entire training dataset before updating the weights, which can be computationally expensive for large datasets
    • It uses the average gradient over all training examples, providing a stable but slow update
  • Stochastic Gradient Descent (SGD) updates the weights based on the gradients calculated from a single randomly selected training example, making it faster but noisier
    • It approximates the true gradient by using a single example, leading to more frequent but less precise updates
  • Mini-batch Gradient Descent strikes a balance by updating the weights based on the gradients calculated from a small batch of training examples
    • It provides a trade-off between the stability of batch gradient descent and the speed of stochastic gradient descent
    • The batch size is a hyperparameter that determines the number of examples used in each update
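
The three variants differ only in how many examples contribute to each update, so one routine can cover all of them. This is a sketch under simple assumptions: a one-feature linear model, a mean squared error loss, and noise-free synthetic data from $y = 2x + 1$. The function name `minibatch_gd` and its default hyperparameters are illustrative.

```python
import random


def minibatch_gd(data, batch_size, lr=0.1, epochs=500, seed=0):
    """Fit y = w * x + b by gradient descent on mean squared error.

    batch_size = 1 gives stochastic gradient descent,
    batch_size = len(data) gives batch gradient descent,
    anything in between gives mini-batch gradient descent.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit examples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the per-example gradients over the batch.
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Synthetic data on the line y = 2x + 1 for x in [-1, 1].
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

for batch_size in (1, 4, len(data)):
    print(batch_size, minibatch_gd(data, batch_size))
```

On this easy problem all three settings should recover values close to `w = 2` and `b = 1`; the smaller batch sizes simply perform more updates per pass over the data.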

Advanced Optimization Techniques

  • Momentum is a technique that accelerates gradient descent by adding a fraction of the previous update vector to the current update, which can help the optimizer move through plateaus and shallow local minima
    • It introduces a momentum term that accumulates the gradients over time, providing a "velocity" to the weight updates
    • The momentum hyperparameter $\alpha$ controls the contribution of the previous update to the current update
  • Adaptive optimization algorithms, such as Adagrad, RMSprop, and Adam, adapt the learning rate for each weight based on its historical gradients, which often improves convergence speed
    • Adagrad adjusts the learning rate for each weight inversely proportional to the square root of the sum of its historical squared gradients
    • RMSprop modifies Adagrad to use an exponentially decaying average of the squared gradients, which avoids Adagrad's steadily shrinking learning rate
    • Adam (Adaptive Moment Estimation) combines the ideas of momentum and adaptive learning rates, using running estimates of both the first moment (mean) and second moment (uncentered variance) of the gradients
  • The choice of optimization algorithm depends on factors such as the size of the dataset, the complexity of the network architecture, and the specific problem domain
    • SGD with momentum is a common choice for large datasets and complex models
    • Adam is widely used due to its adaptability and good performance in many scenarios
    • Experimenting with different algorithms and hyperparameters is often necessary to find the best configuration for a given task
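
The momentum and Adam updates described above can be compared on a toy problem. The sketch below minimizes the hypothetical one-dimensional loss $L(w) = (w - 3)^2$; the hyperparameter values are common defaults chosen for illustration, `alpha` is the momentum coefficient, and the Adam update follows the usual bias-corrected form.

```python
import math


def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def sgd_momentum(w, steps=300, lr=0.1, alpha=0.9):
    """Gradient descent with a velocity term that accumulates past updates."""
    v = 0.0
    for _ in range(steps):
        v = alpha * v - lr * grad(w)  # keep a fraction of the previous update
        w = w + v
    return w


def adam(w, steps=2000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style first moment plus a per-weight adaptive scale."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g      # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)         # correct the bias toward zero
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


print(sgd_momentum(0.0))
print(adam(0.0))
```

Both runs start at `w = 0` and should end close to the minimum at `w = 3`. On real networks the same update rules are applied independently to every weight, and the best hyperparameters usually have to be found by experiment.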