Deformable lung image registration reaching a Dice score over 0.9 in only 10 epochs
Published: 01/12/23
Reinforcement Learning
In this example we implement a simple deep reinforcement learning solution to the classic CartPole problem. We implement a custom reward function, use DQN to approximate Q-values, and compare three strategies for striking a balance between exploration and exploitation:
- Epsilon greedy exploration
- Noisy Networks
- Thompson Sampling
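A minimal sketch of the epsilon-greedy strategy, assuming a small ``q_net`` that maps a CartPole state to two Q-values (the names here are illustrative, not the project's actual code):

.. code:: python

    import random
    import torch

    def select_action(q_net, state, epsilon):
        """Explore with probability epsilon, otherwise exploit the greedy action."""
        if random.random() < epsilon:
            return random.randrange(2)  # CartPole has two discrete actions
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q_values.argmax(dim=1).item())

Epsilon is typically decayed over training so that early episodes explore widely and later episodes mostly exploit.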
Published: 03/29/23
Image Classification
This is part of the Deep Learning course series by DEQUE AI: a simple example of a vanilla neural network on the FashionMNIST dataset.
Published: 07/19/23
Image Classification
Session 4 on CNNs, ViTs, GANs, and some practical concepts. Video and PDF are attached to the published project.
Published: 07/26/23
Audio-to-Audio, Feature Extraction
Demo of the EnCodec paper, which proposes a model for audio compression: it takes an audio signal as input and produces a compressed representation of the signal.
Published: 08/03/23
In this tutorial, we will be finetuning a pre-trained Mask R-CNN model on the Penn-Fudan Database for Pedestrian Detection and Segmentation. It contains 170 images with 345 instances of pedestrians, and we will use it to illustrate how to use the new features in torchvision to train an object detection and instance segmentation model on a custom dataset.
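A sketch of the model setup along the lines of the torchvision tutorial (two classes: background and pedestrian; the published project's exact code may differ):

.. code:: python

    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
    from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

    # Load a Mask R-CNN pre-trained on COCO and replace its box and mask heads.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    num_classes = 2  # background + pedestrian
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)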
Published: 09/27/23
Object Detection, Image Segmentation
Detectron2 is an open-source modular computer vision framework built on top of PyTorch. It is developed by the Facebook AI Research (FAIR) team and serves as the successor to the original Detectron platform. Detectron2 is designed specifically for research and development in object detection and segmentation.
This is the official tutorial of detectron2. Here, we will go through some basic usage of detectron2, including the following:
- Run inference on images or videos with an existing detectron2 model
- Train a detectron2 model on a new dataset
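For instance, inference with a COCO-pretrained Mask R-CNN looks roughly like this (a sketch based on the official tutorial; the image path is a placeholder):

.. code:: python

    import cv2
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # confidence threshold

    predictor = DefaultPredictor(cfg)
    outputs = predictor(cv2.imread("input.jpg"))
    print(outputs["instances"].pred_classes)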
Published: 09/28/23
Object Detection, Feature Extraction, Multimodal
DeepSORT-based tracking consists of the following steps:
Detection: Before objects can be tracked in each frame, they must be detected. This is done using a standard object detector like Faster R-CNN.
Feature Extraction: Extract features from the detected objects. These features will help to match objects across frames.
Data Association: Match detected objects with tracked objects from previous frames using both the bounding-box overlap (via the Hungarian algorithm) and the feature similarity; a minimal sketch of this step follows the list.
Track Management: Update tracks or create/delete tracks as necessary.
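The data-association step, sketched with IoU cost only (a full DeepSORT also mixes in appearance-feature distances; the function names here are illustrative):

.. code:: python

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def iou(a, b):
        """IoU of two boxes given as [x1, y1, x2, y2]."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def associate(tracks, detections):
        # Cost = 1 - IoU; the Hungarian algorithm finds the minimum-cost matching.
        cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
        track_idx, det_idx = linear_sum_assignment(cost)
        return list(zip(track_idx, det_idx))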
Published: 09/29/23
Segment Anything Model (SAM): a new AI model from Meta AI that can "cut out" any object, in any image, with a single click.
The Segment Anything Model (SAM) produces high-quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks.
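A minimal usage sketch with the official ``segment_anything`` package (the checkpoint path and click coordinates are placeholders):

.. code:: python

    import cv2
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
    predictor.set_image(image)
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[500, 375]]),  # a single foreground click
        point_labels=np.array([1]),           # 1 = foreground, 0 = background
        multimask_output=True,
    )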
Published: 09/29/23
Deep Learning Fundamentals, Backpropagation
Neural Network from Scratch
The code represents a neural network with functionalities to initialize the network, backpropagate the errors, update weights using gradient descent, and train using mini-batches. It also provides three utility functions: the cost, the sigmoid activation, and its derivative.
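Illustrative versions of the three utility functions the text mentions (a sketch; the project's own implementations may differ):

.. code:: python

    import numpy as np

    def cost(prediction, target):
        """Mean squared error cost."""
        return 0.5 * np.mean((prediction - target) ** 2)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        """Derivative of the sigmoid, used during backpropagation."""
        s = sigmoid(z)
        return s * (1.0 - s)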
Published: 10/05/23
Activation Functions, Deep Learning Fundamentals
Activation functions play a crucial role in neural networks, performing a vital function in hidden layers to solve complex problems and to analyze and transmit data throughout deep learning algorithms. There are dozens of activation functions, including binary, linear, and numerous non-linear variants.
In machine learning, deep neural networks, and artificial neural networks, the activation function defines the output of a node for a given set of inputs.
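A quick comparison of a few common non-linearities applied to the same input (illustrative only):

.. code:: python

    import torch
    import torch.nn as nn

    x = torch.linspace(-3, 3, 7)
    print("relu:   ", nn.ReLU()(x))
    print("sigmoid:", nn.Sigmoid()(x))
    print("tanh:   ", nn.Tanh()(x))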
Published: 10/06/23
OneFormer is the first multi-task universal image segmentation framework based on transformers.
OneFormer needs to be trained only once with a single universal architecture, a single model, and on a single dataset, to outperform existing frameworks across semantic, instance, and panoptic segmentation tasks.
OneFormer uses a task-conditioned joint training strategy, uniformly sampling different ground-truth domains (semantic, instance, or panoptic) by deriving all labels from panoptic annotations to train its multi-task model.
OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training and task-dynamic for inference, all with a single model.
Published: 10/06/23
Image Segmentation, Object Detection
In this tutorial, you will learn:
- the basic structure of Mask R-CNN;
- how to perform inference with an MMDetection detector;
- how to train a new instance segmentation model on a new dataset.
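Inference with MMDetection is a few lines (a sketch; the config and checkpoint paths are placeholders):

.. code:: python

    from mmdet.apis import inference_detector, init_detector

    config_file = "configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py"
    checkpoint_file = "checkpoints/mask_rcnn_r50_fpn_1x_coco.pth"

    model = init_detector(config_file, checkpoint_file, device="cuda:0")
    result = inference_detector(model, "demo/demo.jpg")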
Published: 10/06/23
Backpropagation, Deep Learning Fundamentals, Tensors
Automatic Differentiation with ``torch.autograd``
==================================================
When training neural networks, the most frequently used algorithm is
back propagation. In this algorithm, parameters (model weights) are
adjusted according to the gradient of the loss function with respect
to the given parameter.
To compute those gradients, PyTorch has a built-in differentiation engine
called ``torch.autograd``. It supports automatic computation of gradients for any
computational graph.
Consider the simplest one-layer neural network, with input ``x``,
parameters ``w`` and ``b``, and some loss function. It can be defined in
PyTorch in the following manner:
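(The snippet below reproduces the tutorial's one-layer network definition.)

.. code:: python

    import torch

    x = torch.ones(5)   # input tensor
    y = torch.zeros(3)  # expected output
    w = torch.randn(5, 3, requires_grad=True)
    b = torch.randn(3, requires_grad=True)
    z = torch.matmul(x, w) + b
    loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)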
Published: 10/06/23
Deep Learning Fundamentals, Backpropagation, Tensors, Activation Functions
Build the Neural Network
========================
Neural networks comprise layers/modules that perform operations on data.
The `torch.nn <https://pytorch.org/docs/stable/nn.html>`_ namespace provides all the building blocks you need to
build your own neural network. Every module in PyTorch subclasses the `nn.Module <https://pytorch.org/docs/stable/generated/torch.nn.Module.html>`_.
A neural network is a module itself that consists of other modules (layers). This nested structure allows for
building and managing complex architectures easily.
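For example, a small fully-connected classifier for 28x28 images can be defined by subclassing ``nn.Module`` (a sketch in the spirit of this tutorial; the layer sizes are the usual FashionMNIST ones):

.. code:: python

    import torch
    from torch import nn

    class NeuralNetwork(nn.Module):
        def __init__(self):
            super().__init__()
            self.flatten = nn.Flatten()
            self.linear_relu_stack = nn.Sequential(
                nn.Linear(28 * 28, 512),
                nn.ReLU(),
                nn.Linear(512, 512),
                nn.ReLU(),
                nn.Linear(512, 10),
            )

        def forward(self, x):
            x = self.flatten(x)
            return self.linear_relu_stack(x)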
Published: 10/06/23
DeepLabV3 models with ResNet-50, ResNet-101 and MobileNet-V3 backbones
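Loading one of these pretrained models from torchvision takes one line (a sketch assuming a recent torchvision; the other backbones load analogously):

.. code:: python

    from torchvision.models.segmentation import deeplabv3_resnet50

    model = deeplabv3_resnet50(weights="DEFAULT").eval()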
Published: 10/06/23
Optimizers, Tensors, Backpropagation
Optimizing Model Parameters
===========================
Now that we have a model and data, it's time to train, validate, and test our model by optimizing its parameters on
our data. Training a model is an iterative process; in each iteration the model makes a guess about the output, calculates
the error in its guess (*loss*), collects the derivatives of the error with respect to its parameters (as we saw in
the `previous section <autograd_tutorial.html>`_), and **optimizes** these parameters using gradient descent. For a more
detailed walkthrough of this process, check out this video on `backpropagation from 3Blue1Brown <https://www.youtube.com/watch?v=tIeHLnjs5U8>`__.
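The core of that iteration, sketched with SGD (``model``, ``loss_fn``, and ``dataloader`` are assumed from the earlier sections; the learning rate is illustrative):

.. code:: python

    import torch

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for X, y in dataloader:
        pred = model(X)
        loss = loss_fn(pred, y)
        optimizer.zero_grad()  # clear gradients from the previous step
        loss.backward()        # backpropagate
        optimizer.step()       # update the parameters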
Published: 10/06/23
Deep Learning Fundamentals
Saving and Loading Model Weights
PyTorch models store the learned parameters in an internal
state dictionary, called ``state_dict``. These can be persisted via the ``torch.save``
method:
import torch
from torchvision import models

model = models.vgg16(weights='IMAGENET1K_V1')
torch.save(model.state_dict(), 'model_weights.pth')
To load model weights, you need to create an instance of the same model first, and then load the parameters
using the ``load_state_dict()`` method.
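For example, mirroring the snippet above:

model = models.vgg16()  # create an untrained instance of the same architecture
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()  # switch dropout/batch-norm layers to evaluation mode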
Published: 10/06/23
Deep Learning Fundamentals, Tensors
Tensor tutorial:
Tensors are a specialized data structure that are very similar to arrays and matrices.
In PyTorch, we use tensors to encode the inputs and outputs of a model, as well as the model’s parameters.
Tensors are similar to `NumPy’s <https://numpy.org/>`_ ndarrays, except that tensors can run on GPUs or other hardware accelerators. In fact, tensors and
NumPy arrays can often share the same underlying memory, eliminating the need to copy data (see :ref:`bridge-to-np-label`). Tensors
are also optimized for automatic differentiation (we'll see more about that later in the `Autograd <autogradqs_tutorial.html>`__
section). If you’re familiar with ndarrays, you’ll be right at home with the Tensor API. If not, follow along!
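A few basics from the tutorial, as a sketch:

.. code:: python

    import numpy as np
    import torch

    data = [[1, 2], [3, 4]]
    x_data = torch.tensor(data)        # tensor from Python data

    np_array = np.array(data)
    x_np = torch.from_numpy(np_array)  # shares memory with np_array on CPU

    if torch.cuda.is_available():
        x_data = x_data.to("cuda")     # move to a hardware accelerator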
Published: 10/06/23
Deep Learning Fundamentals
Transforms
===================
Data does not always come in its final processed form that is required for
training machine learning algorithms. We use **transforms** to perform some
manipulation of the data and make it suitable for training.
All TorchVision datasets have two parameters - ``transform`` to modify the features and
``target_transform`` to modify the labels - that accept callables containing the transformation logic.
The `torchvision.transforms <https://pytorch.org/vision/stable/transforms.html>`_ module offers
several commonly-used transforms out of the box.
The FashionMNIST features are in PIL Image format, and the labels are integers.
For training, we need the features as normalized tensors, and the labels as one-hot encoded tensors.
To make these transformations, we use ``ToTensor`` and ``Lambda``.
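Putting that together (this mirrors the tutorial's snippet):

.. code:: python

    import torch
    from torchvision import datasets
    from torchvision.transforms import Lambda, ToTensor

    ds = datasets.FashionMNIST(
        root="data",
        train=True,
        download=True,
        transform=ToTensor(),  # PIL image -> normalized float tensor
        target_transform=Lambda(
            lambda y: torch.zeros(10, dtype=torch.float).scatter_(0, torch.tensor(y), value=1)
        ),  # integer label -> one-hot tensor
    )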
Published: 10/06/23
Deep Learning Fundamentals
Dataloader Tutorial: A lot of effort in solving any machine learning problem goes into
preparing the data. PyTorch provides many tools to make data loading
easy and, hopefully, to make your code more readable. In this tutorial,
we will see how to load and preprocess/augment data from a non-trivial
dataset.
Published: 10/06/23
Natural Language Processing, Conversational
In this tutorial, we explore a fun and interesting use-case of recurrent
sequence-to-sequence models. We will train a simple chatbot using movie
scripts from the `Cornell Movie-Dialogs
Corpus <https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html>`__.
Conversational models are a hot topic in artificial intelligence
research. Chatbots can be found in a variety of settings, including
customer service applications and online helpdesks. These bots are often
powered by retrieval-based models, which output predefined responses to
questions of certain forms. In a highly restricted domain like a
company’s IT helpdesk, these models may be sufficient, however, they are
not robust enough for more general use-cases. Teaching a machine to
carry out a meaningful conversation with a human in multiple domains is
a research question that is far from solved. Recently, the deep learning
boom has allowed for powerful generative models like Google’s `Neural
Conversational Model <https://arxiv.org/abs/1506.05869>`__, which marks
a large step towards multi-domain generative conversational models. In
this tutorial, we will implement this kind of model in PyTorch.
Published: 10/06/23
Profiling, Deep Learning Fundamentals
Profiling your PyTorch Module
-----------------------------
**Author:** Suraj Subramanian
PyTorch includes a profiler API that is useful to identify the time and
memory costs of various PyTorch operations in your code. Profiler can be
easily integrated in your code, and the results can be printed as a table
or returned in a JSON trace file.
.. note::
Profiler supports multithreaded models. Profiler runs in the
same thread as the operation but it will also profile child operators
that might run in another thread. Concurrently-running profilers will be
scoped to their own thread to prevent mixing of results.
.. note::
PyTorch 1.8 introduces the new API that will replace the older profiler API
in the future releases. Check the new API at `this page <https://pytorch.org/docs/master/profiler.html>`__.
Head on over to `this
recipe <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`__
for a quicker walkthrough of Profiler API usage.
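A minimal sketch with the newer ``torch.profiler`` API mentioned in the note above (the toy model is illustrative):

.. code:: python

    import torch
    from torch.profiler import ProfilerActivity, profile, record_function

    model = torch.nn.Linear(128, 64)
    inputs = torch.randn(32, 128)

    with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
        with record_function("model_inference"):
            model(inputs)

    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))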
Published: 10/07/23
Knowledge Distillation, Deep Learning Fundamentals, Loss Functions
Knowledge distillation is a technique that enables knowledge transfer from large, computationally expensive models to smaller ones without losing validity. This allows for deployment on less powerful hardware, making evaluation faster and more efficient.
In this tutorial, we will run a number of experiments aimed at improving the accuracy of a lightweight neural network, using a more powerful network as a teacher. The computational cost and the speed of the lightweight network will remain unaffected; our intervention only focuses on its weights, not on its forward pass. Applications of this technology can be found in devices such as drones or mobile phones. We do not use any external packages, as everything we need is available in ``torch`` and ``torchvision``. In this tutorial, you will learn:
- How to modify model classes to extract hidden representations and use them for further calculations.
- How to modify regular train loops in PyTorch to include additional losses on top of, for example, cross-entropy for classification.
- How to improve the performance of lightweight models by using more complex models as teachers.
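A common form of the distillation loss, as a sketch (not necessarily the tutorial's exact code): soften teacher and student logits with a temperature ``T`` and mix the KL term with the ordinary cross-entropy on the true labels.

.. code:: python

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        soft_targets = F.softmax(teacher_logits / T, dim=1)
        soft_student = F.log_softmax(student_logits / T, dim=1)
        kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce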
Published: 10/07/23
Deep Learning Fundamentals, Transfer Learning, Image Classification
Transfer Learning for Computer Vision Tutorial
==============================================
**Author**: Sasank Chilamkurthy
In this tutorial, you will learn how to train a convolutional neural network for
image classification using transfer learning. You can read more about the transfer
learning at `cs231n notes <https://cs231n.github.io/transfer-learning/>`__
Quoting these notes,
In practice, very few people train an entire Convolutional Network
from scratch (with random initialization), because it is relatively
rare to have a dataset of sufficient size. Instead, it is common to
pretrain a ConvNet on a very large dataset (e.g. ImageNet, which
contains 1.2 million images with 1000 categories), and then use the
ConvNet either as an initialization or a fixed feature extractor for
the task of interest.
These two major transfer learning scenarios look as follows:
- **Finetuning the ConvNet**: Instead of random initialization, we
initialize the network with a pretrained network, like the one that is
trained on the ImageNet 1000 dataset. The rest of the training looks as
usual.
- **ConvNet as fixed feature extractor**: Here, we will freeze the weights
for all of the network except that of the final fully connected
layer. This last fully connected layer is replaced with a new one
with random weights and only this layer is trained.
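Both scenarios, sketched with a ResNet-18 for a 2-class problem (close to the tutorial's code):

.. code:: python

    import torch.nn as nn
    from torchvision import models

    # Scenario 1: finetuning - start from ImageNet weights, train everything.
    model_ft = models.resnet18(weights="IMAGENET1K_V1")
    model_ft.fc = nn.Linear(model_ft.fc.in_features, 2)  # new 2-class head

    # Scenario 2: fixed feature extractor - freeze everything but the new head.
    model_conv = models.resnet18(weights="IMAGENET1K_V1")
    for param in model_conv.parameters():
        param.requires_grad = False
    model_conv.fc = nn.Linear(model_conv.fc.in_features, 2)  # only this layer trains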
Published: 10/07/23
Text Generation, Deep Learning Fundamentals
Language Modeling with ``nn.Transformer`` and torchtext
===============================================================
This is a tutorial on training a model to predict the next word in a sequence using the
`nn.Transformer <https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html>`__ module.
The PyTorch 1.2 release includes a standard transformer module based on the
paper `Attention is All You Need <https://arxiv.org/pdf/1706.03762.pdf>`__.
Compared to Recurrent Neural Networks (RNNs), the transformer model has proven
to be superior in quality for many sequence-to-sequence tasks while being more
parallelizable. The ``nn.Transformer`` module relies entirely on an attention
mechanism (implemented as
`nn.MultiheadAttention <https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html>`__)
to draw global dependencies between input and output. The ``nn.Transformer``
module is highly modularized such that a single component (e.g.,
`nn.TransformerEncoder <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html>`__)
can be easily adapted/composed.
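Composing the encoder stack from these modular pieces looks like this (sizes are illustrative):

.. code:: python

    import torch
    import torch.nn as nn

    encoder_layer = nn.TransformerEncoderLayer(d_model=200, nhead=2, dim_feedforward=200)
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

    src = torch.rand(35, 32, 200)  # (sequence length, batch, d_model)
    out = encoder(src)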
Published: 10/08/23
Neural Transfer Using PyTorch
=============================
**Author**: Alexis Jacq
**Edited by**: Winston Herring
Introduction
------------
This tutorial explains how to implement the `Neural-Style algorithm <https://arxiv.org/abs/1508.06576>`__
developed by Leon A. Gatys, Alexander S. Ecker and Matthias Bethge.
Neural-Style, or Neural-Transfer, allows you to take an image and
reproduce it with a new artistic style. The algorithm takes three images,
an input image, a content-image, and a style-image, and changes the input
to resemble the content of the content-image and the artistic style of the style-image.
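Style is compared via Gram matrices of feature maps; a sketch of the standard computation used in the tutorial:

.. code:: python

    import torch

    def gram_matrix(features):
        b, c, h, w = features.size()        # batch, channels, height, width
        flat = features.view(b * c, h * w)  # flatten each feature map
        G = torch.mm(flat, flat.t())        # channel-by-channel correlations
        return G.div(b * c * h * w)         # normalize by the number of elements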
Published: 10/08/23
Deep Learning Fundamentals, Image-to-Image
Exporting a Model from PyTorch to ONNX and Running it using ONNX Runtime
.. note::
As of PyTorch 2.1, there are two versions of ONNX Exporter.
* ``torch.onnx.dynamo_export`` is the newest (still in beta) exporter, based on the TorchDynamo technology released with PyTorch 2.0.
* ``torch.onnx.export`` is based on TorchScript backend and has been available since PyTorch 1.2.0.
In this tutorial, we describe how to convert a model defined
in PyTorch into the ONNX format using the TorchScript-based ``torch.onnx.export`` ONNX exporter.
The exported model will be executed with ONNX Runtime.
ONNX Runtime is a performance-focused engine for ONNX models,
which inferences efficiently across multiple platforms and hardware
(Windows, Linux, and Mac and on both CPUs and GPUs).
ONNX Runtime has proved to considerably increase performance over
multiple models as explained `here
<https://cloudblogs.microsoft.com/opensource/2019/05/22/onnx-runtime-machine-learning-inferencing-0-4-release>`__.
For this tutorial, you will need to install `ONNX <https://github.com/onnx/onnx>`__
and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`__.
You can get binary builds of ONNX and ONNX Runtime with ``pip install onnx onnxruntime``.
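A minimal export sketch with the TorchScript-based exporter (model choice and opset are illustrative):

.. code:: python

    import torch
    import torchvision

    model = torchvision.models.resnet18(weights="DEFAULT").eval()
    dummy_input = torch.randn(1, 3, 224, 224)

    torch.onnx.export(
        model,
        dummy_input,
        "resnet18.onnx",
        input_names=["input"],
        output_names=["output"],
        opset_version=17,
    )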
Published: 10/08/23
Natural Language Processing, Deep Learning Fundamentals, Sentence Similarity
A simple example of embeddings. In the context of word embeddings and methods like Word2Vec, two words will be near each other in the embedding space primarily if they show up in similar contexts in sentences, not necessarily because they have similar meanings.
An n-gram is a continuous sequence of 'n' items from a given sample of text or speech. It is commonly used in text processing and statistics to predict the next item in a sequence. For example, in the sentence "I love to play," the 2-grams (or bigrams) would be: "I love," "love to," and "to play."
The underlying principle is the distributional hypothesis, which states that words that occur in the same contexts tend to have similar meanings. So, while the primary mechanism driving the positioning of words in the embedding space is their context, there's an indirect implication about their semantic similarity.
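Extracting the bigrams from the example sentence above is a one-liner:

.. code:: python

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(ngrams("I love to play".split(), 2))
    # [('I', 'love'), ('love', 'to'), ('to', 'play')]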
Published: 10/10/23
Natural Language Processing
Using the Embedding layer
Keras makes it easy to use word embeddings. Take a look at the Embedding layer.
The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.
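For example (sizes are illustrative):

.. code:: python

    import tensorflow as tf

    # A vocabulary of 1,000 words embedded into 5 dimensions.
    embedding_layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=5)

    result = embedding_layer(tf.constant([1, 2, 3]))  # look up three word indices
    print(result.shape)  # (3, 5): one 5-dimensional vector per index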
Published: 10/10/23
Deep Learning Fundamentals
In this tutorial, we will learn how to use multiple GPUs using ``DataParallel``.
It's very easy to use GPUs with PyTorch. You can put the model on a GPU:
.. code:: python
device = torch.device("cuda:0")
model.to(device)
Then, you can copy all your tensors to the GPU:
.. code:: python
mytensor = my_tensor.to(device)
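Finally, wrapping the model in ``DataParallel`` splits each input batch across the available GPUs and gathers the outputs (a sketch with a toy model):

.. code:: python

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    model.to(torch.device("cuda:0" if torch.cuda.is_available() else "cpu"))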
Published: 10/10/23
Deep Learning Fundamentals, Backpropagation
This code demonstrates how to implement a convolutional layer (specifically 2D convolution) using SciPy's functions and then integrates it into PyTorch as a custom layer. It's important to understand that although deep learning literature often refers to this as "convolution", the operation being performed is technically "cross-correlation".
Here's a step-by-step breakdown:
What is implemented?
- Cross-correlation with learnable weights: cross-correlation is similar to convolution but without flipping the filter. The code defines a custom layer with learnable filter (or kernel) weights.
- Backward pass for gradient computation: the backward pass is implemented to compute gradients with respect to both the input and the filter. This is crucial for training neural networks.
How is it implemented?
- ScipyConv2dFunction class: forward performs the cross-correlation operation using correlate2d from SciPy. It takes in an input tensor and a filter tensor, performs the operation, and then adds a bias. The results are then saved for the backward pass. backward computes gradients for the input, filter, and bias using the convolve2d and correlate2d functions.
- ScipyConv2d class (Module): inherits from PyTorch's Module class. It defines the filter and bias as learnable parameters. In its forward method, it calls ScipyConv2dFunction to perform the operation.
- Example usage: a ScipyConv2d module is instantiated with a filter size of 3x3. A random 10x10 input is passed through this module. The backward pass is performed to compute gradients.
- Gradient check: PyTorch provides the gradcheck utility to numerically check the gradients computed during the backward pass. It's a valuable tool to ensure that custom implementations are correct. The last part of the code uses gradcheck to verify the gradients of the custom convolution operation.
Visual Explanation: Imagine having an image (input) and a small filter (like a tiny image). Cross-correlation involves sliding this filter over the image and computing the sum of element-wise products at each position. This process produces a new matrix (output). The idea behind using filters is to detect patterns or features in the input. For example, a filter might be good at detecting edges in an image.
In neural networks, the values of the filters are learnable. So, during training, the network adjusts these values to detect patterns that are most useful for a given task, say image classification.
The backward pass involves computing how much each pixel in the input and each value in the filter should change to minimize the error in the network's prediction. This is done using gradients, which tell us the direction and magnitude of the required change.
This code is essentially defining this entire process, but instead of using PyTorch's built-in convolution, it uses SciPy's functions and then wraps them in a PyTorch-compatible manner.
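A condensed sketch of the design described above (close to the PyTorch numpy-extensions tutorial; details may differ from the project's code):

.. code:: python

    import numpy as np
    import torch
    from scipy.signal import convolve2d, correlate2d
    from torch.autograd import Function
    from torch.nn import Module, Parameter

    class ScipyConv2dFunction(Function):
        @staticmethod
        def forward(ctx, input, filter, bias):
            input, filter, bias = input.detach(), filter.detach(), bias.detach()
            result = correlate2d(input.numpy(), filter.numpy(), mode="valid")
            result += bias.numpy()
            ctx.save_for_backward(input, filter, bias)
            return torch.as_tensor(result, dtype=input.dtype)

        @staticmethod
        def backward(ctx, grad_output):
            input, filter, bias = ctx.saved_tensors
            grad_output = grad_output.detach().numpy()
            grad_bias = np.sum(grad_output, keepdims=True)
            grad_input = convolve2d(grad_output, filter.numpy(), mode="full")
            grad_filter = correlate2d(input.numpy(), grad_output, mode="valid")
            return (torch.from_numpy(grad_input),
                    torch.from_numpy(grad_filter).to(torch.float),
                    torch.from_numpy(grad_bias).to(torch.float))

    class ScipyConv2d(Module):
        def __init__(self, filter_width, filter_height):
            super().__init__()
            self.filter = Parameter(torch.randn(filter_width, filter_height))
            self.bias = Parameter(torch.randn(1, 1))

        def forward(self, input):
            return ScipyConv2dFunction.apply(input, self.filter, self.bias)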
Published: 10/10/23
Deep Learning Fundamentals, Tensors
The code provided is an experiment to compare the speed of matrix multiplication on different platforms:
- Setting up the data: Two random tensors, x and y, of sizes (1, 6400) and (6400, 5000) respectively, are created using PyTorch.
- GPU computation: The code checks if CUDA (used for NVIDIA GPUs) is available with torch.cuda.is_available(). An assertion ensures that the current device is 'cuda', which means that the GPU is being used. The tensors, x and y, are transferred to the GPU with .to(device). The %timeit command measures the time it takes to do matrix multiplication of x and y on the GPU using the @ operator.
- CPU computation (with PyTorch): The tensors are transferred back to the CPU. The %timeit command measures the time taken for matrix multiplication on the CPU.
- CPU computation (with NumPy): Two random arrays, x and y, of the same sizes as before, are created using NumPy. The %timeit command then measures the time taken to multiply these arrays using NumPy's matmul function.
In essence, this code is demonstrating the speed difference between performing matrix multiplications on a GPU versus a CPU, and also between PyTorch and NumPy on a CPU.
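A rough reconstruction of the experiment (``%timeit`` from the original notebook is replaced with a simple wall-clock helper; sizes follow the text):

.. code:: python

    import time
    import numpy as np
    import torch

    def bench(fn, n=10):
        start = time.perf_counter()
        for _ in range(n):
            fn()
        return (time.perf_counter() - start) / n

    x = torch.rand(1, 6400)
    y = torch.rand(6400, 5000)

    if torch.cuda.is_available():
        xg, yg = x.to("cuda"), y.to("cuda")
        torch.cuda.synchronize()
        print("GPU:        ", bench(lambda: (xg @ yg, torch.cuda.synchronize())))

    print("CPU (torch):", bench(lambda: x @ y))

    xn, yn = np.random.rand(1, 6400), np.random.rand(6400, 5000)
    print("CPU (numpy):", bench(lambda: np.matmul(xn, yn)))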
Published: 10/11/23
Deep Learning Fundamentals
The code performs the following tasks:
- Setting up data: A set of input (x) and target (y) data points is defined. These are converted into PyTorch tensors X and Y and cast to floating-point type.
- Determining the device: The code checks if a GPU with CUDA is available and sets the device as either 'cuda' (GPU) or 'cpu'. The tensors X and Y are then transferred to the specified device.
- Defining a neural network: A simple feed-forward neural network MyNeuralNet is defined with an input layer, one hidden layer with ReLU activation, and an output layer.
- Model initialization & loss calculation: The random seed for PyTorch is set for reproducibility. An instance of the neural network is created and transferred to the device. The mean squared error (MSE) loss between the model's prediction (_Y) and the target values (Y) is computed and printed.
- Training using stochastic gradient descent (SGD): The SGD optimizer is initialized with a learning rate of 0.001. The model is trained for 50 epochs. In each epoch, the gradients are zeroed, a forward pass is done, the loss is computed, and backpropagation is performed to adjust the model's weights. The loss for each epoch is stored in the loss_history list.
- Visualizing the training loss: Using matplotlib, the loss values over the 50 epochs are plotted. This visualization helps in understanding how well the model is learning.
- Modifying the neural network: The MyNeuralNet class is redefined to return not only the output of the network but also the output of the hidden layer. This network is then trained in a manner similar to the initial training process. Loss over 50 epochs is plotted again to visualize the training progress.
- Inspecting the hidden layer output: The output of the hidden layer for the input tensor X is retrieved and printed, providing insights into the intermediate representations the neural network has learned.
Overall, the code demonstrates how to set up, define, and train a simple neural network using PyTorch, and how to visualize the training process using matplotlib. The modifications made to the neural network in the latter half of the code allow for a deeper inspection of the network's inner workings, specifically the output from the hidden layers.
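A sketch reconstructing the core of those steps (the data values and layer sizes are illustrative; the attribute names follow the description above):

.. code:: python

    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    X = torch.tensor([[1, 2], [3, 4], [5, 6], [7, 8]]).float().to(device)
    Y = torch.tensor([[3], [7], [11], [15]]).float().to(device)

    class MyNeuralNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.input_to_hidden_layer = nn.Linear(2, 8)
            self.hidden_layer_activation = nn.ReLU()
            self.hidden_to_output_layer = nn.Linear(8, 1)

        def forward(self, x):
            x = self.input_to_hidden_layer(x)
            x = self.hidden_layer_activation(x)
            return self.hidden_to_output_layer(x)

    torch.manual_seed(0)
    mynet = MyNeuralNet().to(device)
    loss_func = nn.MSELoss()
    opt = torch.optim.SGD(mynet.parameters(), lr=0.001)

    loss_history = []
    for _ in range(50):
        opt.zero_grad()
        loss = loss_func(mynet(X), Y)
        loss.backward()
        opt.step()
        loss_history.append(loss.item())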
Published: 10/11/23
Deep Learning Fundamentals, Loss Functions
Data Preparation: Lists x and y are defined, representing input and target data respectively. These lists are converted into PyTorch tensors X and Y and are set to floating-point type.
Device Configuration: The code checks if a CUDA-enabled GPU is available for computation. If available, device is set to 'cuda'; otherwise, it's set to 'cpu'. The tensors X and Y are then transferred to the chosen device.
Dataset and DataLoader Creation: A custom dataset class MyDataset is defined using PyTorch's Dataset class. This custom dataset handles the input and target data for training. An instance of the dataset ds is created using X and Y.
A DataLoader dl is defined with a batch size of 2 and shuffling enabled. This DataLoader will be used to fetch data in batches during training.
Neural Network Definition: A feed-forward neural network MyNeuralNet is defined with an input layer, a hidden layer with ReLU activation, and an output layer. An instance of this network, mynet, is created and transferred to the chosen device (either CPU or CUDA).
Loss Functions: Two methods to calculate the mean squared error loss are presented:
PyTorch’s built-in MSELoss function.
A custom function named my_mean_squared_error. The loss value using PyTorch's built-in function is computed and printed.
Intermediate Representations: The intermediate representations of the input data as it passes through the network's layers are extracted: After the input layer with the input_to_hidden_layer. After the hidden layer activation function with the hidden_layer_activation.
Throughout the code, there's an emphasis on creating a neural network and setting up the necessary components for training, such as data handling with datasets and loaders, defining the model, and calculating loss.
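A sketch of the dataset, loader, and custom loss described above (data values are illustrative):

.. code:: python

    import torch
    from torch.utils.data import DataLoader, Dataset

    class MyDataset(Dataset):
        def __init__(self, x, y):
            self.x, self.y = x, y

        def __len__(self):
            return len(self.x)

        def __getitem__(self, ix):
            return self.x[ix], self.y[ix]

    def my_mean_squared_error(pred, target):
        return ((pred - target) ** 2).mean()

    X = torch.tensor([[1, 2], [3, 4], [5, 6], [7, 8]]).float()
    Y = torch.tensor([[3], [7], [11], [15]]).float()

    ds = MyDataset(X, Y)
    dl = DataLoader(ds, batch_size=2, shuffle=True)

    for batch_x, batch_y in dl:
        print(batch_x.shape, batch_y.shape)  # torch.Size([2, 2]) torch.Size([2, 1])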
Published: 10/11/23
Deep Learning Fundamentals
In this simple PyTorch network with a dataset and data loader, our goal is to emphasize the significance of batch size: how it can affect performance metrics such as accuracy, and how it impacts memory consumption and GPU utilization. We will use a few different batch sizes in the training loop and plot the metrics for each batch size as a way to compare.
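The comparison loop boils down to rebuilding the DataLoader per run (a sketch; ``ds`` and ``train_one_model`` stand in for the project's dataset and training routine):

.. code:: python

    from torch.utils.data import DataLoader

    metrics = {}
    for batch_size in (8, 32, 128):
        dl = DataLoader(ds, batch_size=batch_size, shuffle=True)
        metrics[batch_size] = train_one_model(dl)  # e.g. returns per-epoch accuracy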
Published: 10/11/23
Optimizers, Deep Learning Fundamentals
When training a neural network, the choice of optimizer can have a significant impact on the training dynamics and the final performance of the model. SGD (Stochastic Gradient Descent) and Adam are two popular optimizers, and they have different characteristics:
Basic Differences:
SGD: This is the classical version of gradient descent optimization where the model updates its parameters in the direction of the negative gradient.
Adam (Adaptive Moment Estimation): Combines the ideas of Momentum (moving average of gradients) and RMSprop (moving average of squared gradients) to adjust the learning rate for each parameter individually.
Learning Rate:
SGD: Typically uses a constant learning rate, although there are variants with adaptive learning rates.
Adam: Computes adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. This often leads to faster convergence.
Noise:
SGD: Updates can be noisy (especially in the case of pure SGD without any momentum), which can be beneficial because this noise can help escape shallow local minima. However, it may also lead to slower convergence.
Adam: Due to its adaptive nature, it tends to be more stable than pure SGD. However, this can sometimes lead to premature convergence or getting stuck in sharp minima, which might not generalize well.
Validation Accuracy Dynamics:
SGD: Can lead to smoother curves in terms of validation accuracy because of its consistent update rule.
Adam: Given its adaptive nature, sometimes the updates can be aggressive, leading to oscillations or choppier curves in terms of validation accuracy.
Generalization:
There's ongoing research in deep learning that sometimes suggests models trained with SGD generalize better than those trained with adaptive methods like Adam, especially when trained with proper regularization and learning rate schedules. The noise introduced by SGD can act as a form of implicit regularization.
Convergence Speed:
In many cases, especially in the early stages of training, Adam can converge much faster than SGD because of its adaptive properties. However, SGD, with a well-tuned learning rate (or learning rate schedule), might lead to better generalization in the long run.
In Summary: The choppier validation accuracy curve observed with Adam compared to SGD could be attributed to Adam's adaptive learning rate adjustments, which can sometimes cause oscillations in performance. However, the choice between Adam and SGD should be based on the specific problem, dataset, and the goals of training. Sometimes, a combination of the two (e.g., starting training with Adam and then switching to SGD) can be effective. Always validate with your own experiments!
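For reference, swapping optimizers is a one-line change in PyTorch (``model`` is assumed to be defined; the learning rates are illustrative):

.. code:: python

    import torch

    sgd_opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    adam_opt = torch.optim.Adam(model.parameters(), lr=1e-3)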
Published: 10/11/23
Learning Rate, Deep Learning Fundamentals
The learning rate is one of the most critical hyperparameters in training neural networks and can significantly affect the training dynamics and model performance. It essentially dictates how much we adjust the model in response to the estimated error at each update.
Let's dive into the effects of different learning rates:
Large learning rate (e.g., 1e-2):
- Training accuracy: The model parameters can change drastically in each update. This can lead to faster convergence, but it can also cause the model to overshoot the optimal points in the parameter space and become unstable.
- Validation accuracy: Due to the large jumps in the parameter space, the model might not settle down to a good generalizable point, leading to potentially poorer validation performance.
- Training dynamics: The loss curve can be very noisy and erratic. There's a risk of diverging (i.e., the loss goes to infinity) if the learning rate is too high.
Moderate learning rate (e.g., 1e-3):
- Training accuracy: Often considered a good middle ground; the model can learn efficiently without taking overly aggressive steps.
- Validation accuracy: The model can usually generalize better because it's taking measured steps towards minima, making it likely to find a reasonable point in the parameter space.
- Training dynamics: The loss curve is smoother than with a large learning rate. Convergence is typically stable.
Small learning rate (e.g., 1e-5):
- Training accuracy: The model updates very conservatively. This can lead to very slow convergence, and it might not reach a satisfactory performance level within a reasonable number of epochs.
- Validation accuracy: If given enough time (many epochs), it might eventually generalize well, but there's also a risk of getting stuck in shallow local minima or plateaus in the loss landscape.
- Training dynamics: The loss curve will be very smooth, but the downside is the risk of extremely slow convergence.
Other Considerations:
Initial Phase vs. Late Phase: Sometimes, it's beneficial to start with a larger learning rate to quickly progress in the early stages of training and then reduce it in later stages to refine the model parameters. This strategy is often implemented using learning rate schedules or policies like step decay, exponential decay, or one-cycle learning.
Adaptive Learning Rate Algorithms: Some optimization algorithms, like Adam, adjust the learning rate based on the recent history of gradients, which can sometimes mitigate the need for manual tuning of the learning rate. However, even in such cases, the initial learning rate and how it's adjusted can play a significant role.
In Summary: The learning rate dictates the step size during training. Too large, and you risk overshooting minima and unstable training. Too small, and you might face slow convergence or getting stuck. Properly tuning the learning rate, potentially using learning rate schedules, can be key to efficient and effective training of neural networks.
Published: 10/11/23
Deep Learning Fundamentals
Layers in a neural network are a foundational concept, and the depth of a neural network (i.e., the number of layers) can have a profound impact on its performance and characteristics. Let's dive in:
Significance of Layers in a Neural Network:
- Representation Learning: Each layer in a neural network can be thought of as learning a representation of the data. In the context of deep learning, especially in convolutional neural networks (CNNs) used for image processing, the initial layers might learn to detect edges, the middle layers might learn to recognize textures or shapes, and the deeper layers might recognize more complex structures or objects. Thus, as we move deeper into the network, the representations become more abstract.
- Function Composition: Neural networks are essentially function approximators. Having multiple layers allows the network to represent a composition of functions. This composition can capture intricate patterns and relationships in the data.
- Hierarchy of Features: The hierarchical structure of deep neural networks allows them to build up a hierarchy of features from simple to complex. This is especially beneficial for tasks like image and speech recognition.
Deeper Neural Networks vs. Shallow Neural Networks:
- Capacity: Deeper networks have more parameters and, therefore, a greater capacity to learn from data. This can be advantageous for complex tasks with large datasets.
- Feature Learning: Deep networks can learn a hierarchy of features. For example, in image recognition, initial layers might detect edges, middle ones might detect shapes, and deeper layers might detect complex objects. This hierarchical feature learning often isn't achievable with shallow networks.
- Training Challenges: Training deeper networks can be more challenging due to issues like vanishing and exploding gradients. Techniques like batch normalization, skip connections (like in ResNet), and improved initialization methods have been introduced to help train very deep networks.
- Overfitting: While deeper networks can model complex functions, they are also more prone to overfitting, especially when the amount of training data is limited. Regularization techniques (like dropout) become crucial in such scenarios.
- Computational Complexity: Deeper networks usually require more computational power and memory. They might have longer training times compared to shallow networks.
- Transfer Learning: Pre-trained deep networks (trained on tasks with a large amount of data like ImageNet) can be fine-tuned for different tasks with limited data. This takes advantage of the hierarchical feature learning capability of deep networks.
- Diminishing Returns: After a certain depth, adding more layers might not lead to performance improvements and, in some cases, can even hurt performance. This is task-dependent, and finding the right depth often involves experimentation.
Published: 10/11/23
Dataloaders, Deep Learning Fundamentals
FashionMNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
The necessary libraries are imported. datasets from torchvision will be used to fetch the FashionMNIST dataset.
The FashionMNIST dataset is downloaded to the directory specified in data_folder. The train=True argument means the training set is being fetched.
Various details about the dataset, such as the shapes of the images and targets, unique labels, and class names, are printed.
This code uses matplotlib to visualize the dataset. For each unique class, 10 random images are displayed in a grid. The total grid size will be the number of unique classes by 10 columns.
Here's a quick overview of what this code accomplishes:
It downloads the FashionMNIST dataset.
It extracts the image data and their labels.
It prints information about the dataset.
It visualizes 10 random images for each class from the dataset.
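A sketch of those steps (it shows the first 10 images per class rather than a random sample):

.. code:: python

    import matplotlib.pyplot as plt
    import torch
    from torchvision import datasets

    data_folder = "./data"
    fmnist = datasets.FashionMNIST(data_folder, download=True, train=True)
    images, targets = fmnist.data, fmnist.targets
    print(images.shape, targets.shape)  # (60000, 28, 28) and (60000,)

    fig, axes = plt.subplots(10, 10, figsize=(10, 10))
    for cls in range(10):
        idxs = torch.where(targets == cls)[0][:10]
        for col, ix in enumerate(idxs):
            axes[cls, col].imshow(images[ix], cmap="gray")
            axes[cls, col].axis("off")
    plt.show()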
Published: 10/24/23
Deep Learning Fundamentals
Chain Rule
Calculate the updated weight value using the chain rule.
The chain rule is a fundamental concept in calculus and plays an essential role in training neural networks using backpropagation.
Chain Rule in Calculus: In simple terms, the chain rule provides us a technique to differentiate composite functions.
Suppose we have two functions y = g(u) and u = f(x). The composite function is y = g(f(x)).
The derivative of y with respect to x is found as:
dy/dx = (dy/du) × (du/dx)
That is, you can find the rate of change of the outer function with respect to the inner function and multiply it by the rate of change of the inner function with respect to the independent variable.
Visual Explanation: Imagine you're driving your car on a hilly road. You can think of the road's curve as a function. Now, the speed at which you're driving represents a second function, representing how your speed changes as you drive along the curve.
Now, you want to know how your speed will change (acceleration) when you reach a particular steep part of the hill. For this, you'd first find out how much steeper that part is compared to the rest of the hill (the slope or derivative of the hill's curve). Next, you'd determine how your speed changes in response to this steepness.
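The chain rule can be checked numerically with PyTorch's autograd (an illustrative example, not from the project): for y = (3x + 2)^2 we expect dy/dx = 2(3x + 2) × 3.

.. code:: python

    import torch

    x = torch.tensor(1.5, requires_grad=True)
    u = 3 * x + 2  # inner function u = f(x)
    y = u ** 2     # outer function y = g(u)
    y.backward()

    print(x.grad)                 # autograd's result: 39.0
    print(2 * (3 * 1.5 + 2) * 3)  # analytic chain-rule result: 39.0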
Published: 10/24/23
Deep Learning Fundamentals, Learning Rate
Annealing the Learning Rate:
Learning rate annealing refers to the practice of gradually decreasing the learning rate during training. It combines the benefits of both high and low learning rates. Starting with a higher learning rate can expedite initial learning, helping the model to escape from any poor local minima. Then, as training progresses, reducing the learning rate can help the model to converge to a more optimal solution in the loss landscape.
There are several strategies for annealing the learning rate (a few are sketched in code after this list):
- Step decay: Reduce the learning rate by a factor after a specified number of epochs.
- Exponential decay: Reduce the learning rate exponentially, epoch by epoch.
- ReduceLROnPlateau: Monitor a metric (like validation loss), and reduce the learning rate when the metric stops improving.
- Cosine annealing: Reduce the learning rate following a part of the cosine curve.
- Cyclic learning rates: Instead of monotonically decreasing the learning rate, increase and decrease it cyclically within a range.
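Four of these strategies, as PyTorch schedulers (``model`` is assumed to be defined; the hyperparameters are illustrative, and in practice you would pick one scheduler per optimizer):

.. code:: python

    import torch

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    step = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    expo = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
    plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

    # In the training loop, call scheduler.step() after each epoch
    # (ReduceLROnPlateau takes the monitored metric: plateau.step(val_loss)).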
When to use learning rate annealing:
- Deep Networks: Deeper architectures with many parameters tend to benefit from learning rate annealing, as they have more complex loss surfaces.
- Training from scratch: When a model is trained from scratch (as opposed to fine-tuning), annealing can be especially beneficial.
- To achieve higher accuracy: When squeezing out every bit of performance is essential, annealing can help the model converge to a slightly better minima.
When not to use it:
- Short trainings: For very short training sessions or for models with a small amount of data, the effect of annealing might not be evident.
- Transfer Learning/Fine-tuning: When you're fine-tuning a pre-trained model for a few epochs, the benefits of annealing might be minimal since the model is already starting from a good position.
- Additional Complexity: Annealing adds another dimension to hyperparameter tuning. In some scenarios, the added complexity might not be worth the potential gain.
Published: 10/11/23
Deep Learning Fundamentals
Deep Learning Task Categories
Deep Learning is a powerful subset of machine learning, capable of handling diverse tasks. Below are the key categories, including those involving images and audio:
Published: 11/24/23
Natural Language Processing
A parameter, in the context of neural networks and language models, refers to an element of the model that is learned from the training data.
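Counting them in PyTorch is a one-liner (an illustrative toy model):

.. code:: python

    import torch.nn as nn

    model = nn.Linear(512, 256)  # weight: 512*256, bias: 256
    num_params = sum(p.numel() for p in model.parameters())
    print(num_params)  # 131328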
Published: 11/24/23