100 Deep Learning Terms Explained

Harshit Ahluwalia 24 Apr, 2024 • 15 min read


Ever felt overwhelmed by the jargon of deep learning? You’re not alone! This field is packed with powerful concepts, but remembering every term can be challenging.

This glossary is here to bridge the gap. In this article, we will explore 100 essential deep learning terms, making complex ideas approachable and empowering you to navigate this exciting field.

So, let’s get straight into the article and understand the deep learning terms!


Why Should You Be Well Versed in Deep Learning Terms?

Understanding the language of deep learning is super important in keeping up with the latest in this fast-moving field. It helps us wrap our heads around tricky concepts, keeps us in the loop with new discoveries, lets us share knowledge effectively, and makes it easier to read and make sense of research papers and technical docs. Plus, it’s a big help when trying to solve tough problems, build and troubleshoot models, and talk shop with folks from all backgrounds. Basically, mastering the deep learning terms means we can communicate, avoid mix-ups, and make a difference in this exciting tech area.

100 Deep Learning Terms You Must Know

Here are 100 deep learning terms that you must know:

1. Artificial Neural Network (ANN)

ANN stands for Artificial Neural Network. In data science, it refers to a computational model loosely inspired by the structure and function of the human brain.

2. Activation Function

An activation function is applied to the weighted sum of a neuron’s inputs plus its bias to decide whether, and how strongly, the neuron should be activated. It aims to introduce non-linearity into a neuron’s output. Examples include sigmoid, ReLU (Rectified Linear Unit), and tanh.
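As a rough illustration, the sketch below computes a single neuron’s weighted sum plus bias and passes it through three common activations. The input values, weights, and bias are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Pass positive values through and clamp negatives to zero."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical values for illustration: z = 1*0.5 + 2*(-1.0) + 0.5 = -1.0
inputs = [1.0, 2.0]
weights = [0.5, -1.0]
bias = 0.5

out_relu = neuron_output(inputs, weights, bias, relu)          # 0.0
out_sigmoid = neuron_output(inputs, weights, bias, sigmoid)    # about 0.269
out_tanh = neuron_output(inputs, weights, bias, math.tanh)     # about -0.762
```

The same pre-activation value gives three different outputs, which is why the choice of activation function shapes what a network can learn.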

3. Backpropagation

In neural networks, if the estimated output is far from the actual output (high error), we update the weights and biases based on the error. This updating process is known as backpropagation. Backpropagation (BP) algorithms determine the loss (or error) at the output and then propagate it back through the network. The weights are updated to minimize the error attributed to each neuron. The first step in reducing the error is determining the gradient (derivative) of the loss with respect to each weight and bias.

4. Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNNs) are a powerful type of deep learning model that excels at processing data with a grid-like structure, primarily images.  They are inspired by how the human visual cortex functions and are particularly adept at tasks like image recognition, object detection, and image segmentation.

5. Deep Learning

Deep learning is a branch of machine learning built on Artificial Neural Networks (ANNs), which borrow loosely from the way the human brain works to model arbitrary functions. ANNs require a vast amount of data, and they are highly flexible when it comes to modeling multiple outputs simultaneously.

6. Epoch

An epoch is a single complete pass of the training dataset through a machine learning model. Imagine a loop where you train the model on all your data points once. Each completion of that loop is considered an epoch.

7. Feature Extraction

Feature extraction refers to transforming raw data into numerical features that can be processed while preserving the information in the original data set.

8. Gradient Descent

Gradient descent is a first-order iterative optimization algorithm used to find the minimum of a function. In machine learning, we use gradient descent to minimize the cost function and so find the best set of parameters for our model. Gradient descent can be classified as follows:

  • On the basis of data ingestion:
    • Full Batch Gradient Descent Algorithm
    • Stochastic Gradient Descent Algorithm

In full batch gradient descent, we use the whole dataset at once to compute the gradient, whereas in stochastic gradient descent, we take a single sample (or a small random subset) to compute the gradient.

  • On the basis of differentiation techniques:
    • First order Differentiation
    • Second order Differentiation
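To make the update rule concrete, here is a minimal sketch of first-order gradient descent on a toy cost function, cost(w) = (w - 3)^2, whose minimum is at w = 3. The cost function, starting point, learning rate, and step count are all assumptions picked for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

def cost(w):
    """Toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def cost_grad(w):
    """Derivative of the toy cost function."""
    return 2.0 * (w - 3.0)

w_min = gradient_descent(cost_grad, w0=0.0)  # converges very close to 3.0
```

In a real model, `grad` would be computed from the training data: from all of it in the full batch case, or from a sample in the stochastic case.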

9. Loss Function

A function that measures how far the neural network’s predictions are from the expected outcome. Training aims to minimize it.

10. Recurrent Neural Network (RNN)

RNN stands for Recurrent Neural Network. Unlike traditional ANNs that process data point by point, RNNs are specifically designed to handle sequential data, where the order of information matters.

11. Transfer Learning

Transfer learning is applying a pre-trained model to a completely new dataset. A pre-trained model is a model created by someone to solve a problem. This model can be applied to solve a similar problem with similar data.


12. Weight

A parameter within a neural network that transforms input data within the network’s layers. It is adjusted during training so that the network predicts the correct output.

13. Bias

A constant term added to the weighted sum of a neuron’s inputs that allows the model to represent patterns that do not pass through the origin.

14. Overfitting

A model is said to overfit when it performs well on the training dataset but fails on the test set. This happens when the model is too sensitive and captures random patterns that are present only in the training dataset. Two common methods to overcome overfitting are:

  • Reduce the model complexity
  • Regularization

15. Underfitting

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. It refers to a model that neither fits the training data nor generalizes to new data. An underfit model is unsuitable as it will perform poorly even on the training data.

16. Regularization

Regularization is a technique used to solve the overfitting problem in statistical models. In machine learning, regularization penalizes the coefficients so that the model can be generalized better. Different regression techniques use regularization, such as Ridge regression and lasso regression.

17. Dropout

A regularization technique for neural networks that prevents overfitting by randomly setting a fraction of input units to zero at each update during training.

18. Batch Normalization

A technique to improve the training of deep neural networks that normalizes the inputs to a layer for each mini-batch.

19. Autoencoder

A type of neural network used to learn efficient codings of unlabeled data, typically for dimensionality reduction.

20. Generative Adversarial Network (GAN)

A class of machine learning frameworks, designed by Ian Goodfellow and his colleagues, in which two neural networks (a generator and a discriminator) compete in a game.

21. Attention Mechanism

A component in complex neural networks, particularly in sequence-to-sequence models, that allows the network to focus on different parts of the input at each step rather than weighting the whole input equally, improving performance in tasks like machine translation.

22. Embedding Layer

Used primarily in neural networks that process text data, an embedding layer transforms sparse categorical data, typically indices of words, into a dense and continuous vector space where similar values are close to each other, facilitating more effective learning.

23. Multilayer Perceptron (MLP)

A type of neural network that consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. Unlike CNNs or RNNs, MLPs are fully connected, meaning each neuron in one layer connects to every neuron in the following layer.

24. Normalization

A process in data preparation that changes the range of input values (for example, pixel intensities) so that they are more consistent, typically by ensuring the mean and the standard deviation of the inputs are 0 and 1, respectively.
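The sketch below shows one way to reach a mean of 0 and a standard deviation of 1, using only the Python standard library. The pixel values are hypothetical, and the choice of the population standard deviation is an assumption.

```python
import statistics

def standardize(values):
    """Shift values to mean 0 and scale them to standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; assumes values are not all equal
    return [(v - mean) / std for v in values]

# Hypothetical pixel intensities in the range 0..255
pixels = [0.0, 64.0, 128.0, 255.0]
scaled = standardize(pixels)
```

After the transform, `statistics.fmean(scaled)` is approximately 0 and `statistics.pstdev(scaled)` is approximately 1.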

25. Pooling Layer

A layer often used in convolutional neural networks. Pooling (also called subsampling or down-sampling) reduces the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer, commonly using max or average pooling methods.

26. Sequence-to-Sequence Model

A model that comprises two parts: an encoder that processes the input and a decoder that generates the output. It’s helpful in applications where input and output are sequences, such as machine translation or speech recognition.

27. Tensor

A generalized matrix used as the basic data structure in TensorFlow and other deep learning frameworks to represent all data: a scalar is a zero-dimensional tensor, a vector is a one-dimensional tensor, and a matrix is a two-dimensional tensor.

28. Backbone Network

A pre-trained network used as the base of another task-specific architecture, often for feature extraction in tasks like object detection, where the high-level features from the backbone are used to make predictions.

29. Fine-tuning

The process of taking a pre-trained deep learning model (the network has already been trained on a related task) and continuing the training on a new dataset specific to a second task, which can be smaller in size, leveraging the learned features.

30. Hyperparameters

Parameters that define the network architecture (like the number of layers and the number of nodes per layer) and aspects of the training process (like learning rate, batch size, and number of epochs). They are set before training and directly control the behavior of the training algorithm.

31. Learning Rate

The size of the training algorithm’s step on the loss surface. A smaller learning rate might make the training more reliable but also make it slower to converge.

32. Softmax Function

An activation function commonly used as the final layer of a neural network for multi-class classification. It converts the output logits into probabilities by dividing the exponential of each output by the sum of the exponentials of all outputs.
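A minimal sketch of that formula follows. Subtracting the largest logit before exponentiating is a common numerical-stability trick and an addition here, not part of the definition; it does not change the result. The logits are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # stability trick: avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-class problem
probs = softmax([2.0, 1.0, 0.1])  # roughly [0.659, 0.242, 0.099]
```

The largest logit receives the largest probability, and the outputs always sum to 1.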

33. Long Short-Term Memory (LSTM)

A special kind of RNN capable of learning long-term dependencies, using gates that regulate the flow of information.

34. Vanishing Gradient Problem

A challenge in training deep neural networks where gradients, during backpropagation, get smaller and smaller as they are propagated back through the layers, leading to very slow or stalled learning in layers close to the input.

35. Exploding Gradient Problem

A problem where large error gradients accumulate and result in very large updates to neural network model weights during training, potentially causing the model to fail to converge or even to diverge.

36. Data Augmentation

Techniques used to increase the amount of data by adding slightly modified copies of existing data or synthetic data newly created from existing data, such as rotating, flipping, scaling, or cropping images in the context of image processing.

37. Batch Size

The number of training examples utilized in one iteration (a single batch) of the model training.

38. Optimizer

An algorithm or method used to change the neural network’s attributes, such as weights and learning rate, to reduce the loss. Common optimizers include SGD (Stochastic Gradient Descent), Adam, and RMSprop.

39. F1 Score

A measure of a test’s accuracy that considers both the precision and the recall of the test to compute the score: 2 * (precision * recall) / (precision + recall). It is particularly useful when the class distribution is uneven.

40. Precision

A metric that quantifies the number of correct positive predictions made. It is defined as the number of true positives divided by the number of true positives plus the number of false positives.

41. Recall

Also known as sensitivity, recall quantifies the number of correct positive predictions made out of all positive predictions that could have been made. It is calculated as the number of true positives divided by the number of true positives plus the number of false negatives.
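The three metrics above (precision, recall, and F1 score) can be computed directly from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels encoded as 0 and 1; the label lists are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)  # 0.75, 0.75, 0.75
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; libraries differ in how they handle that edge case.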

42. ROC Curve

A graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied, created by plotting the true positive rate (recall) against the false positive rate.

43. Area Under the Curve (AUC)

In machine learning, AUC is used to compare how well models distinguish between classes. It is the area under the ROC curve; a higher AUC indicates a better-performing model.

44. Early Stopping

A form of regularization used to avoid overfitting when training learners with an iterative method, such as gradient descent. Training is stopped as soon as the performance on a validation dataset starts to degrade.
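A minimal sketch of the stopping rule, working on a list of per-epoch validation losses. The `patience` parameter (how many non-improving epochs to tolerate) is an assumption borrowed from common practice; with `patience=1` the rule stops at the first sign of degradation, as described above. The loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss       # validation loss improved
            bad_epochs = 0
        else:
            bad_epochs += 1   # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs

# Hypothetical validation losses: improvement stops after epoch 2
val_losses = [0.90, 0.70, 0.60, 0.65, 0.70]
stop_epoch = early_stopping_epoch(val_losses)                # 3
stop_epoch_patient = early_stopping_epoch(val_losses, 2)     # 4
```

In practice, the weights from the best epoch (epoch 2 here) are usually the ones kept.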

45. Feature Scaling

A method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing phase.

46. Generative Model

A type of statistical model that can generate values from a data distribution, both those that are observed and those that are unobserved. Common examples in deep learning include Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs).

47. Discriminative Model

A model that classifies input data; that is, it predicts the label of given inputs based on the training data. Common examples include most supervised learning models, such as logistic regression and neural networks.

48. Data Imbalance

A situation in a dataset where the number of observations per class is not equally distributed. Typically, this poses a challenge for predictive modeling as most algorithms are designed to maximize overall accuracy.

49. Dimensionality Reduction

The process of reducing the number of random variables under consideration by obtaining a set of principal variables. Techniques such as PCA (Principal Component Analysis), t-SNE, and autoencoders are often used.

50. Principal Component Analysis (PCA)

A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

51. Nonlinear Activation Functions

Functions used in neural networks that help the model learn complex data patterns. Examples include sigmoid, tanh, and ReLU (Rectified Linear Unit).

52. Batch Training

A training methodology in neural networks where the model weights are updated after processing the entire dataset rather than individual data points or small batches.

53. Stochastic Gradient Descent (SGD)

A simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Unlike batch gradient descent, which calculates the gradient from the entire dataset, SGD updates the parameters using only one data point at a time.

54. Activation Maps

Visual representations of the specific activations within various layers of a deep learning model, typically within a CNN. These maps can help in understanding which features of the input data are activating certain filters or neurons.

55. Zero-Shot Learning

A classification problem where none of the classes in the test set have been seen during training; the model has to generalize from those seen to unseen classes.

56. One-Shot Learning

A classification task where the learning algorithm is given only a single example of each class before making predictions about new instances.

57. Few-Shot Learning

An approach to machine learning where the model is trained with a very small amount of labeled data, typically one to five examples per class.

58. Adversarial Examples

Inputs that have been slightly modified to fool a machine learning model. These are typically used to evaluate the robustness of models in tasks such as image classification.

59. Capsule Networks (CapsNets)

A type of deep neural network that tries to capture spatial hierarchies between features through capsules, groups of neurons that learn to recognize objects and their relative relationships in space, potentially overcoming some limitations of CNNs.

60. Attention Layers

Layers commonly used in sequence prediction problems that help the model focus on specific parts of the input sequence, improving the model’s ability to handle long sequences without losing information from earlier steps.

61. Skip Connections

A technique used in designing deep neural networks to mitigate the vanishing gradient problem by skipping one or more layers. Commonly found in architectures like ResNet, where outputs from an earlier layer are added to outputs of a later layer to help preserve the gradient.

62. Siamese Networks

A neural network architecture that contains two or more identical subnetworks. Siamese networks are ideal for tasks that involve finding the similarity or relationship between two comparable things, such as in face verification systems.

63. Triplet Loss

A loss function used to learn useful embeddings by comparing a baseline input (the anchor) to a positive input (similar) and a negative input (dissimilar). It ensures that the baseline input is closer to the positive input than to the negative input by some margin.
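As a sketch, the loss for one triplet can be written as max(d(baseline, positive) - d(baseline, negative) + margin, 0), where the baseline input is often called the anchor. Using plain Euclidean distance is an assumption here (some formulations use squared distance), and the 2-D embeddings are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer than the negative by at least the margin."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

# Hypothetical 2-D embeddings
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

satisfied = triplet_loss(anchor, positive, negative)  # 1 - 5 + 1 < 0, so loss is 0.0
violated = triplet_loss(anchor, negative, positive)   # 5 - 1 + 1 = 5.0
```

A loss of zero means the triplet already satisfies the margin and contributes no gradient.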

64. Self-Supervised Learning

A type of machine learning where the training data provides the supervision, as the input data itself is used to generate labels. This is commonly used in scenarios where labeled data is scarce or expensive.

65. Cross-Entropy Loss

A loss function often used in classification tasks. It measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.
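For a single example with one correct class, cross-entropy reduces to the negative log of the probability the model assigned to that class. The sketch below assumes this single-label case; the small `eps` guard against log(0) and the probability vectors are assumptions for illustration.

```python
import math

def cross_entropy(probs, true_index, eps=1e-12):
    """Negative log-probability assigned to the correct class."""
    return -math.log(max(probs[true_index], eps))

# Hypothetical predictions for an example whose true class is index 0
confident_right = cross_entropy([0.9, 0.05, 0.05], 0)  # about 0.105
confident_wrong = cross_entropy([0.1, 0.8, 0.1], 0)    # about 2.303
```

The loss grows quickly as the probability given to the correct class shrinks, which penalizes confident wrong predictions most heavily.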

66. Sequence Modeling

A type of model in deep learning designed to handle sequential data such as time series or text. Examples include RNNs, LSTMs, and GRUs, which can learn from the temporal structure of data.

67. Spatial Transformer Networks

A CNN module that explicitly allows the spatial manipulation of data within the network. This can improve the geometric invariance of the model, as it can spatially transform feature maps to focus on relevant regions within the data.

68. Teacher Forcing

A technique used in training RNNs where the target output from the previous time step is used as the current input rather than the output generated by the network. This method helps stabilize and speed up training.

69. Neural Style Transfer

An algorithm that blends two images—the content of one and the artistic style of another—using convolutional neural networks. This process allows the model to learn and apply one image’s stylistic elements to another’s content.

70. Label Smoothing

A technique used to make the model less confident about its predictions by changing the way labels are represented. Instead of using hard labels (1s and 0s), label smoothing uses values slightly less than 1 and greater than 0, often leading to improved model generalization.
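One common formulation, assumed here, mixes the one-hot label with a uniform distribution over all K classes: each entry becomes (1 - epsilon) * y + epsilon / K. The value epsilon = 0.1 is a typical choice rather than a requirement.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Blend a one-hot label with a uniform distribution over the classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]

# Hypothetical four-class label whose true class is index 1
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0])  # [0.025, 0.925, 0.025, 0.025]
```

The smoothed label still sums to 1, so it can be used wherever the hard label was.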

71. Lookahead Optimizer

A type of optimizer that periodically updates the model weights by interpolating between the current weights and the weights from several steps ago, helping to stabilize the optimization trajectories.

72. Beam Search

An algorithm used to improve the quality of predictions in sequence modeling, particularly in natural language processing. Instead of committing to the single most likely next token at each step, it keeps track of the k most likely sequence paths.

73. Knowledge Distillation

A method where a smaller model, referred to as the “student,” is trained to reproduce the behavior of a much larger pre-trained model, or the “teacher.” This technique allows the deployment of powerful models in resource-constrained environments.

74. T-SNE (t-Distributed Stochastic Neighbor Embedding)

A machine learning algorithm for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. It converts affinities of data points to probabilities and minimizes the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.

75. Gradient Clipping

A technique used to counter the exploding gradient problem during training. It involves clipping the gradients during backpropagation to prevent them from exceeding a defined threshold.
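A minimal sketch of one common variant, clipping by global norm: if the gradient vector’s length exceeds the threshold, the whole vector is rescaled so its length equals the threshold. The threshold and gradient values are hypothetical, and clipping each value independently is another variant not shown here.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already within the threshold
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)     # norm 5 -> about [0.6, 0.8]
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 -> left as is
```

Rescaling the whole vector keeps the gradient’s direction and only limits the step size.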

76. Meta-Learning

Sometimes referred to as “learning to learn,” it involves training a model on various learning tasks such that it can solve new learning tasks using only a small number of training samples.

77. Neural Architecture Search (NAS)

An area of machine learning that focuses on automating the design of artificial neural networks. NAS uses reinforcement learning, evolutionary algorithms, or gradient-based methods to generate optimal architectures for a given task.

78. Quantization

The process of reducing the number of bits representing the numbers in a neural network. Quantization reduces the model size and increases inference speed, making it suitable for deployment on mobile devices with limited computational resources.
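To illustrate the idea, the sketch below maps floating-point weights to 8-bit integers using a single scale factor (symmetric quantization) and then maps them back. This particular scheme and the weight values are assumptions; real frameworks offer several schemes, including per-channel scales and zero points.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]

# Hypothetical weights
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)   # q == [50, -127, 0, 100]
restored = dequantize(q, scale)     # close to the original weights
```

Each restored weight differs from the original by at most half the scale, which is the precision traded for the smaller representation.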

79. Self-Attention

A mechanism, central to the Transformer architecture, that has proven effective in many NLP tasks by enabling models to weigh the importance of different words within a sentence or document relative to each other.

80. Transformer Models

A type of neural network architecture that eschews recurrence and instead relies entirely on self-attention mechanisms to draw global dependencies between input and output, which has been revolutionary in tasks like translation and text generation.

81. BERT (Bidirectional Encoder Representations from Transformers)

A method from Google that pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

82. Tokenization

In NLP, tokenization is splitting a piece of text into smaller units, called tokens, which can be either words, characters, or subwords. This is often one of the first steps in processing text to be used by a neural network.

83. Word Embeddings

A type of word representation that allows words with similar meanings to have a similar representation. They are a set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers.

84. Positional Encoding

In the Transformer model architecture, since self-attention mechanisms don’t inherently capture the sequence order, positional encodings are added to input embeddings to provide some information about the relative or absolute position of the tokens in the sequence.

85. Graph Neural Networks (GNNs)

A type of neural network that operates directly on graph-structured data. These networks capture the dependencies within a graph via message passing between its nodes.

86. Reinforcement Learning

A type of machine learning where an agent learns to behave in an environment by performing certain actions and receiving rewards or penalties. This learning method is based on trial and error guided by a reward signal and is used in scenarios like game-playing and autonomous vehicles.

87. Experience Replay

In reinforcement learning, experience replay involves storing the agent’s experiences at each time step instead of running Q-learning on state-action pairs as they occur. Later, these experiences can be replayed in batches to the agent, breaking the temporal correlations and smoothing over changes in the data distribution.

88. Curriculum Learning

A training strategy that starts by learning easier aspects of a task or earlier stages of a complex task and gradually increases the difficulty level. This approach is inspired by how humans learn and can lead to faster convergence and better performance.

89. Model Pruning

The process of algorithmically removing parameters from an existing neural network without significantly affecting its performance. Pruning helps in reducing the computational cost of deploying models and can also decrease the model size.

90. Continuous Learning

Also known as lifelong learning, this is a form of machine learning where the algorithm continually learns and adapts to new data without forgetting its previous knowledge. This is crucial for applications that operate in dynamic environments.

91. Bias-Variance Tradeoff

A fundamental problem in supervised learning: reducing a model’s bias tends to increase its variance, and vice versa. The tradeoff limits the accuracy a model can attain on unseen data, so the goal is a model complex enough to capture the underlying pattern but not so complex that it fits noise in the training set.

92. Catastrophic Forgetting

A phenomenon where a neural network forgets previously learned information upon learning new information. It is a significant challenge in continuous learning.

93. Multimodal Learning

This approach involves training a model on data from multiple modalities, such as a dataset containing images and text. It helps in learning richer representations by combining information from different sources.

94. Anomaly Detection

The identification of rare items, events, or observations that raise suspicions by differing significantly from the majority of the data. This is particularly useful in fraud detection, network security, and fault detection.

95. Out-of-Distribution Detection

Identifying data samples that differ in some way from the training distribution. This is critical in safety-critical applications like autonomous driving, where the model must recognize and handle situations it has not been explicitly trained on.

96. Convolution

A mathematical operation used in the inner workings of convolutional neural networks. It involves taking the dot product of a small matrix of numbers (the kernel) with each part of a larger matrix to produce a new matrix, effectively filtering the original.
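The sliding dot product described above can be sketched in plain Python as follows. This version assumes no padding and a stride of 1, and, like most deep learning frameworks, it does not flip the kernel (strictly speaking, that makes it a cross-correlation). The image and kernel values are hypothetical.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image, taking a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Hypothetical 3x3 image and 2x2 kernel
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = conv2d(image, kernel)  # [[-4, -4], [-4, -4]]
```

A 3x3 input and a 2x2 kernel produce a 2x2 output because the kernel fits in only two positions along each axis.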

97. Pooling (continued)

Specifically, pooling layers in CNNs reduce the dimensionality of each feature map while retaining the most important information, which makes detected features more robust to small shifts and distortions in the input and reduces the computational load. Common types of pooling include max pooling and average pooling, which respectively take the maximum and average values of the input region.
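A small sketch of both pooling types on a 4x4 feature map follows. It assumes non-overlapping square windows (stride equal to the window size); the feature map values are hypothetical.

```python
def pool2d(fmap, size=2, reducer=max):
    """Apply a reducer (max or average) to each non-overlapping window."""
    pooled = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reducer(window))
        pooled.append(row)
    return pooled

def average(values):
    """Arithmetic mean of a window."""
    return sum(values) / len(values)

# Hypothetical 4x4 feature map
fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
max_pooled = pool2d(fmap)                    # [[6, 4], [8, 9]]
avg_pooled = pool2d(fmap, reducer=average)   # [[3.75, 2.25], [4.5, 4.0]]
```

Either way, the 4x4 map shrinks to 2x2, cutting the number of values passed to the next layer by a factor of four.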

98. Dilated Convolutions

Also known as atrous convolutions, these involve inserting spaces into the kernel of a convolutional layer, effectively expanding its field of view without increasing the number of parameters or the amount of computation. This is useful for tasks that require understanding larger contexts, such as semantic image segmentation.

99. Sequence-to-Sequence Learning

A process in deep learning where the model is trained to convert sequences from one domain (e.g., sentences in English) to sequences in another (e.g., sentences in French). This model architecture typically involves an encoder-decoder framework and is central to machine translation and speech recognition applications.

100. Attention Mechanisms

Expanding on the attention mechanism described earlier in this list, these mechanisms allow models to focus on different parts of the input sequence as needed to generate the output sequence, improving the model’s ability to handle long sequences in tasks like text summarization and machine translation. Variants like multi-headed attention offer the ability to attend to information from different representation subspaces at different positions.


With these 100 deep learning terms, you have a broad spectrum of deep learning concepts covering architectures, processes, strategies, and specific techniques. Each term is critical in building the foundational knowledge required to engage with the current state of, and ongoing developments in, AI and machine learning. Whether for educational purposes or as a reference for professionals, this list encapsulates essential terminology in deep learning.

