Evolution of TPUs and GPUs in Deep Learning Applications
This article was published as a part of the Data Science Blogathon.
This article briefly discusses some research papers that use Graphics Processing Units and Tensor Processing Units in Deep Learning applications.
What are GPUs?
GPU stands for Graphics Processing Unit: specialized hardware that accelerates graphics rendering in a wide range of computer applications. A GPU can process many pieces of data simultaneously, quickly, and efficiently, which is why it is used to train heavy Machine Learning and Deep Learning workloads. It is also heavily used in gaming.
What are TPUs?
TPU stands for Tensor Processing Unit. It is also specialized hardware used to accelerate the training of Machine Learning models, but TPUs are more application-specific than GPUs. GPUs offer more flexibility for irregular computations, while TPUs are well optimized for processing large batches of CNNs thanks to their specially designed Matrix Multiply Unit.
The research papers that we have used in this article are:
Paper 1: Specialized Hardware And Evolution In TPUs For Neural Networks
Paper 2: Performance Analysis and CPU vs GPU Comparison for Deep Learning
Paper 3: Motivation for and Evaluation of the First Tensor Processing Unit
Let’s get started! 😉
This paper covers the progression of TPUs, from the first-generation TPU to the Edge TPU, and their architectures. The study examines CPUs, GPUs, FPGAs, and TPUs, looking at their hardware designs, similarities, and differences. Modern neural networks are widely employed, but they demand considerable time, processing power, and energy. Driven by market demand and economic considerations, production of various ASICs (application-specific integrated circuits) and research in this area are increasing. Many CPU, GPU, and TPU models are built to support these networks and improve the training and inference phases: Intel makes CPUs, NVIDIA makes GPUs, and Google created the Cloud TPU. CPUs and GPUs are sold to corporations, while Google offers TPU processing to everyone through the cloud. Moving data away from the computational source raises the total cost, so organizations adopt memory management and caching solutions close to the ALUs to lower it.
Artificial Intelligence is the most widely adopted technology in the industry, and Neural Networks are very widely employed. A CPU can process neural networks, but it takes a lot of time to do so; this is the first drawback, concerning the CPU. GPUs, on the other hand, are 200-250 times faster than CPUs on Deep Learning and Neural Network workloads, but they are very costly compared to CPUs; this is the second drawback. TPUs, in turn, are much faster than GPUs: almost ten times faster. However, to simplify design and debugging, the TPU does not fetch instructions to execute directly; instead, the host server sends instructions to the TPU. Cost is again a drawback for the TPU. Since Google developed the TPU, it is present only in Google data centers, so we cannot own one ourselves; this is the third drawback, concerning the TPU. We can, however, access TPU servers through a Google service named Google Colab.
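To put the relative speeds quoted above into perspective, here is a small back-of-the-envelope calculation. Note that the 200-250x and ~10x figures are the paper's ballpark estimates, not measured benchmarks, and the baseline time is arbitrary:

```python
# Rough relative-speed comparison using the multipliers quoted above.
cpu_time = 1000.0          # arbitrary baseline: 1000 time units on a CPU
gpu_speedup = 200.0        # GPUs quoted as 200-250x faster than CPUs
tpu_over_gpu = 10.0        # TPUs quoted as ~10x faster than GPUs

gpu_time = cpu_time / gpu_speedup
tpu_time = gpu_time / tpu_over_gpu

print(f"CPU: {cpu_time:.1f}, GPU: {gpu_time:.1f}, TPU: {tpu_time:.2f}")
# Chaining the two estimates, the same workload would run ~2000x
# faster on a TPU than on a CPU.
```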
TPUs will likely be used for Neural Networks in the future because they are designed for exactly this purpose, decreasing the overall training cost of deep neural networks. We expect them to become available for general-purpose use at an affordable price. They can also cover a broad range of machine learning models and be used in other areas of Artificial Intelligence (AI), including intelligent cameras. Ideally, they should also be elastic and adaptable to future technologies, including quantum computers.
This paper presents a performance analysis and a CPU vs. GPU comparison for Deep Learning. The performance tests were conducted using a deep learning application that classifies web pages, and several performance-related hyperparameters were examined. The tests were carried out on both CPU and GPU servers operating in the cloud, with test cases varying CPU specifications, batch size, hidden layer size, and transfer learning. According to the findings, increasing the number of cores reduces running time, and increasing the core operating frequency likewise boosts the system's speed. Increasing the batch size grows the parallel workload available to the processor; on this point, tests done on the GPU with a large batch size show that the system is accelerated. In the trials in which the system learns the word vectors, the success rates increase only gradually, so a large number of training epochs may be required to achieve the desired level of success. Even with few epochs, transfer learning can produce a model with a high success rate. Overall, all of the tests ran faster on the GPU than on the CPU.
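Why larger batches favor the GPU can be illustrated with a small NumPy sketch of a toy dense layer (the sizes here are arbitrary, not the paper's): processing a whole batch as one large matrix product exposes far more parallel work per operation than looping over samples one at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_dim, out_dim = 256, 512, 128   # arbitrary toy sizes
x = rng.standard_normal((batch, in_dim))
w = rng.standard_normal((in_dim, out_dim))

# Sample-at-a-time: many small matrix-vector products, little parallel work each.
per_sample = np.stack([xi @ w for xi in x])

# Whole batch at once: one large matrix-matrix product the hardware can parallelize.
batched = x @ w

assert np.allclose(per_sample, batched)  # identical result, very different parallelism
```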
One significant drawback of GPUs is that they were originally built to implement graphics pipelines rather than deep learning; they were adopted for deep learning because it relies on the same type of computation (matrix multiplications). The experiments also tested several performance factors, such as batch size, hidden layer size, and transfer learning. The main drawback of adding hidden layers is that as their number grows, so does the number of parameters that must be learned, which lengthens training. For the web page classification problem, increasing the number of layers extends the training period but does not affect the success rate. Another drawback is that in the tests in which the system learns the word vectors, the success rates increase only slowly, so more epochs of training are required to reach the target success levels. As the number of training epochs increases, computation time increases, causing significant heating, and it can also cause overfitting. Overfitting is the term used in data science when a statistical model fits its training data too closely: the model memorizes the noise, tracks the training set too tightly, and cannot generalize adequately to new data.
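Overfitting can be demonstrated with a tiny NumPy sketch (a toy 1-D regression, not the paper's web-page classifier): a high-degree polynomial drives training error toward zero by memorizing the noise, which is exactly the failure mode described above.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(20)  # noisy target

def train_error(degree):
    # Least-squares polynomial fit; mean squared error on the training set.
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple, complex_ = train_error(1), train_error(15)
# The degree-15 model fits the training noise almost perfectly...
assert complex_ < simple
# ...but such a model typically generalizes poorly to unseen data (overfitting).
```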
Optimizing the performance analysis to improve the success rates is planned for subsequent experiments, which will take place shortly. The next big step after GPUs is Google's TPU (Tensor Processing Unit). The TPU is 15x to 30x faster than contemporary GPUs and CPUs on production AI applications that use neural network inference. The TPU also outperforms traditional processors in energy efficiency, with a 30x to 80x improvement in TOPS/Watt (tera-operations [trillion, or 10^12, operations] of processing per Watt of energy consumed).
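The TOPS/Watt figure above is simply throughput divided by power. A minimal worked example (the concrete throughput and power numbers below are hypothetical placeholders chosen only to show the arithmetic; the paper reports only the 30x-80x ratio):

```python
# TOPS/Watt = tera-operations per second of throughput per watt of power drawn.
def tops_per_watt(tera_ops_per_sec, watts):
    return tera_ops_per_sec / watts

# Hypothetical figures for illustration only.
accelerator = tops_per_watt(92.0, 40.0)   # e.g. 92 TOPS at 40 W
baseline = tops_per_watt(3.0, 100.0)      # e.g. 3 TOPS at 100 W
print(f"efficiency ratio: {accelerator / baseline:.0f}x")
```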
This study explains why TPUs are useful and compares their performance against GPUs and CPUs on well-known Neural Network benchmarks. TPUs have an advantage in Deep Neural Networks for several reasons, including the fact that their specially designed Matrix Multiply Unit (the heart of the TPU) performs the matrix multiplication of 2D arrays in O(n) time, whereas traditional methods take O(n^3) time for the same multiplication. Furthermore, the TPU's hardware components are designed to keep the Matrix Multiply Unit busy at all times to get the best out of it. TPUs are also not intimately integrated with the CPU; instead, they connect to existing servers over a PCI Express I/O bus, similar to GPUs. The article also evaluates the performance of the TPU, a Haswell CPU, and a K80 GPU on ML applications such as MLPs, LSTMs, and CNNs. According to the findings, the TPU's TDP (Thermal Design Power) per chip is substantially lower than that of the CPU and GPU, and the TPU outperforms both in roofline performance (i.e., TeraOps/sec). Although the actual cost of the TPU and its cost-to-performance ratio were not disclosed, the TPU is expected to excel there as well. Finally, this work aims to convey the significance of DSAs (Domain-Specific Architectures) such as the TPU and how they can help complete specific tasks.
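The roofline comparison mentioned above can be sketched numerically: attainable throughput is the minimum of the chip's peak compute rate and what its memory bandwidth can feed, as a function of operational intensity (operations performed per byte moved). The peak and bandwidth figures below are hypothetical placeholders, not the paper's measurements:

```python
def roofline(peak_teraops, bandwidth_gb_s, intensity_ops_per_byte):
    """Attainable TeraOps/sec under the roofline model:
    min(compute ceiling, memory-bandwidth slope * operational intensity)."""
    memory_bound = bandwidth_gb_s * intensity_ops_per_byte / 1000.0  # GB/s * ops/B -> TeraOps/s
    return min(peak_teraops, memory_bound)

# Hypothetical chip: 90 TeraOps/s peak compute, 34 GB/s of memory bandwidth.
peak, bw = 90.0, 34.0
low = roofline(peak, bw, 100.0)     # low intensity: memory-bound, well under the peak
high = roofline(peak, bw, 5000.0)   # high intensity: compute-bound, hits the ceiling
assert low < peak and high == peak
```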
The TPU discussed in this research is a first-generation TPU (TPUv1), which can only run inference on a model that has already been trained; it cannot train a new machine learning model from scratch, and the TPU architecture needs further improvement to do this. TPUs also quantize standard IEEE 32-bit floating-point numbers to 8-bit integers, which saves energy and reduces chip size at the cost of lower precision. Furthermore, the cost of TPUs is not disclosed, making it difficult to determine the optimum design for us. Another limitation is that the TPU is a domain-specific architecture, so it drops features required by CPUs and GPUs that Deep Neural Networks do not use, and it excels mainly in image processing applications. Also, the TPU available at the time only supported TensorFlow; it did not support other Python libraries such as Keras.
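The float32-to-int8 conversion described above is quantization: real values are mapped onto 256 integer levels via a scale factor, trading precision for smaller, cheaper arithmetic. Here is a minimal symmetric-quantization sketch in NumPy (a generic scheme chosen for illustration; the paper does not specify the TPU's exact mapping):

```python
import numpy as np

def quantize_int8(x):
    # Symmetric linear quantization: map [-max|x|, +max|x|] onto [-127, 127].
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(dequantize(q, scale) - weights).max()
# Rounding to the nearest level costs at most half a quantization step.
assert error <= scale / 2 + 1e-6
```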
We may use improved versions of TPUs, such as TPUv2 or TPUv3, to train machine learning models from scratch. These have a large heat sink and four chips per board instead of one, and can be used for training as well as inference. Furthermore, instead of quantizing to 8-bit integers, they compute with 16-bit (bfloat16) floating-point numbers, giving a better balance of precision and power usage. TPUs could also be developed to outperform on non-neural-network applications. Finally, TPUs need widespread library support for more work to be done on them.
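The 16-bit format used by later TPUs, bfloat16, keeps float32's 8 exponent bits (so the full dynamic range) and shortens the mantissa to 7 bits. The reduction can be mimicked in NumPy by zeroing the low 16 bits of the float32 bit pattern (a simplification: real hardware typically rounds rather than truncates):

```python
import numpy as np

def to_bfloat16(x):
    """Truncate float32 values to bfloat16 precision by zeroing the low 16 bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(3.14159265)
approx = to_bfloat16(x)
# Same dynamic range as float32, but only ~2-3 decimal digits of precision.
rel_err = abs(float(approx) - float(x)) / float(x)
assert rel_err < 2 ** -7   # 7-bit mantissa => relative error below ~1/128
```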
In Paper 1, we compared the hardware structures of the CPU, GPU, and TPU and traced the evolution of TPUs over the years.
In Paper 2, we compared the performance of the CPU and GPU by testing them on a Web Page Classification dataset using a Recurrent Neural Network architecture.
In Paper 3, we compared the performance of the CPU, GPU, and TPU on standard deep learning applications such as Multi-Layer Perceptrons, Long Short-Term Memory networks, and Convolutional Neural Networks.
In the Methods of Paper 1, the hardware structures of the TPU, GPU, and CPU are explained in detail. We found that a CPU uses a single 1D array to execute one instruction at a time, a GPU uses multiple 1D arrays to execute one instruction at a time, and a TPU uses a single 2D matrix to execute one instruction at a time.
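The scalar/vector/matrix distinction can be made concrete with a NumPy sketch: the same matrix product expressed as an element-by-element loop (CPU-style), as a series of 1-D vector products (GPU-style), and as a single 2-D operation (TPU-style). This is only an analogy for the execution models, not how the chips are actually programmed:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))

# CPU-style: scalar execution, one multiply-add at a time.
scalar = np.zeros((64, 64))
for i in range(64):
    for j in range(64):
        for k in range(64):
            scalar[i, j] += A[i, k] * B[k, j]

# GPU-style: 1-D vector operations, one whole column product per step.
vector = np.stack([A @ B[:, j] for j in range(64)], axis=1)

# TPU-style: the entire 2-D matrix product issued as a single operation.
matrix = A @ B

assert np.allclose(scalar, vector) and np.allclose(vector, matrix)
```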
In the Methods of Paper 2, we prepared the dataset and pre-processed it to achieve the best results, using an RNN architecture, word embeddings, and transfer learning across the test cases.
In the Methods of Paper 3, we explored the fundamentals of the TPU, Moore's Law, matrix multiplication in the TPU, its design analysis, and the reasons why it outperforms the others.
In Paper 1, we classified the hardware structures of the CPU, GPU, and TPU. We discovered that CPUs are far less efficient than GPUs for Deep Learning and neural networks, whereas TPUs are designed specifically for these tasks; as a result, TPUs are extremely fast compared to the CPU and GPU.
In Paper 2, we compared the performance of the CPU and GPU using test cases covering different CPU specifications, batch sizes, hidden layer sizes, and transfer learning, and discovered that the GPU is significantly faster than the CPU.
In Paper 3, we examined the CPU, GPU, and TPU on the given neural network designs and discovered that the TPU outperforms both in performance, power consumption, and chip area.
In this article, we have discussed how the technology has evolved so that training and inference of deep learning models can become faster and more accurate. We have seen that TPUs came with a specialized Matrix Multiply Unit that performs matrix multiplication in linear time, whereas a traditional CPU can take cubic time to complete the same multiplication.
Nowadays, both GPUs and TPUs come in a plug-and-play model, i.e., we can plug them into existing CPU hardware using PCI Express (PCIe) ports. This enables horizontal and vertical scaling of these devices, so we can add or remove them according to our computational requirements.
Key takeaways of this article:
1. The first paper discussed the fundamental differences between CPUs, GPUs, and TPUs in processing power and basic architectural design.
2. The second paper discussed Graphics Processing Units, especially in Deep Learning applications. It also studied how performance changes with batch size, number of epochs, learning rate, etc.
3. The third paper deeply discusses the architectural design of the Tensor Processing Unit: its power consumption, thermal design, area requirements, etc.
4. Lastly, we performed a basic comparison among all the research papers.
That is all for today. I hope you have enjoyed the article. You can also connect with me on LinkedIn.
Do check out my other articles as well.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.