Scaling: Transformers, Laws, and Challenges

Drishti Sharma 08 Aug, 2022 • 7 min read

This article was published as a part of the Data Science Blogathon.

Introduction

The other day, I was reading “Natural Language Processing with Transformers,” a book by Lewis Tunstall, Leandro von Werra, and Thomas Wolf, where I came across scaling laws and the challenges associated with scaling Transformer models. The topic also included excerpts from Richard Sutton’s startling essay “The Bitter Lesson,” which argues that building human knowledge into a method can complicate it, making it less suited to taking advantage of general methods that leverage computation. The mention of embedding human knowledge piqued my interest and made me want to delve deeper into the matter. It reminded me of a conversation I had with a respected AI researcher on Twitter earlier this year.

Scaling

Source: Canva

In this post, we will be exploring:

What are the potential drawbacks of embedding human knowledge of the domain directly into models rather than leveraging computation?
Why is scaling required?
What are scaling laws?
What are the challenges with scaling Transformers?

Now let’s dive in!

Drawbacks of encoding human knowledge directly into the model

The following are the potential drawbacks of embedding human knowledge directly into the model rather than leveraging computation:

1) Leveraging human knowledge leads to quicker gains, but in the long run, methods that leverage computation win out: In 2019, the researcher Richard Sutton published a startling essay titled “The Bitter Lesson,” in which he stated that general methods that leverage computation are ultimately the most effective, and by a large margin. Leveraging human knowledge of the domain can offer quick gains in the short term, but methods that leverage computation outperform it in the long run. Furthermore, the human-knowledge approach often complicates methods, making them less suited to taking advantage of general methods that leverage computation.

In games like chess and Go, the approach of encoding human knowledge within AI systems was eventually outdone by greater computation. Sutton refers to this as the “bitter lesson” for the AI research community. There are signs that a similar lesson is at play with Transformers: while many early BERT and GPT descendants focused on tweaking the architecture or pretraining objectives, the best-performing models, like GPT-3, are essentially scaled-up versions of the original models without many architectural modifications.

2) Human biases: If we encode human knowledge directly into the model, we might also be encoding our biases. Biases can stem from a person’s experience, knowledge, and discretion, so data curation is a highly subjective and sensitive matter. In fact, the entire ML pipeline can be biased. However, using a massive pool of data at least gives a semblance of being less biased than hardwiring human knowledge directly.

3) Exploration vs. exploitation: While studying reinforcement learning, I came across this concept, and it resonated with me. Although we are talking specifically about Transformers, the concept applies here too. While exploiting our existing knowledge can give us quick gains, exploration by leveraging vast datasets and scaling the model could lead to long-term benefits and help us explore the unexplored.

4) Is our existing understanding of various domains absolutely correct? The dilemma is whether our existing knowledge is perfect enough that we can rely on it alone and avoid further exploration. Often, even scientists do not claim something as “absolute truth.” They usually qualify their results, saying something like: based on these (limited set of) observations, it appears to be […….]. Furthermore, members of the scientific community may hold opposing views based on different experimental outcomes and limited observations.

Why is Scaling required?

Empirical evidence suggests that large language models perform better on downstream tasks, and capabilities such as zero-shot and few-shot learning emerge in the 10- to 100-billion-parameter range. However, the number of parameters is not the only factor affecting a model’s performance. The performance of language models seems to obey a power-law relationship with model size and other factors, so the amount of compute and training data needs to be scaled in tandem to train these models.

In Fig. 1, we can see a timeline of the development of the largest models since the release of the original Transformer architecture in 2017, illustrating that model size has increased by over four orders of magnitude in just a few years!

Language model parameter counts

Fig. 1 Parameter counts over time for prominent Transformer architectures

(Source: https://bit.ly/3Ovuvdy)

Scaling Laws

Scaling laws allow us to empirically quantify the “bigger is better” paradigm for language models by studying their behavior with varying compute budget C, dataset size D, and model size N. The key idea is to plot the dependence of the cross-entropy loss L on these three factors and see if a relationship emerges. For autoregressive models like those in the GPT family, the resulting loss curves are shown in Fig. 2, where each blue curve represents the training run of a single model.

We can infer a few things from these loss curves:

1. The relationship between performance and scale: Although hyperparameter optimization is typically emphasized to improve performance on a fixed set of datasets, scaling laws suggest a more productive path toward better models is to focus on increasing N, C, and D in tandem.

2. Smooth Power Laws: The test loss L has a power-law relationship with each of N, C, and D across several orders of magnitude (power-law relationships are linear on a log-log scale). For X = N, C, D, these power-law relationships can be expressed as follows:

$$L(X) \sim \frac{1}{X^{\alpha}}$$

where α is the scaling exponent, determined by a fit to the loss curves.

Typical values of α lie in the 0.05–0.095 range, and one salient feature of these power laws is that the early part of a loss curve can be extrapolated to predict the approximate loss if training were conducted for much longer.

3. Sample efficiency: Large models can reach the same performance as smaller models in fewer training steps. This can be seen by comparing the regions where a loss curve plateaus over some number of training steps, which indicates that training longer gives diminishing returns in performance compared to simply scaling up the model. Fig. 3 illustrates the scaling laws for modalities like images, videos, mathematical problem solving, language, etc.

Fig. 3 Scaling laws across different modalities
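The extrapolation idea from the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data points are synthetic and noise-free (they are not measurements from any real model), the constants `true_alpha` and `true_a` are made up, and the helper names `fit_power_law` and `predict_loss` are hypothetical. The sketch fits a straight line to the loss on a log-log scale, which is what a power law of the form L = a * X^(-alpha) looks like there, and then evaluates the fitted curve at a much larger scale.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss = a * x**(-alpha) via a straight-line fit on a log-log scale."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    alpha = -slope            # the slope of the log-log line is -alpha
    a = np.exp(intercept)     # the intercept of the log-log line is log(a)
    return alpha, a


def predict_loss(x, alpha, a):
    """Evaluate the fitted power law at a new scale x."""
    return a * np.asarray(x, dtype=float) ** (-alpha)


# Synthetic, noise-free points for illustration only (assumed values).
true_alpha, true_a = 0.076, 12.0
model_sizes = np.array([1e6, 1e7, 1e8, 1e9])       # "small" runs we can afford
losses = true_a * model_sizes ** (-true_alpha)

alpha, a = fit_power_law(model_sizes, losses)
extrapolated = float(predict_loss(1e11, alpha, a))  # a scale we never trained at

print(f"fitted alpha = {alpha:.3f}")
print(f"extrapolated loss at 1e11 parameters = {extrapolated:.3f}")
```

With real training runs, the points would be noisy and the fit would only hold over the range where the power law applies, so an extrapolation like this is an estimate rather than a guarantee.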

It is still unclear whether power-law scaling is a universal property of Transformer language models. Given the cost constraints, though, it is extremely desirable to have an estimate of a model’s performance in advance. Presently, scaling laws are used as a tool to extrapolate the performance of large, expensive models without having to train them explicitly. However, scaling isn’t as simple as it sounds. Let’s take a look at some of the challenges with scaling.

Challenges with Scaling

While scaling up a model sounds easy in theory, it poses numerous challenges in practice. The following are some of the most significant challenges one is likely to encounter when scaling language models:

1. Bias: Scaling can escalate problems related to model bias and toxicity, which often emerge from the data used to train the models. Large language models can reinforce prejudices and potentially impact individuals and communities disproportionately. Moreover, these models can produce false or misleading information, or outright disinformation.

2. Infrastructure: Provisioning and managing infrastructure that potentially spans hundreds or thousands of nodes with as many GPUs is not for the faint-hearted. Are the required number of nodes available? Is communication between nodes a bottleneck? Tackling these issues requires a very different skill set from that found in most data science teams. It typically involves specialized engineers who run large-scale, distributed experiments.

3. Dataset Curation: A model is only as good as the data it is trained on, and large models are data hungry. When dealing with terabytes of text data, ensuring that the dataset contains high-quality, bias-free text becomes supremely challenging. Even the processing becomes difficult. Other challenges include the licensing of the training data and the personal information it may contain.

4. Training Cost: Most companies can’t afford the teams and resources required to train models at the largest scales. Given the cost constraint, estimating the model’s performance in advance is extremely desirable.

5. Model Evaluation: Evaluating the model on downstream tasks requires time and resources. Additionally, the model needs to be probed for biased and toxic generations. These steps take time and must be carried out thoroughly to mitigate the risks of adverse effects later.

6. Reproducibility: AI already had a reproducibility problem. Researchers often publish benchmark results instead of source code, which becomes problematic when the thoroughness of the benchmarks is questioned. The massive compute required to evaluate large language models exacerbates the problem.

7. Explainability: We have often seen researchers claim that ‘new numbers of parameters in our system yielded this new performance on this benchmark,’ but it’s tough to explain exactly why the system achieves this!

8. Benchmarking: Even with enough compute resources, benchmarking large language models is tedious. Some experts contend that popular benchmarks poorly estimate real-world performance and fail to account for the broader ethical, technical, and societal implications. For example, one recent study found that 60% to 70% of answers given by natural language processing models were encoded in the benchmark training sets, indicating that the models were just memorizing answers. Considering this, ways of measuring the performance of these systems need to be expanded … When the benchmarks are changed a little bit, the models often don’t generalize well.

9. Deployment: Serving large language models is a major challenge. Techniques like distillation, pruning, and quantization help in this regard, but they may not be enough for a model that is hundreds of gigabytes in size. Hosting services such as the OpenAI API or Hugging Face’s Accelerated Inference API assist companies that cannot or do not want to deal with these deployment challenges.

10. Fixing a mistake is a costly affair: The cost of training makes retraining infeasible. Even OpenAI, which receives massive funding from Microsoft, struggles with this and chose not to fix a mistake discovered when GPT-3 was implemented. Most companies cannot afford the teams and resources necessary to train or retrain models at the largest scales. Training a single GPT-3-sized model can cost several million dollars, which is not the kind of money most companies have.
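To make the training-cost and deployment points above a little more tangible, here is a back-of-envelope sketch in Python. Everything in it is an assumption rather than a reported figure: it uses the commonly quoted rule of thumb that training compute is roughly 6 * N * D floating-point operations (N parameters, D training tokens), a GPT-3-sized model with 175 billion parameters trained on an assumed 300 billion tokens, and entirely hypothetical hardware numbers (peak throughput, utilization, and hourly price). The function names are illustrative, too.

```python
def training_flops(n_params, n_tokens):
    """Rough training compute using the C ~ 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * n_params * n_tokens


def training_cost_usd(flops, peak_flops_per_gpu, utilization, usd_per_gpu_hour):
    """Convert a FLOP budget into dollars for a hypothetical GPU and price."""
    gpu_seconds = flops / (peak_flops_per_gpu * utilization)
    gpu_hours = gpu_seconds / 3600.0
    return gpu_hours * usd_per_gpu_hour


def weights_size_gb(n_params, bytes_per_param):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumed model scale: a GPT-3-sized model.
n_params = 175e9
n_tokens = 300e9   # assumption about the number of training tokens

flops = training_flops(n_params, n_tokens)

# Hypothetical hardware: 100 TFLOP/s peak, 40% utilization, 2 USD per GPU-hour.
cost = training_cost_usd(flops, peak_flops_per_gpu=100e12,
                         utilization=0.4, usd_per_gpu_hour=2.0)

fp16_gb = weights_size_gb(n_params, bytes_per_param=2)   # 16-bit weights
int8_gb = weights_size_gb(n_params, bytes_per_param=1)   # 8-bit quantized weights

print(f"training compute: {flops:.2e} FLOPs")
print(f"estimated cost:   {cost:,.0f} USD")
print(f"weights in fp16:  {fp16_gb:.0f} GB, in int8: {int8_gb:.0f} GB")
```

Under these made-up numbers, a single training run lands in the range of a few million dollars, and even 8-bit weights need well over a hundred gigabytes of memory, which is consistent with the order of magnitude discussed above. Changing any of the assumed inputs shifts the result accordingly.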

Constraints Foster Creativity

Resource constraints can lead to novel solutions with implications beyond the problem they were originally designed to solve. For example, DeepMind published a research paper on a language model called RETRO, which is claimed to outperform other models 25 times its size by using external memory techniques.

Conclusion

Leveraging human knowledge of the domain can offer quick gains in the short term, but methods that leverage computation outperform it in the long run. The human-knowledge approach often complicates methods, making them less suited to taking advantage of general methods that leverage computation. Scaling laws allow us to empirically quantify the “bigger is better” paradigm for language models by studying their behavior with varying compute budget, dataset size, and model size. They are also used to extrapolate the performance of large, expensive models without having to train them explicitly.

To sum it up, in this post, we learned about:

1. The potential drawbacks of embedding human knowledge of the domain directly into models rather than leveraging computation.

2. The importance of scaling Transformer models.

3. Scaling laws.

4. The challenges associated with scaling the models.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
