The world needs something better than the Transformer. I believe that everyone here hopes it can be replaced by something that will take us to a new level of performance. This article shares the conversation between NVIDIA CEO Jensen Huang and the authors of the landmark Transformer paper about the future of language models. The article is sourced from Tencent Technology and compiled by DeepTech.
Summary:
This article collects the most striking viewpoints expressed during the conversation.
Transcript:
Introduction to the Transformer
Why the Transformer was created
The problems the Transformer set out to solve
Reasons for founding their companies
In 2017, a groundbreaking paper titled “Attention Is All You Need” introduced the Transformer model, built on the self-attention mechanism. This innovative architecture broke free from the constraints of traditional RNN and CNN models: by applying attention to a whole sequence in parallel, it effectively overcame the challenge of long-range dependencies and dramatically sped up the processing of sequence data. The Transformer’s encoder-decoder structure and multi-head attention mechanism created a storm in the field of artificial intelligence, and the hugely popular ChatGPT is built on this architecture.
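To make the idea of parallel attention concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the Transformer. The toy dimensions and the reuse of raw token vectors as queries, keys, and values are illustrative simplifications, not the setup used in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d_k) matrices of queries, keys, and values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # every position compares against every other position at once
    weights = softmax(scores, axis=-1)   # attention weights, one row per query position
    return weights @ V                   # weighted sum of values

# Toy example: a 4-token "sentence" with 8-dimensional representations.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
# In a real Transformer, Q, K, and V come from learned linear projections of the tokens;
# here we reuse the token vectors directly just to show the data flow.
output = scaled_dot_product_attention(tokens, tokens, tokens)
print(output.shape)  # (4, 8): one updated representation per token, computed in parallel
```

In the full model, multi-head attention runs several such attention operations in parallel over different learned projections of the input and concatenates their outputs, which is what lets every position attend to every other position in a single pass.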
Imagine the Transformer model as your brain when you are conversing with a friend: it can pay attention to every word spoken at once and understand the connections between them, giving computers a much more human-like grasp of language. Before this, RNNs were the mainstream approach to language processing, but they processed information slowly, like an old-fashioned tape player that has to play word by word. The Transformer, by contrast, is like an efficient DJ who controls multiple tracks at once and quickly picks out the key information.
The emergence of the Transformer model significantly enhanced the computer’s language processing capabilities, making tasks such as machine translation, speech recognition, and text summarization more efficient and accurate. This is a huge leap for the entire industry.
This innovative achievement was the result of the collective efforts of eight AI scientists, all working at Google at the time. Their initial goal was simple: to improve Google’s machine translation service. They wanted machines to read and understand entire sentences in context, rather than translating word by word in isolation. That idea became the starting point of the Transformer architecture: the self-attention mechanism. Building on it, the eight authors, each contributing their own expertise, published the paper “Attention Is All You Need” in 2017, detailing the Transformer architecture and opening a new chapter in generative AI.
In the world of generative AI, the scaling law is a core principle. In simple terms, as a Transformer model grows in size, its performance improves as well. But that also means ever more powerful computing resources are needed to support larger models and deeper networks, and NVIDIA, which supplies that high-performance compute, has become a key player in this AI wave.
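The scaling law mentioned here is usually stated as an empirical power law. One commonly cited form, from Kaplan et al. (2020) and shown purely as an illustration rather than as something quoted in the panel, relates test loss to parameter count:

```latex
% Empirical scaling law (illustrative): test loss L falls as a power law in model size N.
% N_c and \alpha_N are fitted constants; Kaplan et al. (2020) report \alpha_N \approx 0.076.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```

Each constant-factor reduction in loss therefore requires a multiplicative increase in model size, and with it in compute, which is exactly the dynamic that makes hardware suppliers central to the story.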
At this year’s GTC conference, NVIDIA CEO Jensen Huang invited the authors of the Transformer paper to a panel discussion; seven of them attended (Niki Parmar could not make it due to unforeseen circumstances). It was the first time the paper’s authors had appeared together in public.
The world needs something better than the Transformer. I believe that everyone here hopes it can be replaced by something that will take us to a new level of performance.
We did not achieve our original goal. We started the Transformer because we wanted to model the evolution of tokens: not just a linear generation process, but a step-by-step evolution of text or code.
A simple question like “2+2” can end up consuming the computational resources of a model with trillions of parameters. I think adaptive computation is one of the things that has to happen next: we need to know how much compute to spend on a given problem.
I think the current models are too cheap and still too small in scale. At roughly one dollar per million tokens, they are about 100 times cheaper than buying a paperback book.
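As a back-of-the-envelope check on that comparison (the book figures below are illustrative assumptions, not numbers from the panel), a typical paperback holds on the order of a hundred thousand tokens and sells for roughly ten to fifteen dollars, which works out to:

```latex
% Illustrative check of the "100x cheaper than a paperback" claim (assumed book figures).
\frac{\$12}{10^{5}\ \text{tokens}} \approx \$120\ \text{per } 10^{6}\ \text{tokens}
\quad \text{vs.} \quad \$1\ \text{per } 10^{6}\ \text{tokens},
\;\text{roughly a } 100\times \text{ difference.}
```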
Jensen Huang: Over the past 60 years, computer technology does not seem to have changed fundamentally, at least not since I was born. The computer systems we use today, whether in their multitasking, the separation of hardware from software, software compatibility, data backup, or software engineering practices, are essentially based on the design principles of the IBM System/360: central processing units, I/O subsystems, multitasking, hardware-software separation, software compatibility across systems, and so on.
I believe modern computing has not changed fundamentally since 1964. There was a major shift in the 1980s and 1990s that gave computing the shape we know today, and over time the marginal cost of computing kept falling: ten times every decade, a thousand times every fifteen years, ten thousand times every twenty years. In this computing revolution the cost reduction was so dramatic that over twenty years the cost of computing fell by nearly ten thousand times, and that change gave society tremendous momentum.
Imagine if everything expensive in your life dropped to one ten-thousandth of its original cost, so that a car you bought for $200,000 twenty years ago now costs only a dollar. Can you imagine that change? But the decline in computing costs did not happen overnight; it accumulated gradually until it reached a turning point, and then the trend suddenly stalled. Costs still improve a little every year, but the rate of change has essentially flattened.
We started exploring accelerated computing, but accelerated computing is not easy to use. You have to redesign things step by step: problems we used to solve by following established sequential procedures now have to be rethought. It is an entirely new discipline, recasting the old sequential rules as parallel algorithms.
We realized this and believed that if we could accelerate even the 1% of the code that accounts for 99% of the execution time, there would certainly be applications that benefit. Our goal is to make the impossible possible, or to make the possible far more efficient. That is the meaning of accelerated computing.
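The logic of accelerating the small fraction of code that dominates execution time is essentially Amdahl’s law. As a rough illustration (the 100x acceleration factor below is an assumption for the example, not a figure from the talk):

```latex
% Amdahl's law: overall speedup S when a fraction p of the runtime is sped up by a factor s.
S = \frac{1}{(1 - p) + \frac{p}{s}}, \qquad
p = 0.99,\; s = 100 \;\Rightarrow\; S = \frac{1}{0.01 + 0.0099} \approx 50
```

Accelerating the hot 1% of the code by 100x thus yields roughly a 50x end-to-end speedup, which is the bet accelerated computing makes.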
Looking back at the company’s history, we found that we could accelerate all kinds of applications. Initially we achieved such significant acceleration in gaming that people mistook us for a gaming company. Our goal went well beyond that, but gaming turned out to be a huge market that drove incredible technological progress. A case like that is rare, but we found this exceptional one.
Long story short, in 2012, AlexNet sparked the fire, marking the first encounter between artificial intelligence and NVIDIA GPUs. This marked the beginning of our magical journey in this field. A few years later, we found a perfect application scenario that laid the foundation for our current development.
In short, these achievements have laid the foundation for the development of generative AI. Generative AI can not only recognize images but also transform text into images and even create new content. Now, we have sufficient technical capabilities to understand pixels, identify them, and comprehend their meanings. Through these meanings, we can create new content. The ability of artificial intelligence to understand the meaning behind data is a tremendous transformation.
We have reason to believe that this is the beginning of a new industrial revolution. In this revolution, we are creating something unprecedented. For example, in previous industrial revolutions, water was the source of energy. Water enters the devices we create, and the generators start working. Water in, electricity out, like magic.
Generative AI is a new kind of “software” that can also create software. It relies on the collective efforts of numerous scientists. Imagine giving AI raw materials – data – and they enter a “building” – what we call GPU machines – and magical results come out. It is reshaping everything. We are witnessing the birth of an “AI factory.”
This transformation can be called the beginning of a new industrial revolution. We have never really experienced a shift like this before, but it is now slowly unfolding in front of us. Don’t miss the next decade, because in these ten years we will create enormous productivity. The clock has started ticking, and our researchers have already begun to act.
Today, we have invited the creators of the Transformer to discuss where generative AI will lead us.
Ashish Vaswani: Joined Google Brain in 2016. In April 2022, he co-founded Adept AI with Niki Parmar, and in December of that year he left to co-found another AI startup, Essential AI.
Niki Parmar: Worked at Google Brain for four years, co-founded Adept AI and Essential AI with Ashish Vaswani.
Jakob Uszkoreit: Worked at Google from 2008 to 2021. In 2021 he left Google and co-founded Inceptive, a company applying AI to the life sciences with the aim of designing the next generation of RNA molecules using neural networks and high-throughput experiments.
Illia Polosukhin: Joined Google in 2014 and was one of the first to leave the eight-person team. In 2017, he co-founded the blockchain company NEAR Protocol with others.
Noam Shazeer: Previously worked at Google from 2000 to 2009 and 2012 to 2021. In 2021, Shazeer left Google and co-founded Character.AI with former Google engineer Daniel De Freitas.
Llion Jones: Worked at Delcam and YouTube. Joined Google in 2012 as a software engineer. Later, he left Google and founded an AI startup called sakana.ai.
Lukasz Kaiser: Former researcher at the French National Center for Scientific Research. Joined Google in 2013 and left in 2021 to become a researcher at OpenAI.
Aidan Gomez: Graduated from the University of Toronto in Canada. He was an intern at Google Brain when the Transformer paper was published. He was the second person from the eight-person team to leave Google. In 2019, he co-founded Cohere with others.
Jensen Huang: Today, I invite everyone to actively seize the opportunity to speak. There is no topic that cannot be discussed here. You can even jump up from your chair to discuss the problem. Let’s start with the most basic question. What problems did you encounter at the time, and what inspired you to create the Transformer?
Illia Polosukhin: If you want to ship a model that can actually read search results, processing piles of documents, you need a model that can digest that information quickly. At the time, recurrent neural networks (RNNs) could not meet that requirement.
Indeed, even though RNNs paired with the early attention mechanisms of the day were attracting interest, they still had to read word by word, which was not efficient.
Jakob Uszkoreit: We were generating training data far faster than our most advanced architectures could consume it. In fact, we used simpler architectures, such as feed-forward networks that took n-grams as input features. Because they trained so much faster, these architectures usually outperformed the more complex, more advanced models on Google’s large-scale training data.
Powerful RNNs, especially long short-term memory networks (LSTMs), already existed at that time.
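To illustrate the kind of simple, fast-to-train baseline Uszkoreit describes, here is a minimal sketch (forward pass only, untrained) of a feed-forward network over hashed bag-of-n-gram features. The hashing trick, the dimensions, and the toy inputs are assumptions made for the example, not details of Google’s actual systems.

```python
import numpy as np

def ngram_features(text, n=2, dim=2048):
    # Hash word n-grams into a fixed-size bag-of-features vector (the "hashing trick").
    # Note: Python's built-in hash() for strings varies between runs unless PYTHONHASHSEED is set.
    words = text.lower().split()
    vec = np.zeros(dim)
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        vec[hash(gram) % dim] += 1.0
    return vec

class TinyFeedForward:
    # A single hidden layer over n-gram features: simple, cheap, and very fast to train,
    # which is why such baselines stayed competitive at very large data volumes.
    def __init__(self, dim=2048, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.01, size=(dim, hidden))
        self.W2 = rng.normal(scale=0.01, size=(hidden, 1))

    def forward(self, x):
        h = np.maximum(0.0, x @ self.W1)              # ReLU hidden layer
        return 1.0 / (1.0 + np.exp(-(h @ self.W2)))   # sigmoid score

# Usage sketch: score two toy "documents" with the untrained network.
model = TinyFeedForward()
for text in ["the cat sat on the mat", "attention is all you need"]:
    score = model.forward(ngram_features(text))
    print(f"{text!r} -> {score[0]:.3f}")
```

The point of the sketch is only that a feed-forward pass over precomputed n-gram features is trivially parallel, whereas an RNN such as an LSTM must walk the sequence one token at a time.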
Noam Shazeer: It felt like an urgent problem that needed solving. Around 2015 we began to notice these scaling laws: as the model gets bigger, it gets smarter. It was like the best problem in the history of the world, and it became a race to see who could build larger models.