Today, virtually every cutting-edge AI product and model uses a transformer architecture. Large language models (LLMs) such as GPT-4o, Llama, Gemini and Claude are all transformer-based, and other AI applications such as text-to-speech, automatic speech recognition, image generation and text-to-video models rely on transformers as their underlying technology.
With the hype around AI unlikely to slow down anytime soon, it is time to give transformers their due, which is why I would like to explain a little about how they work, why they are so important for the growth of scalable solutions and why they are the backbone of LLMs.
Transformers are more than meets the eye
In short, a transformer is a neural network architecture designed to model sequences of data, making it well suited for tasks such as language translation, sentence completion and automatic speech recognition. Transformers have become the dominant architecture for many of these sequence modeling tasks because the underlying attention mechanism can be easily parallelized, allowing for massive scale when training and performing inference.
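As a minimal illustration of these sequence tasks in practice (my own sketch, not from the article), the snippet below uses the Hugging Face `transformers` library with two small, publicly available checkpoints, `gpt2` for sentence completion and `t5-small` for translation; the model choices are just examples.

```python
# Minimal sketch: running two sequence tasks with pretrained transformers.
# Assumes the Hugging Face `transformers` library and PyTorch are installed;
# model names are illustrative, publicly available checkpoints.
from transformers import pipeline

# Sentence completion with a small decoder-only transformer (GPT-2).
generator = pipeline("text-generation", model="gpt2")
print(generator("The transformer architecture is", max_new_tokens=20)[0]["generated_text"])

# English-to-French translation with a small encoder-decoder transformer (T5).
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("I love strawberries.")[0]["translation_text"])
```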
Originally introduced in the 2017 paper "Attention Is All You Need" by researchers at Google, the transformer was presented as an encoder-decoder architecture designed specifically for language translation. The following year, Google released Bidirectional Encoder Representations from Transformers (BERT), which could be considered one of the first LLMs, although it is now considered small by today's standards.
Since then, and especially with the advent of GPT models from OpenAI, the trend has been to train bigger and bigger models with more data, more parameters and longer context windows.
There have been many innovations that facilitate this trend, such as: more advanced GPU hardware and better software for multi-GPU training; techniques like quantization and mixture of experts (MoE) for reducing memory consumption; new optimizers for training, such as Shampoo and AdamW; and techniques for efficiently computing attention, such as FlashAttention and KV caching. The trend will likely continue for the foreseeable future.
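To make one of these efficiency tricks concrete, here is a toy NumPy sketch of the idea behind a KV cache (my own illustration, not tied to any particular library): during autoregressive decoding, keys and values for past tokens are stored and reused, so each new step only computes attention for the newest token instead of recomputing the whole sequence.

```python
# Toy sketch of KV caching during autoregressive decoding (illustrative only;
# a real transformer uses multiple heads, layers and learned projections).
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []            # grows by one entry per generated token

def decode_step(x_t):
    """Attend from the newest token over all cached keys/values."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)        # compute K and V for the new token only
    v_cache.append(x_t @ W_v)
    K = np.stack(k_cache)            # shape (t, d_model)
    V = np.stack(v_cache)
    weights = softmax(q @ K.T / np.sqrt(d_model))
    return weights @ V               # attention output for the new token

# Simulate generating 4 tokens: each step reuses the cache instead of
# reprocessing all previous tokens from scratch.
for t in range(4):
    out = decode_step(rng.normal(size=d_model))
print(out.shape)  # (8,)
```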
The importance of self-attention in transformers
Depending on the application, a transformer model follows an encoder-decoder architecture. The encoder component learns a vector representation of data that can then be used for downstream tasks like classification and sentiment analysis. The decoder component takes a vector or latent representation of the text or image and uses it to generate new text, making it useful for tasks like sentence completion and summarization. For this reason, many familiar state-of-the-art models, such as the GPT family, are decoder-only.
Encoder-decoder models combine both components, making them useful for translation and other sequence-to-sequence tasks. For both encoder and decoder architectures, the core component is the attention layer, as this is what allows a model to retain context from words that appear much earlier in the text.
Attention comes in two flavors: self-attention and cross-attention. Self-attention is used to capture relationships between words within the same sequence, while cross-attention is used to capture relationships between words across two different sequences. Cross-attention connects the encoder and decoder components of a model during translation. For example, it allows the English word "strawberry" to relate to the French word "fraise." Mathematically, both self-attention and cross-attention are different forms of matrix multiplication, which can be done extremely efficiently using a GPU.
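The toy NumPy sketch below (again my own illustration, with learned query/key/value projections omitted for brevity) shows how both flavors reduce to the same scaled dot-product attention, differing only in where the queries and the keys/values come from.

```python
# Toy sketch of scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
# Self- and cross-attention use the same formula; only the inputs differ.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # one matrix multiplication
    return softmax(scores) @ V     # a second matrix multiplication

rng = np.random.default_rng(0)
d_model = 8
x_en = rng.normal(size=(5, d_model))  # 5 English tokens (e.g. "... strawberry ...")
x_fr = rng.normal(size=(6, d_model))  # 6 French tokens (e.g. "... fraise ...")

# Self-attention: queries, keys and values all come from the same sequence.
self_out = attention(x_fr, x_fr, x_fr)

# Cross-attention: queries come from the decoder (French side), while keys and
# values come from the encoder (English side), linking the two sequences.
cross_out = attention(x_fr, x_en, x_en)

print(self_out.shape, cross_out.shape)  # (6, 8) (6, 8)
```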
Thanks to the attention layer, transformers can better capture relationships between words separated by long stretches of text, whereas previous models such as recurrent neural networks (RNNs) and long short-term memory (LSTM) models lose track of the context of words from earlier in the text.
The future of models
Currently, transformers are the dominant architecture for many use cases that require LLMs, and they benefit from the most research and development. Although this seems unlikely to change anytime soon, one different class of model that has recently gained interest is state-space models (SSMs) such as Mamba. These highly efficient algorithms can handle very long sequences of data, whereas transformers are limited by a context window.
For me, the most exciting applications of transformer models are multimodal models. OpenAI's GPT-4o, for instance, is capable of handling text, audio and images, and other providers are starting to follow. Multimodal applications are very diverse, ranging from video captioning to voice cloning to image segmentation (and more). They also present an opportunity to make AI more accessible to people with disabilities. For example, a blind person could be greatly served by the ability to interact through the voice and audio components of a multimodal application.
It is an exciting space with plenty of potential to uncover new use cases. But remember that, at least for the foreseeable future, these applications are largely underpinned by the transformer architecture.
Terrence Alsup is a senior data scientist at Finastra.