Google Has Finally Dethroned ChatGPT

When you look at what Google has just achieved, it’s no wonder OpenAI suddenly released Sora a few hours later, seemingly to distract the world from the fact that it is no longer in the lead in the Large Language Model (LLM) space.

Google’s Gemini 1.5 Pro is a generational leap in terms of Multimodal Large Language Models, or MLLMs, much like GPT-4 was to LLMs back in March 2023.

Specifically, it’s capable of processing millions of words at a time, 40-minute-long videos, or 11 hours of audio in seconds, with 99% context-retrieval accuracy, a level of recall unheard of in the field until today.

The long-sequence era has arrived, and with it, a new dominant player looks down at OpenAI for the first time.

The King Reclaims its Throne

In November 2022, Google, the unequivocal king of the AI industry for more than a decade, watched as a Microsoft-backed company largely unknown to the general public, OpenAI, launched a product, ChatGPT, that changed the narrative completely and pushed Google into the runner-up position.

Sam, the most hated man in Google HQ

AI suddenly became the most important technology, but at the same time, Google was no longer seen as being at the forefront of the space.

Sam Altman had released ChatGPT, powered by the most capable LLM the world had ever seen.

At the same time, Google had nothing to offer. Sure, they had projects underway inside their Mountain View HQ, like LaMDA, but nothing near the quality and production readiness of ChatGPT.

And the most painful part?

ChatGPT was based on the Transformer, an architecture that, to the dismay of many Google shareholders at the time, had been created by Google researchers back in 2017.

In other words, it was as if Google had been sitting on its hands while having the ‘secret sauce’ to the next technological leap in a dusty cupboard.

Unforgivable.

Naturally, Google got the memo and set itself to work.

All roads led to Gemini

After a while, Google released Bard, a complete and utter disaster compared with GPT-4, the model OpenAI released in March 2023 to power ChatGPT.

Now Google looked even more behind than ever.

Then, at the end of 2023, Google finally released Gemini 1.0, a family of natively multimodal LLMs. “Natively multimodal” means they were trained from the ground up to process video, images, and text, while also being able to generate text, code, and images. This put the search company at least at the level of OpenAI’s GPT-4, if we consider Gemini 1.0 Ultra, the most capable model.

However, if we think about timing, this was not special at all.

At the end of the day, Gemini was released around November last year and only just managed to compete with a product OpenAI had released back in March.

Unsurprisingly, to avoid the industry’s scorn, they swiftly released AlphaCode 2 at the same time, a model that combined Gemini with a search algorithm and test-time computation, allowing AI to compete at the premium level of competitive programming, scoring an astonishing 85th percentile.

Gemini, the Long-Range SuperModel

Put simply, Gemini 1.5 is beyond impressive.

Although we only have the results of the Pro model, the mid-sized one, presumably indicating that an even better model is coming soon, the scores are incredible.

Reminder: the Gemini family is divided into three tiers, from smallest (and least capable) to largest (and most capable): Nano, Pro, and Ultra.

For starters, it has the longest compute-and-performance-optimized context window known to date: up to 10 million tokens.

But what is a token, and what is the context window?

Tokens are the units Transformers use to process and generate data. In the case of text, they are usually between 3 and 4 characters long. For instance, although this will depend on the tokenizer you use (a model that divides your text into tokens), ‘London’ could be divided into the tokens ‘Lon’ and ‘don’.
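To make the idea concrete, here is a deliberately toy sketch: real tokenizers (BPE, SentencePiece, and friends) learn their vocabulary from data, but the ~3-character rule of thumb can be crudely mimicked by cutting text into fixed 3-character chunks. The function name and chunk size are illustrative assumptions, not any real library’s API.

```python
# Toy illustration only: real tokenizers learn subword vocabularies from data.
# Here we crudely mimic the ~3-character rule of thumb by slicing the text
# into fixed 3-character chunks.
def toy_tokenize(text: str, chunk: int = 3) -> list[str]:
    return [text[i:i + chunk] for i in range(0, len(text), chunk)]

print(toy_tokenize("London"))  # ['Lon', 'don']
```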

On the other hand, the context window is the maximum number of tokens an LLM can process at any given time. It is the model’s working memory, akin to what random-access memory (RAM) is to a computer processor.
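The RAM analogy can be sketched in a few lines: once a conversation exceeds the window, the oldest tokens must be dropped (or summarized elsewhere) before the model can run. The function and numbers below are illustrative assumptions, not any actual API.

```python
# Minimal sketch of the context window as working memory: anything beyond
# the window cannot be attended to, so the oldest tokens are discarded.
def fit_to_window(tokens: list[str], window: int) -> list[str]:
    """Keep only the most recent `window` tokens."""
    return tokens[-window:] if len(tokens) > window else tokens

history = [f"tok{i}" for i in range(12)]
print(fit_to_window(history, window=8))  # only the 8 most recent tokens survive
```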

Context windows exist for a simple reason: long sequences are expensive and hard to model.

Specifically, the attention mechanism at the heart of an LLM has quadratic complexity relative to the sequence length. In layman’s terms, if you double the length of the sequence you give the model, the cost quadruples.
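A back-of-envelope sketch makes the quadratic scaling tangible: in standard attention, every token attends to every other token, so the number of pairwise comparisons grows as n × n. The numbers below are illustrative.

```python
# Quadratic attention cost, back of the envelope: each of the n tokens
# attends to all n tokens, giving n * n pairwise comparisons.
def attention_pairs(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_pairs(n):>12,} comparisons")
# Doubling the sequence (1,000 -> 2,000) quadruples the count;
# quadrupling it (1,000 -> 4,000) multiplies the count by 16.
```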

Also, Transformers suffer greatly from performance degradation when working with sequences longer than those they were trained on.

This problem is known as the extrapolation problem (other design choices, like picking the right positional embeddings, factor in too, but that’s a conversation for another time).

Think about extrapolation as if you have trained yourself to run 5 miles per day and suddenly one day you unexpectedly go for 15. Naturally, those 10 extra miles are going to be harder and you are going to perform much worse.

But to grasp the size of Google’s context window increase, how much is 10 million tokens?

That is:

  • Around 7.5 million words, or roughly 15,000 500-word pages, far more than the entire Harry Potter saga.
  • 44-minute-long silent videos
  • 6–8 minutes of a standard frame YouTube video

In one go.
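The word and page figures above can be reproduced with the common rule of thumb of roughly 0.75 words per token (about 4 characters per token). The conversion factors below are rule-of-thumb assumptions, not exact tokenizer behavior.

```python
# Back-of-envelope conversion of 10 million tokens into words and pages,
# assuming the common ~0.75 words-per-token rule of thumb.
TOKENS = 10_000_000
WORDS_PER_TOKEN = 0.75   # rule-of-thumb assumption
WORDS_PER_PAGE = 500

words = TOKENS * WORDS_PER_TOKEN   # 7,500,000 words
pages = words / WORDS_PER_PAGE     # 15,000 pages
print(f"{words:,.0f} words, {pages:,.0f} pages")
```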
