March Roundup
Welcome to March’s monthly roundup! This edition covers papers and other reading from the month. Much of this I also share on LinkedIn.
Papers I read
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Find the paper: https://arxiv.org/abs/2402.17764
This is quantisation taken to the extreme! The size of Large Language Models (LLMs) is measured in the number of parameters, which typically ranges between 7 and 70 billion. Storing all those parameters takes a lot of memory - more than the average laptop has.
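To put numbers on that, here’s a quick back-of-the-envelope calculation (my own arithmetic, not from the paper), assuming each parameter is stored at 16-bit precision:

```python
# Rough arithmetic (mine, not the paper's): memory needed to store the
# parameters at 16-bit precision, i.e. 2 bytes per parameter.
for n_params in (7e9, 70e9):
    gigabytes = n_params * 2 / 1e9
    print(f"{n_params / 1e9:.0f}B parameters ~ {gigabytes:.0f} GB at fp16")
# 7B parameters ~ 14 GB, 70B parameters ~ 140 GB
```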
Engineers reduce the size of models by quantisation - i.e. reducing the precision of the parameters so each individual number takes less space on disk. That makes the whole model smaller. This recent paper from Microsoft takes quantisation to the extreme, restricting each parameter to one of just three values: -1, 0, or 1. Since three values take log2(3) ≈ 1.58 bits to encode - hence the title - this significantly shrinks the amount of memory needed. That has the side effect of making these LLMs faster to use, and also reduces their energy consumption. Usually, quantisation comes with a small drop in performance. What’s interesting in this work is that the model’s performance is on a par with unquantised models.
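As a rough illustration, here’s a minimal sketch of an absmean-style ternary quantisation in the spirit of the paper. Note that the paper actually trains models under this constraint from scratch rather than quantising an existing model afterwards, so treat this purely as a picture of the -1/0/1 mapping:

```python
import numpy as np

def ternary_quantise(weights: np.ndarray, eps: float = 1e-5):
    """Quantise a weight matrix to the values {-1, 0, 1} using an
    absmean-style scaling, in the spirit of the paper."""
    # Scale by the mean absolute value of the whole matrix
    scale = np.mean(np.abs(weights)) + eps
    # Round to the nearest integer, then clip into the ternary set
    quantised = np.clip(np.round(weights / scale), -1, 1)
    return quantised, scale

w = np.random.randn(4, 4)   # a toy weight matrix
w_q, s = ternary_quantise(w)
print(w_q)                  # every entry is -1, 0 or 1
print(w_q * s)              # coarse approximation of the original w
```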
Algorithmic Progress in Language Models
Find the paper: https://arxiv.org/abs/2403.05812
Compute is all you need? This recent paper looks at over 200 LLMs published between 2012 and 2023 to tease out the factors driving improvement. The authors found that:
- The compute needed to build a model of fixed performance halves every 8 months. That’s really quite fast compared to Moore’s law, and to the speed of change historically seen with other innovations (see the quick arithmetic after this list)
- Increased compute resources (which effectively means bigger models and more data) are the key driving factor in improved performance, rather than algorithmic improvements
- The transformer architecture was a really important step forward, accounting for about 20% of the algorithmic improvements in the past 9 years
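To see why an 8-month halving time is striking, here’s the quick arithmetic - my own back-of-the-envelope numbers, not figures from the paper - comparing it to Moore’s law’s roughly two-year doubling:

```python
# Annual improvement implied by an 8-month halving time,
# vs Moore's law's ~24-month doubling.
llm_halving_months = 8
moore_doubling_months = 24

llm_gain_per_year = 2 ** (12 / llm_halving_months)       # ~2.83x per year
moore_gain_per_year = 2 ** (12 / moore_doubling_months)  # ~1.41x per year

print(f"LLM efficiency gain per year: {llm_gain_per_year:.2f}x")
print(f"Moore's law gain per year:    {moore_gain_per_year:.2f}x")
```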
The authors point out that in practice it’s hard to disentangle the impact of compute and algorithms, as they’re often changed at the same time. They also caution that this analysis of existing models says nothing about the speed and scale of future progress. Still, this paper backs up the intuition we all have about this being a fast-moving field!
On the Societal Impact of Open Foundation Models
Find the paper: https://arxiv.org/abs/2403.07918
Are open source models beneficial or risky for society? This question has been asked a lot in recent months, as growing numbers of foundation models are open-sourced. This recent paper looks at the societal impact of open source models over and above their closed source counterparts. The authors identify five ways that open source models differ:
- Access to open source models is broader
- It's easier to customise open source models using your own data
- People are able to download models and run them locally, rather than send data to services that they don't own
- It's impossible to entirely revoke access once a model has been released
- There's no centralised monitoring of use, so the people releasing the model have no control over how it's used and no way to iterate on the safety features
The authors go on to introduce a risk assessment framework that takes the current status quo into account and analyses the marginal risk of open source models relative to closed source ones. In scenarios like cyber security and voice cloning, this framework can give a better assessment of the actual risks of open-sourcing models.
Demystifying Embedding Spaces using Large Language Models
Find the paper: https://arxiv.org/abs/2310.04475
How can we understand embeddings? Embeddings are used a lot in AI - they’re vectors that represent objects. There are word embeddings, speaker embeddings, image embeddings… and even movie embeddings.
This paper uses movie embeddings to explore whether we can use natural language as a way to interact with embeddings, to visualise and better understand them. The authors create movie embeddings and feed them (via an adapter layer) into an LLM. The resulting model can perform natural language tasks that mix text and movie embeddings, like “Describe the movie [movie_embedding]”.
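Here’s a minimal sketch of the adapter idea - the dimensions and names are assumptions for illustration, not the paper’s actual architecture or training setup:

```python
import torch
import torch.nn as nn

# Made-up dimensions for illustration
movie_dim, llm_hidden_dim = 256, 4096

# The adapter projects a movie embedding into the LLM's input space
adapter = nn.Linear(movie_dim, llm_hidden_dim)

movie_embedding = torch.randn(movie_dim)  # stand-in for a learned embedding
# The projected vector can then sit in a prompt alongside ordinary token
# embeddings, e.g. "Describe the movie [movie_embedding]"
soft_token = adapter(movie_embedding)
```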
Better still, the model can do these tasks for arbitrary points in the movie embedding space - i.e. for movies that don’t exist! An example from the paper is the movie which is half ‘Forrest Gump’ and half ‘Inception’, where Forrest learns that he can control his dreams. Inventing new movies is fun! But in a world where embeddings are difficult to understand and interpret, these techniques can move us towards a better understanding of AI models.
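Continuing the sketch above, the ‘half Forrest Gump, half Inception’ trick is just linear interpolation between two embeddings (random stand-ins here, not the paper’s learned vectors):

```python
import torch
import torch.nn as nn

adapter = nn.Linear(256, 4096)    # same shape as the sketch above

forrest_gump = torch.randn(256)   # random stand-ins for learned
inception = torch.randn(256)      # movie embeddings

# A point halfway between the two films in embedding space
hybrid = 0.5 * forrest_gump + 0.5 * inception

# Passed through the adapter, this becomes a "movie" the LLM can be
# asked to describe - even though no such film exists
soft_token = adapter(hybrid)
```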