June Roundup
Microsoft’s Small Language Models
LLMs have been getting bigger and bigger, but over the past couple of years Microsoft has been working on a series of small yet capable language models. Together, these models show how highly curated, targeted training datasets can be used to build capable language models ranging from 2.5M to 3.8B parameters. This month’s newsletter looks at these models.
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Find the paper: https://arxiv.org/abs/2305.07759
How small can a language model be? Until recently, it’s looked like the ability to generate coherent text comes only with increasingly large model sizes. This paper hypothesises that current large training sets for LLMs are too complex to be modelled effectively by small models, but that a simpler, high-quality training set could allow much smaller models to do well. It looks at building small language models using a synthetic dataset of stories, generated by GPT-4, with vocabulary that a 3-4 year old might recognise.
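To make the approach concrete, here’s a minimal sketch of how this kind of constrained synthetic generation could look with the OpenAI Python client. The word lists, prompt wording and model choice are my own illustrative assumptions, not the paper’s exact setup (the paper does seed each story with a few randomly chosen basic words to keep the dataset diverse):

```python
# Sketch: generate a TinyStories-style example by prompting an LLM with a
# constrained vocabulary. The word lists and prompt wording are made up here.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The paper seeds each story with a few required basic words for diversity;
# these particular words are illustrative only.
nouns = ["dog", "ball", "tree", "cake"]
verbs = ["run", "jump", "share", "find"]
adjectives = ["happy", "big", "red", "sleepy"]
required = [random.choice(nouns), random.choice(verbs), random.choice(adjectives)]

prompt = (
    "Write a short story (3-5 paragraphs) using only simple words that a "
    "3-4 year old would understand. The story must use the words: "
    + ", ".join(required) + "."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```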
Models with 2.5M to 80M parameters were built, far smaller than other language models. By limiting the training data to children’s stories, these models performed as well as, or better than, larger general LLMs on the task of story completion, and they generated coherent and plausible text. Evaluation was done by having GPT-4 score the stories on different axes, such as grammar, creativity and consistency.
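A rough sketch of that kind of LLM-as-grader evaluation might look like the following. The grading axes follow the paper’s description, but the prompt wording and output format are my assumptions:

```python
# Sketch: have GPT-4 grade a small model's story completion on a few axes.
# The axes follow the paper's description; prompt wording is an assumption.
from openai import OpenAI

client = OpenAI()

def grade_completion(story_beginning: str, completion: str) -> str:
    prompt = (
        "A student was given the beginning of a story and asked to complete it.\n\n"
        f"Beginning: {story_beginning}\n\n"
        f"Completion: {completion}\n\n"
        "Grade the completion from 1 to 10 on each of grammar, creativity, and "
        "consistency with the beginning. Reply with the three scores only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(grade_completion(
    "Once upon a time, a little dog found a big red ball in the garden.",
    "The dog picked up the ball and ran to show it to his friend the cat.",
))
```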
The authors also noticed a level of interpretability in the smallest models, with individual attention heads and neurons appearing to serve specific purposes. Overall, this paper shows that curating the training data is another dimension along which LLMs can potentially be improved.
Textbooks Are All You Need (Phi-1)
Find the paper: https://arxiv.org/abs/2306.11644
Phi-1 is a 1.3B parameter language model specialised for Python coding. It’s been shown previously that increasing either the training data size or the model size of LLMs leads to more capable models. In this paper, the authors look at the impact of improved training data quality on model performance, while keeping model and training data size on the smaller side.
The bulk of the training data (6B tokens) was taken from coding sites on the web and filtered down to a subset of high educational value. The filtering was done automatically: GPT-4 annotated a small number of examples, which were then used to train a classifier to predict quality. The remaining 1B tokens in the training set were synthetic examples generated by GPT-3.5, which was prompted to produce a range of diverse, high-quality textbook-style examples. An additional, smaller fine-tuning set of around 200M tokens was also synthetically generated by GPT-3.5: Python coding exercises like you might find in an educational textbook.
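The classifier-based filtering step is the interesting part of the pipeline. Here’s a minimal sketch of the idea; the embedding model and classifier below are stand-ins I’ve chosen for illustration, and the paper’s actual choices differ:

```python
# Sketch: filter a large code corpus with a small classifier trained on
# LLM-annotated quality labels. Embedding model and classifier are stand-ins.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# A handful of GPT-4-labelled snippets: 1 = high educational value, 0 = low.
labelled_snippets = [
    '"""Return the arithmetic mean of a list of numbers."""\n'
    "def mean(xs):\n    return sum(xs) / len(xs)",
    "a=1;b=2;c=3;print(a+b+c)  # no structure, no explanation",
]
labels = [1, 0]

clf = LogisticRegression().fit(embedder.encode(labelled_snippets), labels)

# Apply the classifier to the (much larger) scraped corpus and keep only the
# snippets it predicts to be high quality.
web_snippets = ["...millions of scraped code files go here..."]
filtered = [s for s in web_snippets if clf.predict(embedder.encode([s]))[0] == 1]
```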
Testing was done on the HumanEval coding benchmark. Heavily curating the training data led to a model with high performance, despite the small model and dataset size. The model took 4 days to train on 8 A100 GPUs, making it reasonably affordable to build and demonstrating the potential of data curation for LLM training.
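For reference, HumanEval results are usually reported as pass@k, estimated with the unbiased estimator introduced alongside the benchmark. A small sketch (the sample counts in the example are hypothetical):

```python
# Sketch: the unbiased pass@k estimator commonly used for HumanEval results.
# n = completions sampled per problem, c = completions that pass the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k sampled completions passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 37 of which pass the tests.
print(pass_at_k(200, 37, 1))   # 0.185
print(pass_at_k(200, 37, 10))  # ~0.88
```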
Textbooks Are All You Need II: phi-1.5 technical report
Find the paper: https://arxiv.org/abs/2309.05463
Following on from phi-1 - which was built for coding - Microsoft’s phi-1.5 model explores building small language models for common sense reasoning. Phi-1.5 has the same size and architecture as phi-1, at 1.3B parameters, but uses much more, and much more varied, training data. The training data includes the coding dataset from phi-1, along with an additional 20B synthetic tokens of ‘textbook’-style data spanning a range of domains beyond coding. Then, as with phi-1, a filtered web dataset was also created, using GPT-4 to aid the filtering.
On common sense reasoning benchmarks, phi-1.5 performs as well as models that are many times larger. The version trained on filtered web data performs as well as, or better than, larger models trained on the unfiltered version of the web data.
Again, this paper adds to evidence that model size is not the only factor in the performance of language models, but that data quality is also key to building capable small language models.
Phi-2: The surprising power of small language models
Find details: Blog post
The only details I can find about phi-2 are in the blog post. The training data for this model was that of phi-1 with additional web data automatically filtered for quality. It has 2.7B parameters and looks to perform a little better than phi-1.5.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Find the paper: https://arxiv.org/abs/2404.14219
Bringing us up to date with Microsoft’s small language model explorations is the recently released phi-3 family. Phi-3-mini is a 3.8B parameter model, with 7B and 14B parameter versions promised in the future. As with previous phi models, the training data is a mix of synthetic data and web data that’s been heavily filtered for quality. Training happens in two phases: the first uses general knowledge data, while the second focuses on reasoning and other ‘niche skills’.
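A very rough sketch of what that kind of phased pre-training loop could look like in PyTorch; the model, datasets, step counts and hyperparameters below are all placeholders, not the paper’s recipe:

```python
# Sketch: two sequential pre-training phases over different data mixtures.
# The model, datasets, step counts and hyperparameters are all placeholders.
import torch
from torch.utils.data import DataLoader

def train_phase(model, dataset, steps, lr=1e-4, batch_size=8):
    """Run one pre-training phase over the given data mixture."""
    optimiser = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    batches = iter(loader)
    for _ in range(steps):
        try:
            batch = next(batches)
        except StopIteration:  # restart the loader if the mixture is exhausted
            batches = iter(loader)
            batch = next(batches)
        loss = model(**batch).loss  # assumes a HF-style causal LM exposing .loss
        loss.backward()
        optimiser.step()
        optimiser.zero_grad()

# Phase 1: broad web data for general knowledge (hypothetical dataset names).
# train_phase(model, general_web_mix, steps=100_000)
# Phase 2: heavier weighting of synthetic, reasoning-focused data.
# train_phase(model, reasoning_heavy_mix, steps=50_000)
```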
Phi-3 is more than a base model - it’s been fine-tuned for chat applications and safety. On some benchmarks, like MMLU, its performance is comparable to larger models like Llama 3. On other benchmarks, like trivia questions, it doesn’t perform as well, likely because the model doesn’t have the capacity to store much factual information. Quantising the model to 4 bits means it can be stored in around 1.8GB, so it’s able to run on a phone.
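As a quick back-of-the-envelope check of that 1.8GB figure (a rough calculation that ignores quantisation scales, any tensors kept at higher precision, and other overheads):

```python
# Back-of-the-envelope size of a 4-bit quantised 3.8B parameter model,
# ignoring quantisation scales and other overheads.
params = 3.8e9
bits_per_param = 4
size_bytes = params * bits_per_param / 8
print(f"{size_bytes / 1e9:.2f} GB")     # 1.90 GB (decimal)
print(f"{size_bytes / 2**30:.2f} GiB")  # 1.77 GiB
```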
As with previous phi models, this paper further demonstrates that the quality of data is another dimension that, along with size, can lead to improved language model performance in certain tasks.
Some Notable Releases
Meta released a variety of models and datasets
Other stuff I read
Mapping the Mind of a Large Language Model from Anthropic
Challenges in red teaming AI systems from Anthropic
Work with me
I work with organisations who are building AI - as a technical advisor, coach and speaker. Get in touch if you’d like to talk about working together.