April Roundup
Welcome to April’s roundup - a monthly newsletter covering papers and other reading I’ve done during the month. I also share on LinkedIn throughout the month.
In April I’m wrapping up my coaching course, which will make me a qualified coach! So for May’s newsletter, I plan to dive into AI and coaching papers - an area getting a lot of attention in the coaching world. Let me know if you have any reading recommendations.
Papers I Read
As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli
Find the paper: https://arxiv.org/abs/2403.16760
How good are people at identifying today’s deepfakes? According to this study, people’s rate of detecting synthetic media in realistic scenarios sits at around 50%. And this was the case whether or not the participant was already familiar with synthetic media.
The study looked at image, video and audio data. Participants were best at identifying synthetic audiovisual content, and least accurate at identifying images - especially those that contained human faces.
The authors argue that we can no longer rely on people correctly identifying synthetic media as the main defence against misuse. In a year when about 50% of the world’s population are heading to the polls, and synthetic media is becoming easier and easier to generate, it’s increasingly important to find other ways to defend against its misuse.
A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models
Find the paper: https://arxiv.org/abs/2309.11674
Can LLMs do better than dedicated models built solely for machine translation? LLMs can translate between languages, but until recently dedicated models have outperformed them at translation.
Machine Translation models are trained using parallel data - that’s equivalent text in different languages. However, parallel data is expensive to create. This paper shows that finetuning LLMs on parallel data at first leads to an improvement in their translation capability. But the performance diminishes as you tune on more and more parallel data, and the authors suggest that this could be down to catastrophic forgetting. The implication is that finetuning LLMs with parallel data can only get you so far towards improving translation.
To get past this, LLaMA-2 is first finetuned on plain non-parallel text data from a mix of languages. As LLMs are trained mostly on English data, this step improves their core multilingual capability. Only then do the authors finetune on a small amount of parallel data. These two steps build an LLM which outperforms previous uses of LLMs for translation, and also outperforms existing dedicated state-of-the-art Machine Translation models like NLLB-54B.
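To make the recipe concrete, here’s a minimal sketch of what the two-stage fine-tuning could look like with Hugging Face’s transformers library. This isn’t the authors’ code - the model name, dataset files and hyperparameters are all illustrative placeholders.

```python
# Sketch of the two-stage recipe: stage 1 continues training on plain monolingual
# text in several languages, stage 2 fine-tunes on a small set of parallel sentence
# pairs written out as translation prompts. File names and settings are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal-LM loss

stages = [
    ("stage1_monolingual", "monolingual_mix.jsonl"),  # plain text in a mix of languages
    ("stage2_parallel", "parallel_prompts.jsonl"),    # e.g. "Translate German to English:\n<src>\n<tgt>"
]
for out_dir, data_file in stages:
    dataset = load_dataset("json", data_files=data_file)["train"].map(tokenize, batched=True)
    trainer = Trainer(
        model=model,  # the same model object is carried from stage 1 into stage 2
        args=TrainingArguments(output_dir=out_dir, num_train_epochs=1,
                               per_device_train_batch_size=4),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()
```

The key point is simply that the bulk of the training budget goes on cheap monolingual text, and only a small, high-quality parallel set is needed at the end.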
ReALM: Reference Resolution As Language Modeling
Find the paper: https://arxiv.org/abs/2403.20329
How to resolve ambiguous references - like ‘that’ or ‘this’ - that a user makes in conversation? LLMs are ok at doing this in some contexts, but it gets tricky when the entity that the user’s talking about is on their screen and hasn’t yet been part of the conversation. For example, when the user asks for a list of nearby restaurants, and then asks to “call the second”.
This paper from Apple looks at how to encode the user’s screen as plain text so that it can be input to a small language model to resolve the references. The user’s screen is first input to a model that detects entities, their types, and their position on the screen. Then, the entities are encoded as plain text in a left-to-right & top-to-bottom fashion, to retain spatial information. Finally, an LLM is fine-tuned (on data that’s been labelled by annotators) to predict which entity a user is talking about.
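As a rough illustration of the encoding step (not Apple’s implementation), here’s one way you might linearise detected entities into text while preserving their layout. The entity fields, grouping tolerance and prompt format are all assumptions made for the sake of the example.

```python
# Turn detected on-screen entities into plain text: sort by position so the text
# keeps top-to-bottom, left-to-right layout, and tag each entity with an index
# that the model can answer with.
from dataclasses import dataclass

@dataclass
class ScreenEntity:
    text: str          # surface text, e.g. "Luigi's Pizzeria"
    entity_type: str   # e.g. "business", "phone_number", "address"
    x: float           # left edge of the bounding box (0..1 of screen width)
    y: float           # top edge of the bounding box (0..1 of screen height)

def encode_screen(entities: list[ScreenEntity], line_tol: float = 0.02) -> str:
    """Linearise entities: left-to-right within a line, lines top-to-bottom."""
    ordered = sorted(entities, key=lambda e: (e.y, e.x))
    lines, current, last_y = [], [], None
    for idx, ent in enumerate(ordered):
        # Start a new text line when the vertical position jumps past the tolerance.
        if last_y is not None and abs(ent.y - last_y) > line_tol:
            lines.append(current)
            current = []
        current.append(f"[{idx}|{ent.entity_type}] {ent.text}")
        last_y = ent.y
    if current:
        lines.append(current)
    return "\n".join("\t".join(line) for line in lines)

screen = [
    ScreenEntity("Luigi's Pizzeria", "business", 0.10, 0.20),
    ScreenEntity("0123 456 789", "phone_number", 0.60, 0.20),
    ScreenEntity("Mario's Trattoria", "business", 0.10, 0.35),
]
prompt = encode_screen(screen) + "\nUser: call the second one\nWhich entity index is the user referring to?"
```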
The authors choose to encode the user’s screen as text rather than as an image because screenshots are already highly textual, and also are different to the sorts of image that vision models are usually trained on. The fine-tuned LLM in the paper is faster (fewer parameters) than GPT-3.5 and GPT-4. In terms of accuracy, it performs better than GPT-3.5, and is on a par with GPT-4 directly using screenshots. This makes it suitable to be deployed as a lightweight on-device model for reference resolution.
The Unreasonable Ineffectiveness of the Deeper Layers
Find the paper: https://arxiv.org/abs/2403.17887
How to use LLMs if you don’t have a host of GPUs available? Usually, running LLMs at inference time with less computation is achieved via a mix of quantisation and pruning. This paper examines a pruning strategy that identifies and removes layers of a quantised network that are similar to each other. Then, fine-tuning (using QLoRA) is done to “heal” the damage caused by removing layers. The authors find that by doing this they can remove up to about half of the layers of Llama-2-70B before the model’s performance collapses on downstream question-answering tasks.
Pruning layers makes the model smaller and faster, but also brings with it the realisation that those removed layers are not being effectively utilised by the model. In particular, the layers that were removed here tended to be the deeper layers of the network. Thus, the authors hypothesise that it’s the shallower layers of the network that do more of the work in storing knowledge from the training data. A simpler pruning method of just removing the deepest layers and fine-tuning turns out to work almost as well as the similarity-based pruning.
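For a flavour of what the simpler variant involves, here’s an illustrative sketch of dropping a block of deep decoder layers from a Llama-style model with Hugging Face transformers. It isn’t the paper’s code - the model name and the number of layers to drop are placeholders, and the QLoRA “healing” step is only indicated in a comment.

```python
# Remove a block of deep decoder layers from a Llama-style causal LM.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             torch_dtype=torch.float16)

n_drop = 8                                # how many deep layers to remove (placeholder)
layers = list(model.model.layers)         # the stack of transformer decoder blocks

# Illustrative choice: keep the very last decoder layer and drop the n_drop layers
# just before it, i.e. the deepest block apart from the final layer.
pruned = layers[: len(layers) - 1 - n_drop] + layers[-1:]
model.model.layers = torch.nn.ModuleList(pruned)
model.config.num_hidden_layers = len(pruned)

# A light QLoRA fine-tune on generic text would then be used to "heal" most of the
# lost quality before evaluating the smaller model on question-answering tasks.
```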
With this combination of quantisation and layer pruning, the pruned models derived from Llama-2-70B are able to run at inference time on a single GPU. Inference time matters because LLMs are trained once, but run in inference mode many more times. Making LLMs more efficient saves energy, cost and computation, and allows them to be more easily used by a wider range of people.
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
Find the paper: https://arxiv.org/abs/2403.10301
Multimodal foundation models for understanding scientific literature? This paper collects data on scientific tasks using expert annotations and user feedback. The data is used to fine-tune (or build) a multimodal model that’s better able to work with scientific literature. The paper is light on modelling details, but the modalities it incorporates are text, chemical formulas & reactions, charts and tables.
Along with evaluating performance on a scientific benchmark (SciAssess), the paper shows examples of the model performing two additional tasks - patent infringement analysis and chart analysis.
As LLMs and vision models become more capable, there’s more and more research extending them towards understanding and working with scientific literature.
The Curse of Recursion: Training on Generated Data Makes Models Forget
Find the paper: https://arxiv.org/abs/2305.17493
How does synthetic data affect model training? This paper set out to look at what happens over successive generations of ML models, each trained only on synthetic data generated by the previous generation. The paper found that the models got progressively worse, until performance severely degraded. The authors call this 'model collapse'.
The main driver of model collapse is that you would need infinitely many synthetic samples to fully capture the tails of the underlying distribution. In practice, you can only generate a finite amount of synthetic data. So, over generations, the tails get modelled less and less well, and the distribution of the synthetic data diverges from that of the initial real-world data.
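The intuition carries over to a toy setting: if you repeatedly fit a simple distribution to samples drawn from the previous fit, the tails get lost over generations. Here’s a small numpy simulation in that spirit (not the paper’s code - the sample size and number of generations are arbitrary):

```python
# Toy illustration of distribution collapse: each "generation" fits a Gaussian to a
# finite sample drawn from the previous generation's fitted Gaussian. The fitted
# spread drifts and, over many generations, tends to shrink, so the tails of the
# original distribution are progressively lost.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_generations = 100, 2000

mu, sigma = 0.0, 1.0                       # generation 0: the "real" data distribution
for gen in range(n_generations):
    synthetic = rng.normal(mu, sigma, size=n_samples)  # data generated by the current model
    mu, sigma = synthetic.mean(), synthetic.std()      # the next model is fit only to that data
    if gen % 500 == 0:
        print(f"generation {gen:4d}: mean={mu:+.4f}, std={sigma:.4f}")

print(f"final generation:  mean={mu:+.4f}, std={sigma:.4f}")
```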
The paper’s setup is hypothetical, but it shows how model collapse could happen as generations of LLMs are used to generate text, that text is published on the internet, and it is then used to train the next generation of LLMs.
Some Notable Model Releases
Mistral 7B v0.2 [download link] (base model that was used for Mistral 7B Instruct v0.2)
Cohere Command R+
FineWeb [dataset]
Other stuff I read
Navigating the Challenges and Opportunities of Synthetic Voices (OpenAI blog)
Models all the way - a look at AI training data
AI hype is built on high test scores. Those tests are flawed
OpenAI transcribed over a million hours of YouTube videos to train GPT-4
“Exit Interview: The Life and Death of My Ambitious Career” [Book]