December Roundup
Hi, and welcome to December’s monthly roundup!
This is the place where I share some of my current work and reading.
Large Speech Models
Just as NLP has moved towards large language models, the world of voice technology is moving towards large speech models. This month Amazon announced that Amazon Transcribe - their automatic speech recognition service - now runs on a new “multi-billion parameter” speech foundation model. This means they can support more languages with better accuracy, which was difficult to do with previous generations of models. Other large speech foundation models of note come from Meta: SeamlessM4T v2 and Audiobox were both announced in the last month. SeamlessM4T focuses on speech-to-speech translation, while Audiobox is all about audio generation. Large language models paved the way, but large speech models are rapidly catching up, and I expect we’ll see many more of them in the near future.
Visual Instruction Tuning
Another neat idea that crossed my path is visual instruction tuning. I was trying to work out the best way to train an image classification model, though I’d never worked with image models before. Given the success of LLMs on NLP classification tasks, it makes sense that image classification could be handled by a multimodal LLM. This particular setup uses CLIP to represent images and Llama-2 for the text. The image model is kept fixed, but visual instruction tuning updates both the LLM weights and a projection between the image and language models. The result is a chatbot you can talk to about an image. The paper is an oral presentation at NeurIPS this month, so clearly others thought it was interesting too.
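To make the architecture concrete, here’s a minimal PyTorch sketch of the setup described above: a frozen image encoder, a trainable projection into the language model’s embedding space, and projected “visual tokens” prepended to the text sequence. The dimensions, module names, and stand-in encoders are all illustrative assumptions, not the paper’s actual implementation (which uses CLIP’s vision tower and Llama-2).

```python
import torch
import torch.nn as nn

IMG_DIM = 512    # toy size for the frozen image-encoder features
LLM_DIM = 4096   # toy size for the LLM's token embeddings
VOCAB = 32000    # toy vocabulary size

class VisualProjector(nn.Module):
    """Trainable projection mapping image features into the LLM's embedding space."""
    def __init__(self, img_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(img_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.proj(image_features)

# Stand-in for the frozen image encoder (in the paper, CLIP's vision tower).
image_encoder = nn.Linear(3 * 224 * 224, IMG_DIM)
for p in image_encoder.parameters():
    p.requires_grad = False  # the image model stays fixed

# The projection and the LLM weights are what visual instruction tuning updates;
# the embedding table here stands in for the (Llama-style) language model.
projector = VisualProjector(IMG_DIM, LLM_DIM)
text_embeddings = nn.Embedding(VOCAB, LLM_DIM)

# One flattened image and a short instruction, as token IDs.
image = torch.randn(1, 3 * 224 * 224)
tokens = torch.randint(0, VOCAB, (1, 8))

# Project image features to a "visual token" and prepend it to the text
# embeddings, so the LLM attends over both modalities jointly.
visual_tokens = projector(image_encoder(image)).unsqueeze(1)   # (1, 1, LLM_DIM)
sequence = torch.cat([visual_tokens, text_embeddings(tokens)], dim=1)
print(sequence.shape)  # torch.Size([1, 9, 4096])
```

Training then applies a normal language-modelling loss over the combined sequence; gradients flow into the projector and the LLM, but not into the frozen image encoder.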
New LLMs
Another multimodal model released this month was Google’s long-awaited Gemini. It combines text, image, video, audio and code, and comes in three sizes. Google’s results show that their largest model (‘Ultra’) outperforms GPT-4 on several common benchmarks, though it’s hard to verify these results without direct access to the model. Meanwhile, Mistral open-sourced their latest model with no fanfare.
Trust in synthetic media
As multimodal models become more capable, it’s increasingly important to understand how people react to AI-generated content. Two recent studies do just that in different contexts: one for copywriting, and another for news. The first found that people don’t mind AI-generated copy, while the second shows that AI-generated news stories can be seen as less trustworthy. Understanding reactions to synthetic media in different contexts will be really important for companies wanting to generate synthetic content.
EU AI Act Agreement
After long negotiations, the EU has agreed on the shape of its AI Act, though the wording is still to be hashed out. The act would ban some applications of AI outright and place obligations on companies developing “high-risk” applications. It won’t, however, apply to defence or military applications, or to research-only AI.
AI for Science
This month I was at Cambridge’s AI for Science Summit, where I chaired two panels and learned about the multitude of ways researchers are using AI across disciplines. It turns out there are plenty of challenges to overcome! Researchers find it tough to access the AI/ML and software expertise that complements their domain expertise. And because this combination of expertise is needed, AI for Science work often ends up as interdisciplinary projects, which don’t always align nicely with academic incentives. Data sharing and standardisation also differ wildly across disciplines, with a direct impact on how easy it is for researchers to use AI. Despite these challenges, I was very impressed by the range of projects on show. From predicting breaking waves to monitoring bee health and detecting drought stress in plant leaves, there were so many creative ideas, and I’m looking forward to seeing where they all go!
What I’ve been reading
The Art of Framing [book]
The limits of the Mean Opinion Score for speech synthesis evaluation
Meta’s new AI image generator was trained on 1.1 billion Instagram and Facebook photos
IBM, Meta form “AI Alliance” with 50 organizations to promote open source AI
Artificial Intelligence Act: deal on comprehensive rules for trustworthy AI
Wishing you a happy New Year, see you in 2024!