July Roundup
Hello friends!
Welcome to the July roundup - your monthly dose of papers, events, and other interesting bits I’ve been reading lately. I also share these over on LinkedIn and Bluesky, where you can follow along too.
Introducing Lichen AI
Exciting news! With Diana Robinson and Tabitha Goldstaub, we're launching Lichen AI. We’re building an AI ethnographer that maps the unspoken knowledge in your organisation—the shortcuts, workarounds, and cultural insights that make teams effective—then designs AI systems that work with this reality, not against it.
I’ve learned that most AI implementations fail because they ignore the messy, human reality of how work actually gets done. To overcome this, our approach brings together scalable methods from human-centred interaction design, conversational AI, product management, and tech strategy.
If you’d like to learn more, email me or book a call at a time that works for you.
Articles & Other Links
Career Spotlight
I know solid career advice can be hard to come by in the AI world, which is exactly why I started this spotlight series. The idea is to highlight a range of career paths in AI and, hopefully, spark a bit of inspiration by sharing how some brilliant people have shaped their journeys.
This month’s spotlights:
Tey Bannerman, Partner and AI Implementation Leader at McKinsey & Company
Andreas Damianou, Senior Research Manager at Spotify
Podcast
Podcast episodes from this month:
Paul Wilson from STAC on bringing some of the Silicon Valley attitude to Scotland, and building up the tech ecosystem in Glasgow:
Bhagya Reddy from Virgin Media on the critical importance of data literacy across organisations, and how data skills can empower career transitions:
Papers I’ve read
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Find the paper: https://arxiv.org/abs/2506.01732v1
What does it take to build an ethical corpus for LLM training?
Current methods for training language models like ChatGPT rely on indiscriminately scraping internet data, including copyrighted books, articles, and code. This raises ethical concerns and has prompted legal challenges that are now playing out in court.
This paper introduces Common Corpus: the first massive dataset built entirely from legally permissible sources, meaning data that is either uncopyrighted or available under permissive licenses. The corpus is just over 1 trillion words, or 2 trillion tokens, spread across subgenres like government, science, web and code. For comparison, Llama 2 was trained on 2 trillion tokens, and Llama 4 on 30 trillion tokens.
The creators have put significant effort into cleaning up the data - fixing OCR errors from old books, removing harmful or biased content, and protecting people’s privacy - so that AI systems built from it are both reliable and respectful.
The authors point out that uncovering open data is not easy, as much of it is hidden from view or difficult to source, and they continue to add to the collection.
Thanks for reading! See you next month,
Catherine.