July Roundup
Hello friends!
Welcome to the July roundup - your monthly dose of papers, events, and other interesting bits I’ve been reading lately. I also share these over on LinkedIn and Bluesky, where you can follow along too.
Introducing Lichen AI
Exciting news! With Diana Robinson and Tabitha Goldstaub, we're launching Lichen AI. We’re building an AI ethnographer that maps the unspoken knowledge in your organisation—the shortcuts, workarounds, and cultural insights that make teams effective—then designs AI systems that work with this reality, not against it.
I’ve learned that most AI implementations fail because they ignore the messy, human reality of how work actually gets done. To overcome this, our approach brings together scalable methods from human-centred interaction design, conversational AI, product management, and tech strategy.
If you’d like to learn more, email me or book a call at a time that works for you.
Articles & Other Links
Career Spotlight
I know solid career advice can be hard to come by in the AI world, which is exactly why I started this spotlight series. The idea is to highlight a range of career paths in AI and, hopefully, spark a bit of inspiration by sharing how some brilliant people have shaped their journeys.
This month’s spotlights:
Tey Bannerman, Partner and AI Implementation Leader at McKinsey & Company
Andreas Damianou, Senior Research Manager at Spotify
Podcast
Podcast episodes from this month:
Paul Wilson from STAC on bringing some of the Silicon Valley attitude to Scotland, and building up the tech ecosystem in Glasgow:
Bhagya Reddy from Virgin Media on the critical importance of data literacy across organisations, and how data skills can empower career transitions:
Papers I’ve read
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Find the paper: https://arxiv.org/abs/2506.01732v1
What does it take to build an ethical corpus for LLM training?
Current methods for training language models like ChatGPT rely on indiscriminately scraping internet data, including copyrighted books, articles, and code. This raises ethical concerns and has prompted legal challenges that are now playing out in court.
This paper introduces Common Corpus: the first massive dataset built entirely from legally permissible sources, meaning data that is either uncopyrighted or available under permissive licenses. The corpus is just over 1 trillion words, or 2 trillion tokens, spread across subgenres like government, science, web and code. For comparison, Llama 2 was trained on 2 trillion tokens, and Llama 4 on 30 trillion tokens.
The creators have put significant effort into cleaning up the data - fixing OCR errors from old books, removing harmful or biased content, and protecting people’s privacy - so that AI systems built from it are both reliable and respectful.
The authors point out that uncovering open data is not easy, as much of it is hidden from view or difficult to source, and they continue to add to the collection.
Thanks for reading! See you next month,
Catherine.