I study how computers learn language — and what that means for the rest of us.
My group builds the kind of AI that powers chatbots like ChatGPT,
but we also ask harder questions: can we make it greener, more honest about
where its answers come from, and easier for society to govern?
On this page you'll find my current research, a few recommended reads,
the software we created, the projects we run, and the people I work with.
For the formal record, see the publications page.
Interactive publications
Two illustrated, interactive essays written for a broad audience —
the easiest way into what we do:
Catching Cyberbullies with Neural Networks
— how AI can help moderate online harassment.
Wessel Stoop, Florian Kunneman, Antal van den Bosch, Ben Miller. The Gradient, 2021.
How Algorithms Know What You'll Type Next
— a playful explainer on the technology behind autocomplete.
Wessel Stoop, Antal van den Bosch. The Pudding, 2019.
What I work on
At heart, my research is about one deceptively simple task: predicting the next word.
That is the trick at the core of every modern language model, including ChatGPT.
It sounds mundane, but it turns out you can squeeze an astonishing amount of language,
reasoning and culture out of a really good next-word predictor.
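The core idea can be made concrete with a deliberately minimal sketch: count which word tends to follow which, then always guess the most frequent continuation. (This toy bigram predictor is purely illustrative; real language models condition on much longer contexts.)

```python
from collections import Counter, defaultdict

def train(tokens):
    """Count, for each word, which words were seen following it."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent continuation seen after `word`."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

tokens = "the cat sat on the mat and the cat slept".split()
model = train(tokens)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once
```

Everything beyond this, from autocomplete to ChatGPT, is in some sense a better-informed version of that guess.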
My work sits at the intersection of machine learning — teaching computers from
examples — and language. The models we build are useful in their own right,
but they also speak to deeper questions in linguistics, psychology and neuroscience
about how humans handle language. I particularly enjoy projects that cross disciplines.
Olifant: a lighter alternative to ChatGPT-style AI
Today's chatbots are built on Transformers, an architecture that is powerful
but extremely hungry for electricity. With Maarten van Gompel, Peter Berck, Ko van der Sloot
and a growing team of students, I am reviving an older, simpler idea — memory-based
language models — and showing it can do much of the same work for a fraction
of the energy.
The numbers are dramatic: training our models uses roughly 1,000× less electricity
than training a Transformer of similar capability, and answering questions costs
10 to 100× less. They scale up nicely when given more data, and — unlike
mainstream language models — they can tell you which examples from their training
data a given answer is based on. That is a feature, not a bug: it makes the model's
behaviour easier to inspect and trust.
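The paper describes the actual method; as a rough illustration of the memory-based idea only (not the Olifant implementation, and with all names invented for this sketch), the snippet below stores every context it has seen, predicts by letting the most similar stored examples vote, and reports which training positions support the prediction.

```python
from collections import Counter

def build_memory(tokens, n=3):
    """Store every length-n context with the word that followed it,
    keeping its position so predictions can cite their evidence."""
    return [(tuple(tokens[i:i + n]), tokens[i + n], i)
            for i in range(len(tokens) - n)]

def overlap(a, b):
    """Similarity: how many trailing positions of two contexts match."""
    k = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        k += 1
    return k

def predict(memory, context, k=3):
    """Nearest-neighbour prediction: the k most similar stored contexts
    vote; return the winning word plus the positions that back it."""
    neighbours = sorted(memory, key=lambda m: overlap(m[0], context),
                        reverse=True)[:k]
    votes = Counter(nxt for _, nxt, _ in neighbours)
    word = votes.most_common(1)[0][0]
    evidence = [pos for ctx, nxt, pos in neighbours if nxt == word]
    return word, evidence

tokens = ("the cat sat on the mat the dog sat on the mat "
          "the bird sat on the rug").split()
memory = build_memory(tokens)
word, evidence = predict(memory, ("sat", "on", "the"))
print(word, evidence)  # "mat", supported by training positions [2, 8]
```

Because the "model" is just a table of remembered examples, attribution comes for free: the evidence list points straight back into the training data.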
We released the software as Olifant
(Dutch for "elephant"), wrote up the method in a paper,
and are sharing demos and downloadable models.
From the Olifant paper: estimated CO₂
emissions (measured with CodeCarbon) for letting different language models predict the next words
in a 500,000-token text. The dashed reference lines show everyday emissions for comparison
— a washing-machine run, ten minutes of driving, a litre of milk. Two of our models
stay well below the washing machine, while larger GPT-style systems climb steeply.
Try it yourself: the demo below shows, for any sentence you type,
which earlier examples Olifant is drawing on to predict the next word.
There's also a text generation demo
and a collection of Olifant models
on Hugging Face.
Advising on AI policy
Because I build these systems, I am often asked what to do about them.
With colleagues Fabian Ferrari
and José van Dijck I advise
governments and academic bodies on how generative AI should be regulated.
Sensible policy needs both the technical and the governance side at the table.
Read more in the
UU news item
or in our Nature Machine Intelligence paper.
Publications & books
The full list lives on the publications page;
you can also browse my
Google Scholar
or Semantic Scholar profiles.
With Walter Daelemans I wrote
Memory-based language processing
(Cambridge University Press, 2005), which lays the groundwork for the Olifant line of research above.
Edited volumes
I co-edited volumes on
Arabic computational morphology,
interactive multi-modal question answering,
and language technology for cultural heritage.
Affiliations
Since September 2022 I have been professor of Language, Communication and Computation at the
Faculty of Humanities,
Utrecht University.
Since July 2023 I have served on the Board of NWO, the Dutch Research Council,
as domain chair for Social Sciences and Humanities.
I am a guest professor at CLiPS,
University of Antwerp.
I am a fellow of the European Association for Artificial Intelligence
(EurAI) and a member of the
Royal Netherlands Academy of Arts and Sciences
and the Koninklijke Hollandsche Maatschappij der Wetenschappen.
Earlier positions
2017–2022: director of the Meertens Institute
(Royal Netherlands Academy).
Before that, professor of Language and Speech Technology at Radboud University Nijmegen,
within the Centre for Language Studies
and the Centre for Language and Speech Technology.
1997–2011: at Tilburg University
(ILK Research Group).
PhD at Maastricht University.
Public talks
In the Netherlands and Belgium, newly appointed professors traditionally give a
public inaugural lecture. I have had the privilege three times — at Tilburg (2008),
Radboud Nijmegen (2012), and most recently at KU Leuven KULAK
in 2023, when I held the
Francqui Chair on Language and AI.
All three talks circle the same surprising idea: predicting the next word
is enough to get you a long way. As early as 2008 I argued that next-word predictors
just keep getting better the more text you train them on — ten times more
data buys you a steady, predictable bump in accuracy. That observation
(a log-linear learning curve, related to Heaps' law of vocabulary growth)
quietly underlies the entire boom in large language models we are living through.
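With made-up numbers, the pattern looks like this: a toy log-linear learning curve in which every tenfold increase in training data adds a fixed number of accuracy points. (The base accuracy and per-decade bump below are illustrative placeholders, not measurements from the lectures.)

```python
import math

def accuracy(n_tokens, base=0.40, bump_per_decade=0.05):
    """Toy log-linear learning curve: accuracy is `base` at one
    million tokens; each extra order of magnitude of data adds
    `bump_per_decade` accuracy points."""
    return base + bump_per_decade * math.log10(n_tokens / 1_000_000)

for n in (10**6, 10**7, 10**8, 10**9):
    # each decade of data adds 0.05: 0.40, 0.45, 0.50, 0.55
    print(f"{n:>13,} tokens -> {accuracy(n):.2f}")
```

The bump per decade stays constant even as each decade of data gets ten times more expensive to collect, which is exactly why ever-larger training corpora have kept paying off.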
The lectures themselves are in Dutch:
Past projects
Show research projects I previously led or co-led
- ADNEXT — Adaptive Information Extraction over Time, part of the COMMIT programme.
- Language in Interaction — with Peter Desain, I coordinated WP7 'Utilization' of this NWO Gravitation programme.
- DISCOSUMO — NWO Creative Industry project with Tilburg University and Sanoma.
- TraMOOC — a Horizon 2020 project on machine translation for Massive Open Online Courses.
- Notoriously Toxic — an NEH project on the language and costs of online harassment in games.
- FACT — Folktales as Classifiable Texts, an NWO CATCH project.
- Tunes & Tales — a KNAW Computational Humanities project.
- HiTiME — Historical Timeline Mining and Extraction, an NWO CATCH project.
- MEMPHIX — memory-based paraphrasing.
- Implicit Linguistics — an NWO Vici project on machine learning of text-to-text processing.
- AMICUS — an NWO Internationalisation in the Humanities project on motif discovery in cultural heritage texts.