
Professor of Language, Communication and Computation at Utrecht University.

Board member and domain chair for Social Sciences and Humanities, NWO — Dutch Research Council.

a.p.j.vandenbosch@uu.nl

visiting
Utrecht University, Faculty of Humanities
Trans 10
3512 JK Utrecht
The Netherlands

postal
Utrecht University, Faculty of Humanities
P.O. Box 80125
3508 TC Utrecht
The Netherlands

I study how computers learn language — and what that means for the rest of us. My group builds the kind of AI that powers chatbots like ChatGPT, but we also ask harder questions: can we make it greener, more honest about where its answers come from, and easier for society to govern?

On this page you'll find my current research, a few recommended reads, the software we created, the projects we run, and the people I work with. For the formal record, see the publications page.

Interactive publications

Two illustrated, interactive essays written for a broad audience — the easiest way into what we do:

Catching Cyberbullies with Neural Networks — how AI can help moderate online harassment. Wessel Stoop, Florian Kunneman, Antal van den Bosch, Ben Miller. The Gradient, 2021.

How Algorithms Know What You'll Type Next — a playful explainer on the technology behind autocomplete. Wessel Stoop, Antal van den Bosch. The Pudding, 2019.

What I work on

At heart, my research is about one deceptively simple task: predicting the next word. That is the trick at the core of every modern language model, including ChatGPT. It sounds mundane, but it turns out you can squeeze an astonishing amount of language, reasoning and culture out of a really good next-word predictor.
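For the curious, here is a minimal sketch of the idea (a toy illustration of my own, not our actual software): the simplest possible next-word predictor just counts which word most often followed the current one in some training text.

```python
from collections import Counter, defaultdict

# A toy corpus; any tokenised text would do.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

# For each word, count how often each other word follows it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often after `word`, if any."""
    seen = follows.get(word)
    return seen.most_common(1)[0][0] if seen else None

print(predict_next("the"))  # 'cat' (ties broken by first occurrence)
```

Everything more sophisticated, from n-gram models to memory-based models to Transformers, can be seen as a cleverer way of conditioning those counts on more context.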

My work sits at the intersection of machine learning — teaching computers from examples — and language. The models we build are useful in their own right, but they also speak to deeper questions in linguistics, psychology and neuroscience about how humans handle language. I particularly enjoy projects that cross disciplines.

Olifant: a lighter alternative to ChatGPT-style AI

Today's chatbots are built on Transformers, an architecture that is powerful but extremely hungry for electricity. With Maarten van Gompel, Peter Berck, Ko van der Sloot and a growing team of students, I am reviving an older, simpler idea — memory-based language models — and showing it can do much of the same work for a fraction of the energy.

The numbers are dramatic: training our models uses roughly 1,000× less electricity than training a Transformer of similar capability, and answering questions costs 10 to 100× less energy. They scale up nicely when given more data, and — unlike mainstream language models — they can tell you which examples from their training data a given answer is based on. That is a feature, not a bug: it makes the model's behaviour easier to inspect and trust.
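To give a flavour of how a model can point back to its training data, here is a schematic toy sketch of memory-based prediction (my illustration, not the Olifant implementation; the real similarity metrics, as in Timbl, are weighted rather than this crude overlap count):

```python
from collections import Counter

memory = []  # every (context, next_word) pair from training, kept verbatim

def overlap(a, b):
    """Crude similarity: number of positions with the same word."""
    return sum(x == y for x, y in zip(a, b))

def train(tokens, n=3):
    """Store each n-word context together with the word that followed it."""
    for i in range(len(tokens) - n):
        memory.append((tuple(tokens[i:i + n]), tokens[i + n]))

def predict(context):
    """Majority vote among the most similar stored contexts; also return
    those training examples themselves, so the answer is inspectable."""
    best = max(overlap(ctx, context) for ctx, _ in memory)
    neighbours = [(ctx, nxt) for ctx, nxt in memory
                  if overlap(ctx, context) == best]
    word = Counter(nxt for _, nxt in neighbours).most_common(1)[0][0]
    return word, neighbours

train("the cat sat on the mat and the dog sat on the rug".split())
word, sources = predict(("sat", "on", "the"))
print(word, sources)  # the prediction plus the exact examples it rests on
```

Because the prediction is literally a vote among stored training examples, handing those examples back alongside the answer comes essentially for free.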

We released the software as Olifant (Dutch for "elephant"), wrote up the method in a paper, and are sharing demos and downloadable models.

From the Olifant paper: estimated CO2 emissions (measured with CodeCarbon) for letting different language models predict the next words in a 500,000-token text. The dashed reference lines show everyday emissions for comparison — a washing-machine run, ten minutes of driving, a litre of milk. Two of our models stay well below the washing machine, while larger GPT-style systems climb steeply.

Try it yourself: the demo below shows, for any sentence you type, which earlier examples Olifant is drawing on to predict the next word. There's also a text generation demo and a collection of Olifant models on Hugging Face.

Advising on AI policy

Because I build these systems, I am often asked what to do about them. With colleagues Fabian Ferrari and José van Dijck I advise governments and academic bodies on how generative AI should be regulated. Sensible policy needs both the technical and the governance side at the table. Read more in the UU news item or in our Nature Machine Intelligence paper.

Publications & books

The full list lives on the publications page; you can also browse my Google Scholar or Semantic Scholar profiles.

With Walter Daelemans I wrote Memory-based language processing (Cambridge University Press, 2005), which lays the groundwork for the Olifant line of research above.


Edited volumes

I co-edited volumes on Arabic computational morphology, interactive multi-modal question answering, and language technology for cultural heritage.

Affiliations

Since September 2022 I have been professor of Language, Communication and Computation at the Faculty of Humanities, Utrecht University. Since July 2023 I serve on the Board of NWO, the Dutch Research Council, as domain chair for Social Sciences and Humanities. I am a guest professor at CLiPS, University of Antwerp.

I am a fellow of the European Association for Artificial Intelligence (EurAI) and a member of the Royal Netherlands Academy of Arts and Sciences and the Koninklijke Hollandsche Maatschappij der Wetenschappen.

Earlier positions

2017–2022: director of the Meertens Institute (Royal Netherlands Academy). Before that, professor of Language and Speech Technology at Radboud University Nijmegen, within the Centre for Language Studies and the Centre for Language and Speech Technology. 1997–2011: at Tilburg University (ILK Research Group). PhD at Maastricht University.

Public talks

In the Netherlands and Belgium, newly appointed professors traditionally give a public inaugural lecture. I have had the privilege three times — at Tilburg (2008), Radboud Nijmegen (2012), and most recently at KU Leuven KULAK in 2023, when I held the Francqui Chair on Language and AI.

All three talks circle the same surprising idea: predicting the next word is enough to get you a long way. Already in 2008 I argued that next-word predictors just keep getting better the more text you train them on: ten times more data buys you a steady, predictable bump in accuracy. That log-linear learning curve (a relative of Heaps' Law, which says a growing corpus never stops yielding new vocabulary) quietly underlies the entire boom in large language models we are living through.
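As a back-of-the-envelope illustration (the numbers here are invented for the example, not measurements from the lectures), a log-linear learning curve looks like this:

```python
import math

# Illustrative log-linear learning curve: accuracy grows by a constant
# amount for every tenfold increase in training data. The base accuracy
# and the gain per decade are made-up parameters for the example.
def accuracy(n_tokens, base=0.50, gain_per_decade=0.05):
    return base + gain_per_decade * math.log10(n_tokens / 1e6)

for n in [1e6, 1e7, 1e8, 1e9]:
    print(f"{n:>13,.0f} tokens -> {accuracy(n):.2f}")
```

Each line of output adds the same fixed increment: going from a million to a billion tokens buys three equal bumps in accuracy.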

The lectures themselves are in Dutch.

Past projects

Research projects I previously led or co-led:

  • ADNEXT — Adaptive Information Extraction over Time, part of the COMMIT programme.
  • Language in Interaction — with Peter Desain, I coordinated WP7 'Utilization' of this NWO Gravitation programme.
  • DISCOSUMO — NWO Creative Industry project with Tilburg University and Sanoma.
  • TraMOOC — a Horizon 2020 project on machine translation for Massive Open Online Courses.
  • Notoriously Toxic — an NEH project on the language and costs of online harassment in games.
  • FACT — Folktales as Classifiable Texts, an NWO CATCH project.
  • Tunes & Tales — a KNAW Computational Humanities project.
  • HiTiME — Historical Timeline Mining and Extraction, an NWO CATCH project.
  • MEMPHIX — memory-based paraphrasing.
  • Implicit Linguistics — an NWO Vici project on machine learning of text-to-text processing.
  • AMICUS — an NWO Internationalisation in the Humanities project on motif discovery in cultural heritage texts.

Current projects

Better-Mods develops tools to help citizens get a clearer picture of debates happening on online platforms; it is a collaboration with nu.nl and the TULP group at Tilburg University, funded by NWO. A sister project, also NWO-funded, produced Wie Is De Trol? ("who is the troll?"), a classroom game about online manipulation, with partners NEMO Kennislink, KNAW Meertens Instituut, Netwerk Mediawijsheid, and Alliantie Digitaal Samenleven.

Cultural AI is a lab for "culturally aware" AI — AI systems that take seriously how subtle, plural and contested human culture is, rather than flattening it.

Teaching

Transformers: Applications in Language and Communication — block 3 of the Applied Data Science Master at Utrecht University.

This course introduces students to the Transformer — the T in GPT — the workhorse architecture that, since ChatGPT was released in late 2022, has reshaped AI overnight. At its core the Transformer does just one thing: predict the next word. So how does a next-word predictor pass exams? How does it appear to reason? Where does it fail, and where is the field heading next? Those are the questions we work through.
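For a taste of the mechanics, here is the standard scaled dot-product attention computation, the operation that gives the Transformer its power. This is a generic textbook sketch in NumPy, not code from the course:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, the core Transformer operation:
    each position mixes in information from the positions it attends to."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of each token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                        # weighted blend of the values

# Three toy token vectors of dimension 4, attending to each other.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
print(attention(x, x, x).shape)  # (3, 4): one context-mixed vector per token
```

Stack this operation in layers, train the whole thing to predict the next word, and you have the skeleton of a GPT-style model.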

PhD students I'm currently supervising

  • Shane Kaszefski Yashuk (with Massimo Poesio and Dong Nguyen)
  • Golshid Shekoufandeh (with Paul Boersma)
  • Joris Veerbeek (with Karin van Es and Mirko Schäfer)
  • Xiao Xu (with Anne Gauthier and Gert Stulp)
  • Ronja van Zijverden (with Marloes van Moort, Karin Fikkers, and Hans Hoeken)

Former PhD students

2020s

2010s

2000s

  • Toine Bogers, Aalborg University Copenhagen
  • Sabine Buchholz, Capito Systems
  • Sander Canisius, Netherlands Cancer Institute
  • Iris Hendrickx, Radboud University
  • Piroska Lendvai, Bavarian Academy of Science and Arts
  • Laura Maruster, University of Groningen
  • Stephan Raaijmakers, TNO and Leiden University
  • Martin Reynaert, University of Amsterdam

Software we have released

Besides papers, our projects produce software. Where we can, we release it under open-source licenses; some packages also run as webservices with a friendly interface, and several have been folded into national and European research infrastructure.

A few highlights:

  • Olifant — the memory-based language model from the section above. Built on a long line of earlier work, including the WOPR system from 2010 that — with much less data and far smaller computers — already predicted next words for fun. With Peter Berck, Maarten van Gompel, Ko van der Sloot, Teun Buijse and Ainhoa Risco Paton.
  • Frog — an all-in-one tool for analysing Dutch text: it identifies parts of speech, finds word stems, and untangles sentence structure. With the Frog development team.
  • T-Scan (also a web tool) — analyses Dutch text for features that correlate with reading difficulty.
  • Timbl — the Tilburg Memory-Based Learner. The classic machine-learning toolkit that powers much of our memory-based work. With Ko van der Sloot, Walter Daelemans and Jakub Zavrel.

More software and digital infrastructure

Natural language processing

  • Valkuil.net and Fowlt.net — context-sensitive spelling correctors for Dutch and English.
  • Colibri Core — efficient n-gram and skip-gram modelling. With Maarten van Gompel.
  • Mbt — memory-based part-of-speech tagger. With Ko van der Sloot, Jakub Zavrel and Walter Daelemans.


WOPR in 2010 generating "word salad" — what a tiny precursor to ChatGPT looked like.

Digital research infrastructure

  • CLARIAH — Common Lab Research Infrastructure for the Arts and Humanities.
  • Nederlab — brings together digitised Dutch texts from the Middle Ages to today in one searchable interface (funded by NWO).
  • FutureTDM — a Horizon 2020 action on text and data mining.
  • TwiNL — a Netherlands eScience Center project with Erik Tjong Kim Sang.
  • ISHER — Integrated Social History Environment for Research.

Selected media

English

Dutch

Games & consumer products
