
Professor of Language, Communication and Computation at Utrecht University.

Board member and domain chair for Social Sciences and Humanities, NWO — Dutch Research Council.

a.p.j.vandenbosch@uu.nl

visiting
Utrecht University, Faculty of Humanities
Trans 10
3512 JK Utrecht
The Netherlands

postal
Utrecht University, Faculty of Humanities
P.O. Box 80125
3508 TC Utrecht
The Netherlands

I study how computers learn language — and what that means for the rest of us. My group builds the kind of AI that powers chatbots like ChatGPT, but we also ask harder questions: can we make it greener, more honest about where its answers come from, and easier for society to govern?

On this page you'll find my current research, a few recommended reads, the software we created, the projects we run, and the people I work with. For the formal record, see the publications page.

Interactive publications

Two illustrated, interactive essays written for a broad audience — the easiest way into what we do:

Catching Cyberbullies with Neural Networks — how AI can help moderate online harassment. Wessel Stoop, Florian Kunneman, Antal van den Bosch, Ben Miller. The Gradient, 2021.

How Algorithms Know What You'll Type Next — a playful explainer on the technology behind autocomplete. Wessel Stoop, Antal van den Bosch. The Pudding, 2019.

What I work on

At heart, my research is about one deceptively simple task: predicting the next word. That is the trick at the core of every modern language model, including ChatGPT. It sounds mundane, but it turns out you can squeeze an astonishing amount of language, reasoning and culture out of a really good next-word predictor.
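For the curious, here is a minimal sketch of the idea (a toy illustration of my own, not our actual software): the simplest possible next-word predictor just counts which word most often followed the current one in some training text.

```python
from collections import Counter, defaultdict

# A toy corpus; any tokenised text would do.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

# For each word, count how often each other word follows it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often after `word`, if any."""
    seen = follows.get(word)
    return seen.most_common(1)[0][0] if seen else None

print(predict_next("the"))  # 'cat' (ties broken by first occurrence)
```

Everything more sophisticated, from n-gram models to memory-based models to Transformers, can be seen as a cleverer way of conditioning those counts on more context.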

My work sits at the intersection of machine learning — teaching computers from examples — and language. The models we build are useful in their own right, but they also speak to deeper questions in linguistics, psychology and neuroscience about how humans handle language. I particularly enjoy projects that cross disciplines.

Olifant: a lighter alternative to ChatGPT-style AI

Today's chatbots are built on Transformers, an architecture that is powerful but extremely hungry for electricity. With Maarten van Gompel, Peter Berck, Ko van der Sloot and a growing team of students, I am reviving an older, simpler idea — memory-based language models — and showing it can do much of the same work for a fraction of the energy.

The numbers are dramatic: training our models uses roughly 1,000× less electricity than training a Transformer of similar capability, and answering questions costs 10 to 100× less energy. They scale up nicely when given more data, and — unlike mainstream language models — they can tell you which examples from their training data a given answer is based on. That is a feature, not a bug: it makes the model's behaviour easier to inspect and trust.
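To give a flavour of how a model can point back to its training data, here is a schematic toy sketch of memory-based prediction (my illustration, not the Olifant implementation; the real similarity metrics, as in Timbl, are weighted rather than this crude overlap count):

```python
from collections import Counter

memory = []  # every (context, next_word) pair from training, kept verbatim

def overlap(a, b):
    """Crude similarity: number of positions with the same word."""
    return sum(x == y for x, y in zip(a, b))

def train(tokens, n=3):
    """Store each n-word context together with the word that followed it."""
    for i in range(len(tokens) - n):
        memory.append((tuple(tokens[i:i + n]), tokens[i + n]))

def predict(context):
    """Majority vote among the most similar stored contexts; also return
    those training examples themselves, so the answer is inspectable."""
    best = max(overlap(ctx, context) for ctx, _ in memory)
    neighbours = [(ctx, nxt) for ctx, nxt in memory
                  if overlap(ctx, context) == best]
    word = Counter(nxt for _, nxt in neighbours).most_common(1)[0][0]
    return word, neighbours

train("the cat sat on the mat and the dog sat on the rug".split())
word, sources = predict(("sat", "on", "the"))
print(word, sources)  # the prediction plus the exact examples it rests on
```

Because the prediction is literally a vote among stored training examples, handing those examples back alongside the answer comes essentially for free.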

We released the software as Olifant (Dutch for "elephant"), wrote up the method in a paper, and are sharing demos and downloadable models.

From the Olifant paper: estimated CO2 emissions (measured with CodeCarbon) for letting different language models predict the next words in a 500,000-token text. The dashed reference lines show everyday emissions for comparison — a washing-machine run, ten minutes of driving, a litre of milk. Two of our models stay well below the washing machine, while larger GPT-style systems climb steeply.

Try it yourself: the demo below shows, for any sentence you type, which earlier examples Olifant is drawing on to predict the next word. There's also a text generation demo and a collection of Olifant models on Hugging Face.

Advising on AI policy

Because I build these systems, I am often asked what to do about them. With colleagues Fabian Ferrari and José van Dijck I advise governments and academic bodies on how generative AI should be regulated. Sensible policy needs both the technical and the governance side at the table. Read more in the UU news item or in our Nature Machine Intelligence paper.

Publications & books

The full list lives on the publications page; you can also browse my Google Scholar or Semantic Scholar profiles.

With Walter Daelemans I wrote Memory-based language processing (Cambridge University Press, 2005), which lays the groundwork for the Olifant line of research above.


Edited volumes

I co-edited volumes on Arabic computational morphology, interactive multi-modal question answering, and language technology for cultural heritage.

Affiliations

Since September 2022 I have been professor of Language, Communication and Computation at the Faculty of Humanities, Utrecht University. Since July 2023 I serve on the Board of NWO, the Dutch Research Council, as domain chair for Social Sciences and Humanities. I am a guest professor at CLiPS, University of Antwerp.

I am a fellow of the European Association for Artificial Intelligence (EurAI) and a member of the Royal Netherlands Academy of Arts and Sciences and the Koninklijke Hollandsche Maatschappij der Wetenschappen.

Earlier positions

2017–2022: director of the Meertens Institute (Royal Netherlands Academy). Before that, professor of Language and Speech Technology at Radboud University Nijmegen, within the Centre for Language Studies and the Centre for Language and Speech Technology. 1997–2011: at Tilburg University (ILK Research Group). PhD at Maastricht University.

Public talks

In the Netherlands and Belgium, newly appointed professors traditionally give a public inaugural lecture. I have had the privilege three times — at Tilburg (2008), Radboud Nijmegen (2012), and most recently at KU Leuven KULAK in 2023, when I held the Francqui Chair on Language and AI.

All three talks circle the same surprising idea: predicting the next word is enough to get you a long way. Already in 2008 I argued that next-word predictors just keep getting better the more text you train them on: ten times more data buys you a steady, predictable bump in accuracy. That log-linear learning curve (a relative of Heaps' Law, which says a growing corpus never stops yielding new vocabulary) quietly underlies the entire boom in large language models we are living through.
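As a back-of-the-envelope illustration (the numbers here are invented for the example, not measurements from the lectures), a log-linear learning curve looks like this:

```python
import math

# Illustrative log-linear learning curve: accuracy grows by a constant
# amount for every tenfold increase in training data. The base accuracy
# and the gain per decade are made-up parameters for the example.
def accuracy(n_tokens, base=0.50, gain_per_decade=0.05):
    return base + gain_per_decade * math.log10(n_tokens / 1e6)

for n in [1e6, 1e7, 1e8, 1e9]:
    print(f"{n:>13,.0f} tokens -> {accuracy(n):.2f}")
```

Each line of output adds the same fixed increment: going from a million to a billion tokens buys three equal bumps in accuracy.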

The lectures themselves are in Dutch.

Past projects

Research projects I previously led or co-led:

  • ADNEXT — Adaptive Information Extraction over Time, part of the COMMIT programme.
  • Language in Interaction — with Peter Desain, I coordinated WP7 'Utilization' of this NWO Gravitation programme.
  • DISCOSUMO — NWO Creative Industry project with Tilburg University and Sanoma.
  • TraMOOC — a Horizon 2020 project on machine translation for Massive Open Online Courses.
  • Notoriously Toxic — an NEH project on the language and costs of online harassment in games.
  • FACT — Folktales as Classifiable Texts, an NWO CATCH project.
  • Tunes & Tales — a KNAW Computational Humanities project.
  • HiTiME — Historical Timeline Mining and Extraction, an NWO CATCH project.
  • MEMPHIX — memory-based paraphrasing.
  • Implicit Linguistics — an NWO Vici project on machine learning of text-to-text processing.
  • AMICUS — an NWO Internationalisation in the Humanities project on motif discovery in cultural heritage texts.

Current projects

Better-Mods develops tools to help citizens get a clearer picture of debates happening on online platforms; it is a collaboration with nu.nl and the TULP group at Tilburg University, funded by NWO. A sister project, also NWO-funded, produced Wie Is De Trol? ("who is the troll?"), a classroom game about online manipulation, with partners NEMO Kennislink, KNAW Meertens Instituut, Netwerk Mediawijsheid, and Alliantie Digitaal Samenleven.

Cultural AI is a lab for "culturally aware" AI — AI systems that take seriously how subtle, plural and contested human culture is, rather than flattening it.

Teaching

Transformers: Applications in Language and Communication — block 3 of the Applied Data Science Master at Utrecht University.

This course introduces students to the Transformer — the T in GPT — the workhorse architecture that, since ChatGPT was released in late 2022, has reshaped AI overnight. At its core the Transformer does just one thing: predict the next word. So how does a next-word predictor pass exams? How does it appear to reason? Where does it fail, and where is the field heading next? Those are the questions we work through.
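For a taste of the mechanics, here is the standard scaled dot-product attention computation, the operation that gives the Transformer its power. This is a generic textbook sketch in NumPy, not code from the course:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, the core Transformer operation:
    each position mixes in information from the positions it attends to."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of each token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                        # weighted blend of the values

# Three toy token vectors of dimension 4, attending to each other.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
print(attention(x, x, x).shape)  # (3, 4): one context-mixed vector per token
```

Stack this operation in layers, train the whole thing to predict the next word, and you have the skeleton of a GPT-style model.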

PhD students I'm currently supervising

  • Shane Kaszefski Yashuk (with Massimo Poesio and Dong Nguyen)
  • Golshid Shekoufandeh (with Paul Boersma)
  • Joris Veerbeek (with Karin van Es and Mirko Schäfer)
  • Xiao Xu (with Anne Gauthier and Gert Stulp)
  • Ronja van Zijverden (with Marloes van Moort, Karin Fikkers, and Hans Hoeken)

Former PhD students

2020s

2010s

2000s

  • Toine Bogers, Aalborg University Copenhagen
  • Sabine Buchholz, Capito Systems
  • Sander Canisius, Netherlands Cancer Institute
  • Iris Hendrickx, Radboud University
  • Piroska Lendvai, Bavarian Academy of Science and Arts
  • Laura Maruster, University of Groningen
  • Stephan Raaijmakers, TNO and Leiden University
  • Martin Reynaert, University of Amsterdam

Software we have released

Besides papers, our projects produce software. Where we can, we release it under open-source licenses; some packages also run as webservices with a friendly interface, and several have been folded into national and European research infrastructure.

A few highlights:

  • Olifant — the memory-based language model from the section above. Built on a long line of earlier work, including the WOPR system from 2010 that — with much less data and far smaller computers — already predicted next words for fun. With Peter Berck, Maarten van Gompel, Ko van der Sloot, Teun Buijse and Ainhoa Risco Paton.
  • Frog — an all-in-one tool for analysing Dutch text: it identifies parts of speech, finds word stems, and untangles sentence structure. With the Frog development team.
  • T-Scan (also a web tool) — analyses Dutch text for features that correlate with reading difficulty.
  • Timbl — the Tilburg Memory-Based Learner. The classic machine-learning toolkit that powers much of our memory-based work. With Ko van der Sloot, Walter Daelemans and Jakub Zavrel.

More software and digital infrastructure

Natural language processing

  • Valkuil.net and Fowlt.net — context-sensitive spelling correctors for Dutch and English.
  • Colibri Core — efficient n-gram and skip-gram modelling. With Maarten van Gompel.
  • Mbt — memory-based part-of-speech tagger. With Ko van der Sloot, Jakub Zavrel and Walter Daelemans.


WOPR in 2010 generating "word salad" — what a tiny precursor to ChatGPT looked like.

Digital research infrastructure

  • CLARIAH — Common Lab Research Infrastructure for the Arts and Humanities.
  • Nederlab — brings together digitised Dutch texts from the Middle Ages to today in one searchable interface (funded by NWO).
  • FutureTDM — a Horizon 2020 action on text and data mining.
  • TwiNL — a Netherlands eScience Center project with Erik Tjong Kim Sang.
  • ISHER — Integrated Social History Environment for Research.

Selected media

English

Dutch

Games & consumer products
