In my research I develop machine learning and language
technology. Most of my work involves the intersection of the two
fields: computers that learn to understand and generate natural
language, nowadays known as Generative AI and Large Language Models.
The computational models that this work produces have all kinds of applications
in other areas of scholarly research as well as in society and industry. They
also link in interesting ways to theories and developments in linguistics,
psycholinguistics, neurolinguistics, and sociolinguistics. I love
multidisciplinary collaborations that make advances in all of these areas.
My latest work, with students and with long-time team members Maarten van Gompel,
Peter Berck, and Ko van der Sloot,
revives a specific technology, memory-based language models, as an eco-friendly
alternative to the energy-hungry Transformer architecture used in present-day
chatbots. Training our models consumes
an estimated 1,000 times less electricity than
training Transformers. During inference, they consume 10 to 100 times less
electricity. They scale well and get better (and hardly slower) when trained
on more data. By design, they also memorize their training data with
reasonable accuracy. We have released
a software package called Olifant,
written a paper, and
are developing demos and downloadable models.
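To give a flavour of the memory-based idea, here is a greatly simplified sketch in Python (not how Olifant is actually implemented; all names and the toy corpus below are made up for illustration). Next-word prediction is treated as looking up the current context in a table of memorized training contexts:

    from collections import Counter, defaultdict

    def train(tokens, context_size=3):
        # "Training" is just memorizing every context -> next-word pair; no gradient descent.
        memory = defaultdict(Counter)
        for i in range(context_size, len(tokens)):
            context = tuple(tokens[i - context_size:i])
            memory[context][tokens[i]] += 1
        return memory

    def predict_next(memory, context, context_size=3):
        # Back off from the longest to shorter context suffixes until a memorized match is found.
        context = tuple(context)[-context_size:]
        for length in range(len(context), 0, -1):
            suffix = context[-length:]
            votes = Counter()
            for stored, nexts in memory.items():
                if stored[-length:] == suffix:
                    votes.update(nexts)
            if votes:
                return votes.most_common(1)[0][0]
        return None

    corpus = "the cat sat on the mat and the cat sat on the chair".split()
    model = train(corpus)
    print(predict_next(model, "the cat sat".split()))  # prints 'on'

Because training here amounts to storing context-to-word pairs rather than running gradient descent, this sketch also conveys the intuition behind the much lower training energy use.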
Image from the Olifant paper showing estimated CO2
emissions (based on electricity usage monitored by CodeCarbon) of having our Olifant models predict
the next tokens in a 500-thousand-token validation text. The graph includes emission estimates for
the older GPT-2 models and the slightly newer GPT-Neo. The x-axis is logarithmic and shows the number
of training tokens the models were trained on. For reference, the graph also shows the amount of CO2 emitted by
a washing machine run, a family car driving for 10 minutes, a tumble dryer run, producing 1 kg of
steel, and producing 1 l of cow milk. Two of our models stay well below the washing machine run,
while larger GPT systems consume increasingly more electricity.
I am also involved in advising academic and governmental bodies on the
implications of generative AI, including the types of large language models I have
been developing and working with. Answering the question of how to regulate
these technologies requires synergistic collaborations between AI and
governance researchers. Read more on this topic:
here
and here, with colleagues
Fabian Ferrari and
José van Dijck.
My CV has more
detailed information. Also, check out my blog
for some loose thoughts and sketches.
It is customary for newly appointed professors in the Netherlands to give a public lecture at the
start of a new professorship.
I had the privilege to do so at Tilburg University (2008) and at Radboud University, Nijmegen (2012).
I did so again in 2023, when I was honored to be awarded the
KULAK Francqui Leerstoel 2023 on Language and AI at KULAK,
the Kortrijk campus of Leuven University.
All three public lectures revolve around my amazement with the deceptively mundane task of
predicting the next word, the magic trick that
makes GPT-based language models work. A point I made in 2008
is that data-driven word predictors keep getting more accurate when trained on more data.
This relation is log-linear:
every ten-fold increase of the training data leads to an approximately constant improvement in the
percentage of correctly predicted words. This is expected given
Heaps' Law.
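As a sketch in formula form (with hypothetical constants a and b, not figures from the lectures): writing N for the number of training tokens,

    \text{accuracy}(N) \approx a + b \,\log_{10} N, \qquad \text{hence} \qquad \text{accuracy}(10N) - \text{accuracy}(N) \approx b,

so every tenfold increase of the training data adds roughly the same b percentage points of correctly predicted words.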
Taal in uitvoering, inaugural address, Radboud University, November 9, 2012
Het volgende
woord, inaugural address, Tilburg University, October 10, 2008
(animations shared on YouTube; the animation below shows word sequences in Dutch
in which the next word is entirely predictable)
Past projects & highlights
Research projects
ADNEXT, Adaptive Information Extraction over Time, a work package of the Infiniti project, part of COMMIT.
Better-Mods, a research project developing new tools that allow citizens to be better informed about current debates on online platforms. A collaboration with nu.nl and the TULP research group at Tilburg University, funded by NWO. Aligned with this project, the classroom game Wie Is De Trol? (Who is the troll?) was developed, also funded by NWO (NWA Wetenschapscommunicatie) and with partners NEMO Kennislink, KNAW Meertens Instituut, Netwerk Mediawijsheid, and Alliantie Digitaal Samenleven.
Cultural AI, a lab for culturally valued AI. Cultural AI is the study, design and development of socio-technological AI systems that are implicitly or explicitly aware of the subtle and subjective complexity of human culture.
This course introduces you to Transformers. With the release of ChatGPT in November 2022, the Transformer, the T in GPT, firmly took its position as the number-one workhorse in AI.
It changed the field of natural language processing overnight, and its industrial impact may well be huge and lasting.
Yet, at its core, the workhorse can do only one thing: predict the next word.
How can a next-word predictor do well on school tests? How is it able to reason?
What are the problematic aspects of this technology, and what are the current developments?
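As a concrete illustration of that single trick, here is a minimal sketch (not course material) that assumes the publicly available GPT-2 model and the Hugging Face transformers and torch packages: generating text is nothing more than predicting the next token over and over.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load a small public model; GPT-2 is a 2019 predecessor of the models behind ChatGPT.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    prompt = "The next word is"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # Generation: repeatedly predict the most probable next token and append it to the input.
    with torch.no_grad():
        for _ in range(20):
            logits = model(input_ids).logits
            next_id = logits[0, -1].argmax()          # greedy choice of the next token
            input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

    print(tokenizer.decode(input_ids[0]))

Everything a chatbot does, from answering questions to apparent reasoning, is built on this loop (usually with sampling instead of the greedy choice used in this sketch).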
Current (co-)supervised Ph.D. students
Shane Kaszefski Yashuk (with Massimo Poesio and Dong Nguyen)
Golshid Shekoufandeh (with Paul Boersma)
Joris Veerbeek (with Karin van Es and Mirko Schäfer)
Xiao Xu (with Anne Gauthier and Gert Stulp)
Ronja van Zijverden (with Marloes van Moort, Karin Fikkers, and Hans Hoeken)
Aside from papers and dissertations, our projects tend to produce software.
We make a point of maximizing the availability of this software by releasing the best software projects under
open source licenses. Some of our software packages, particularly those that perform a natural language
processing function, are available as web services,
usually with a web interface. Much of this software has been integrated into larger digital research
infrastructures at the national and European level.
As part of past and ongoing projects with many colleagues, I was involved in developing the following software
and digital research infrastructure:
Natural Language Processing
Frog: Dutch tagger-lemmatizer, morphological analyzer,
and dependency parser. With the Frog development team.
Olifant: Memory-based language
modeling. With Peter Berck, Maarten van Gompel, Ko van der Sloot, Teun Buijse, Ainhoa Risco Paton.
Described in this paper. Olifant is based on
older work on memory-based spelling and grammar correction with WOPR. Below is a screenshot from
March 2010 in which you see WOPR
in generative mode, predicting "Word Salads": consecutive next words based on an initial prompt.
This is what predecessors of ChatGPT looked like
at a time when we had little data and limited high-performance computing facilities.
T-Scan (also available as
web interface and web service): Dutch text analyzer for computing text
features indicative of text complexity. With Rogier Kraf, Henk Pander Maat, Ko van der Sloot, Maarten van Gompel,
Florian Kunneman, Susanne Kleijn, Ted Sanders, Maarten van der Klis, Sheean Spoel, Luka van der Plas.
Valkuil.net and Fowlt.net:
Dutch and English context-sensitive spelling correctors. With Maarten van Gompel, Wessel Stoop,
Tanja Gaustad van Zaanen, and Monica Hajek.
Colibri Core: Efficient n-gram and skipgram
modeling. With Maarten van Gompel.
Mbt: Memory-based tagger-generator and tagger.
With Ko van der Sloot, Jakub Zavrel, and Walter Daelemans.
Machine Learning
Timbl: Tilburg memory-based learner. With Ko van der Sloot, Walter Daelemans, and Jakub Zavrel.
Digital Research Infrastructure
CLARIAH, Common Lab Research Infrastructure for the Arts and Humanities.
Nederlab, bringing together massive amounts of digitized Dutch texts from the Middle Ages to the present in one user-friendly and tool-enriched web interface. Funded by NWO.
FutureTDM, a Horizon 2020 coordination and support action on reducing barriers and increasing uptake of text and data mining for research environments.