Natural language generation research wins ERC grant

Friday, 14 January 2022 09:20

“Our goal is to create a universal natural language generator that will be able to learn from only a few examples and will not make mistakes,” says Ondřej Dušek from the Faculty of Mathematics and Physics. If all goes well, in the future users can look forward to virtual assistant technology like Amazon’s Alexa, in Czech, and other applications that are able to clearly summarise even complex data. Dušek this week was awarded a prestigious Starting Grant from the European Research Council (ERC) for research on natural language generation. Four ERC grants in total went to the Czech Republic.

“Imagine you have a weather station that measures temperature, air pressure, wind speed and other parameters. The output is a huge table of data, which is very hard to read. The task of the natural language generator is to automatically reorganise and present in clear terms the most important and relevant information. The generated text should be clear and easy to read,” explains Dušek. Similar applications already exist in sports, for instance, where certain news items - match results and who scored how many goals - are generated automatically.

Essentially, there are two approaches to generating natural language today. Most commercial systems use templates, manually pre-prepared sentences where specific values (such as what time a train leaves) are just filled in during the “generation”. “Preparing templates is very labour-intensive, which is one of the reasons why Alexa is not available in Czech; it's expensive and not worth it for companies because we are only a small market,” the researcher says.
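As a rough illustration of the template approach described above, the Python sketch below fills a hand-written sentence with specific values. The template text, the slot names and the `fill_template` helper are hypothetical and are not taken from any commercial system.

```python
# Hypothetical hand-written template; the slot names are illustrative only.
TEMPLATE = "The train to {destination} leaves at {time} from platform {platform}."


def fill_template(template: str, values: dict) -> str:
    """Fill the slots of a pre-prepared sentence with specific values."""
    return template.format(**values)


sentence = fill_template(
    TEMPLATE,
    {"destination": "Brno", "time": "14:32", "platform": 3},
)
print(sentence)
```

Each new kind of sentence, and each new language, needs its own hand-written template, which is where the labour comes from.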

The second approach is the use of neural networks, which learn from huge amounts of data. “The problem is that these systems need a lot of input data - examples to learn from - to automatically generate texts. Again, it’s laborious and time-consuming when you have to write thousands of sentences as examples.” The computer scientist/linguist adds: “The second problem is that the generated sentences are often inaccurate or contain errors that are very difficult to find. On the face of it, sentences may look very natural, but the content does not match the input specification.”

The best of both worlds

According to the coordinator of the ERC Grants Expert Group, Zdeněk Strakoš, all successful ERC grants got backing because of a unique idea, as he told Forum in the past. Dušek’s project is taking a novel approach by not committing to just one path. “My project takes the best of both worlds - natural outputs and learning from neural networks and explicit control from templates. Today, most natural language generation research focuses on increasingly large and complicated neural network architectures. I, on the other hand, am going back in time a little with this project - in addition to neural networks, I want to use explicit semantic representation, which is hardly ever used in neural systems today, but used to be the only way to generate text, along with hand-written rules,” he explains.

The aim of his five-year ERC project, then, is to use neural networks but limit their scope so that they focus only on generating smooth and fluent sentences and are not responsible for factual information. The facts will be firmly anchored in a semantic representation that can be tracked through the generation process, so that the output sentences can be re-checked.
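One way to picture such a re-check is the small Python sketch below. It is not the project's method, only a simplified assumption: the facts are held in a plain dictionary (the slot names `area`, `food` and `price` are made up for illustration), and the hypothetical `missing_facts` helper reports any value that does not appear in the generated sentence.

```python
def missing_facts(meaning: dict, sentence: str) -> list:
    """Return the slots whose values do not appear in the sentence."""
    text = sentence.lower()
    return [
        slot
        for slot, value in meaning.items()
        if str(value).lower() not in text
    ]


# Illustrative facts the generator is supposed to express.
meaning = {"area": "city centre", "food": "Indian", "price": "expensive"}

good = "This restaurant is in the city centre, serves Indian cuisine and is expensive."
bad = "This restaurant is in the city centre, serves Italian cuisine and is expensive."

print(missing_facts(meaning, good))
print(missing_facts(meaning, bad))
```

A plain string match like this would miss paraphrases, so it should be read only as a picture of the idea that facts kept outside the neural network can be checked against the output.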

“This relates to other sub-goals of our project, such as finding whether a generated sentence is correct - we will be looking for new evaluation methods. Today, we use reference sentences written by humans and try to compare them word by word with automatically generated ones, which is very inaccurate: many words have similar meanings, but in the context of the whole sentence, they can mean something completely different. We will also try to be more efficient in our use of data and computing power - today's neural networks need large amounts of data and huge amounts of computing power to learn,” the project's principal investigator adds. In addition, the semantic representation will allow mathematical and logical operations to be carried out, further increasing flexibility, and the generator will correctly surmise from the score of a sports match, for example, who won and by how much, and adjust the output accordingly.
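The weakness of comparing sentences word by word can be shown with a deliberately crude Python sketch. The `word_overlap` function and the example sentences are assumptions made for illustration and do not correspond to any specific evaluation metric used in research.

```python
def word_overlap(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also occur in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    if not cand:
        return 0.0
    return sum(word in ref for word in cand) / len(cand)


# Human-written reference and two automatically generated candidates.
reference = "sparta beat slavia 3 to 1"
wrong = "slavia beat sparta 3 to 1"            # same words, opposite meaning
paraphrase = "sparta won 3 to 1 against slavia"  # correct, different wording

print(word_overlap(wrong, reference))
print(word_overlap(paraphrase, reference))
```

In this toy case the factually wrong sentence gets a perfect score while the correct paraphrase is penalised, which is the kind of inaccuracy the quote describes.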

Inspiration from research competitions and German studies

“The inspiration for the project came from my postdoctoral fellowship in Edinburgh, where among other things I was looking at the inaccuracies of automatically generated language. My colleagues and I organised a research competition where participants had to use neural networks to create a simple system that generated restaurant recommendations - sentences like 'This restaurant is in the city centre, serves Indian cuisine and is expensive.' Ironically, the teams that succeeded in the competition in terms of accuracy were the ones that used pre-made templates instead of neural networks. And since then, I've been thinking about how to use the best of neural networks but improve their accuracy.”

Dušek’s unconventional approach and his use of semantics were also helped by the fact that, in addition to linguistics at the Faculty of Mathematics and Physics (Matfyz), he also studied German at the Faculty of Arts at Charles University. “I have enjoyed programming and wanted to do computer science since high school. I really enjoyed studying at Matfyz, but gradually I began to miss talking to people. Most of the time we sat at computers and programmed, and since I'm a Prague native, I didn't even live in the dorms where most of my classmates got to know each other,” the researcher recalls. Besides programming, he was also very fond of German, taught by Lenka Vachalovská at the faculty.

“At that time, my younger brother started studying Czech at the Faculty of Arts of Charles University and was thrilled with the atmosphere and the people, so I thought I would give it a try too. So I applied for German studies, got in and finished my studies - we were the last year of a five-year programme,” he says. “The combination was great, it was all connected - I had phonetics courses both at Matfyz and at the Faculty of Arts. The combination of mathematical and linguistic perspectives is useful even in my natural language generation work today. And the Faculty of Arts also taught me how to write, which I still benefit from today,” says Dušek, who considers himself more of a linguist; for example, he is now learning Irish in his spare time, just for fun. “I realise I won’t have any practical use for it, but I enjoy it very much. It’s a curious language, it's completely different from anything I know, but it's also one of the Indo-European languages, so there are still a lot of similarities. And I also love Ireland,” he laughs.

Ondřej Dušek works at the Institute of Formal and Applied Linguistics, part of which is located in the new Holešovice building of Matfyz IMPAKT.

The goal? A universal generator

If all goes well, the ERC project will result in a universal tool that will be able to quickly learn to generate text about new topics. “We want our generator to be able to learn from just a few examples and to generate correctly; not to make up or leave out parts of the information. We also want it to easily generate sentences in languages other than English. We hope that we can make these approaches to generation available to companies and for commercial use. I'm looking forward to having a smart assistant at home that I can speak to in Czech, or open an app on the web that gives me a summary of the day’s news,” he says.

The linguist learned he had won an ERC grant while giving a lecture just ahead of the holidays, when he noticed a message notification from his student in the corner of his screen: “The ERC grant, wow!”. “Of course, I didn't know anything about it, I continued my presentation and only after the lecture was over did I find a congratulatory email from the head of the department, Associate Professor Barbora Vidová Hladká, and several colleagues. But I still didn't have the official announcement; I later found it in my spam folder,” he recalls with a smile. The ERC grant he received is worth 1.5 million euros in funding.

How does Ondřej Dušek feel about the result? “I am delighted and feel very honoured to have received the grant. At the same time, it’s a great responsibility. My thanks go to everyone who helped me to succeed. There is no way I could have done it without the support of my Ph.D. students, colleagues and many other people who gave me feedback or helped me prepare the grant interview presentation. The ERC workshop system, initiated by Professor Strakoš and involving many other people, is absolutely invaluable and has helped me a lot.”

Ondřej Dušek, Ph.D.
Ondřej Dušek studied computational linguistics at the Faculty of Mathematics and Physics and German at the Faculty of Arts at Charles University. He spent two years as a postdoctoral fellow at Heriot-Watt University in Edinburgh. He is now an assistant professor at the Institute of Formal and Applied Linguistics where he researches natural language generation and dialogue systems (chatbots), especially using machine learning and neural networks.
Author: Pavla Hubálková
Photo: Tomáš Rubín, Vladimír Šigut