Prof. Dr Birgitta König-Ries (l.) and Felicitas Löffler.

Searching for gold in a data lake

Scientific work produces large amounts of data. To keep this data usable for future research questions, researchers are setting up a »data lake«.

Image: Jens Meyer (University of Jena)

Whether in the newly launched »ThWIC« water innovation cluster or in successfully established research networks, scientific work always produces a wealth of data. Measurements, texts, images, digitized historical artefacts or computer simulations are collected, documented and stored. But what happens to the information once the research project has ended and the results have been published? Researchers are setting up a »data lake« so that the data they have collected can also be used for future research questions.

By Ute Schönfelder

This data lake is a piece of cloud infrastructure that is currently being set up in the University Computer Centre as part of a water cluster sub-project (»ThWICData«). In addition to the University Computer Centre, the »Competence Center Digital Research« at the University of Jena and an external business partner are also involved in the project. The aim is to centralize, store, edit and archive data and information from all 22 »ThWIC« projects for long-term use. The University of Jena has taken on the role of cloud provider. The data from each individual cluster project – from streamed sensor data from sewage pipes and information from sociological interviews to image, sound and text data – will »flow« into the data lake via automated »pipelines«. »We're creating a centralized access structure for this purpose, through which the partners involved in each project will initially have access. In the long term, however, we also want to make data accessible to the general public,« says Felicitas Löffler from the Institute of Computer Science, who coordinates the cluster's data science division.

But what are the advantages of setting up a huge cloud and centralizing data instead of storing it locally at the locations where it is collected? »This mainly helps us secure our research data in the long term,« explains Prof. Birgitta König-Ries. The holder of the Heinz-Nixdorf Professorship of Distributed Information Systems points out that long-term data availability has become increasingly important in recent years – and not only in research. The German government set up the Council for Scientific Information Infrastructures in 2014, and the National Research Data Infrastructure was subsequently launched in 2020. As part of this initiative, Prof. König-Ries is involved in a project relating to the management of biodiversity research data. »In order to make the research data available for long-term use across borders and disciplines, it's stored according to the FAIR principles,« says König-Ries. FAIR stands for Findable, Accessible, Interoperable and Reusable.

Open formats and precise metadata

Sustainable research data management involves much more than long-term storage. Storing data for many years has long been part of good scientific practice. »However, this often meant that data was simply stored somewhere and couldn't be used by others, because it wasn't locatable or understandable,« says König-Ries. That's why it is important to describe the data as precisely as possible. These descriptions, known as metadata, contain additional information, such as details about the methods used and the underlying research questions, that allows people who didn't collect the data themselves to interpret and understand it. »It's also important that the metadata is stored in open formats that are universally understandable«.
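
To illustrate what a metadata record in an open format might look like, here is a minimal sketch in Python. It is not the schema used in »ThWIC«: the field names, the example values and the choice of JSON as the open format are all assumptions made purely for illustration.

```python
import json


def build_metadata(title, creator, method, research_question, keywords):
    """Assemble a hypothetical metadata record (field names are illustrative)."""
    return {
        "title": title,
        "creator": creator,
        "method": method,
        "research_question": research_question,
        "keywords": sorted(keywords),
    }


def to_open_format(record):
    """Serialize the record as JSON, a plain-text, openly specified format."""
    return json.dumps(record, ensure_ascii=False, indent=2, sort_keys=True)


# Invented example values, not real project data.
record = build_metadata(
    title="Example sensor measurements",
    creator="Example research group",
    method="Streamed sensor readings",
    research_question="How does flow change over time?",
    keywords=["water", "sensor"],
)

text = to_open_format(record)
restored = json.loads(text)  # Any JSON-aware tool can read the record back.
```

Because the record is plain text with named fields, someone who did not collect the data can still read how it was obtained and what question it was meant to answer.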

In the water cluster, Birgitta König-Ries and Felicitas Löffler are working together with a business partner as part of the »ThWICSonar« project with the aim of automatically capturing and tagging text documents. »We want to set up an information system that can be used to automatically monitor documents, classify them by subject and describe them«. They are doing this with the help of artificial intelligence. »First of all, we have to tag documents manually so that we can then train the algorithms. This will create a language model to enable automatic tagging in the future«. The prepared documents will then be proactively recommended to various user groups within the cluster.
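
The idea of tagging documents by hand first and then letting an algorithm learn from those examples can be sketched in a few lines of Python. The sketch below is not the »ThWICSonar« system and does not build a language model; it uses a simple naive Bayes word-count classifier as a stand-in, and the example texts, tags and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def train(labelled_docs):
    """Count words per tag from manually tagged (text, tag) pairs."""
    word_counts = defaultdict(Counter)
    tag_counts = Counter()
    vocab = set()
    for text, tag in labelled_docs:
        tokens = tokenize(text)
        word_counts[tag].update(tokens)
        tag_counts[tag] += 1
        vocab.update(tokens)
    return word_counts, tag_counts, vocab


def predict(model, text):
    """Return the most probable tag for a new, untagged text."""
    word_counts, tag_counts, vocab = model
    total_docs = sum(tag_counts.values())
    tokens = tokenize(text)
    best_tag, best_score = None, -math.inf
    for tag in tag_counts:
        score = math.log(tag_counts[tag] / total_docs)
        total_words = sum(word_counts[tag].values())
        for w in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[tag][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Invented, manually tagged training examples.
TRAINING = [
    ("sensor data from sewage pipes", "sensors"),
    ("flow sensor readings in pipes", "sensors"),
    ("interview transcripts about water use", "sociology"),
    ("sociological interview with households", "sociology"),
]

model = train(TRAINING)
tag = predict(model, "new sensor readings from pipes")
```

The manual tags are the only source of knowledge here, which is why the quality and quantity of hand-tagged documents matter so much for any later automatic tagging.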

In their work for »ThWIC«, the researchers build on existing expertise. For example, an »Electronic Lab Notebook« (ELN) for the management of chemical research data has been under development at the University of Jena and the Leibniz Information Centre for Science and Technology in Hanover since 2020 as part of the National Research Data Infrastructure initiative. The project is being coordinated by Prof. Christoph Steinbeck, who is also involved in the »ThWIC« cluster. »The ELN is an electronic version of the classic laboratory notebook, but it also offers lots of advantages in terms of long-term data use,« emphasizes Felicitas Löffler. If data can be shared with other researchers and stored in a cloud, for example, it is available all around the world and can be linked to other sources. »This makes it possible to use data at different locations and to identify and understand overarching relationships«.