RMLStreamer with Reference Conditions in the KGCW Challenge 2023

Els de Vleeschauwer; Gerald Haesendonck; Dylan Van Assche; Ben De Meester

24 Jun 2023 | ESWC 2023 Workshop KGCW Submission | Readers: Everyone

Keywords: RMLStreamer, challenge, knowledge graph construction

TL;DR: RMLStreamer scales well in execution time and CPU usage while keeping memory usage constant, which earned it the Scalability Award in the KGCW 2023 Challenge.

Abstract: Knowledge graph construction from heterogeneous data has seen a lot of uptake in the last decade, ranging from compliance to performance optimizations with respect to execution time. Beyond execution time, other metrics for comparing knowledge graph construction systems, e.g. CPU or memory usage, are often not considered. The Knowledge Graph Construction Workshop (KGCW) 2023 Challenge aims to be a competitive challenge for knowledge graph construction systems that encourages optimizations for execution time, CPU usage, and memory usage. We participated in this challenge with RMLStreamer, an RML mapping engine that processes all data in a streaming fashion. As the second part of the challenge is based on the GTFS-Madrid-Bench, which cannot be handled by RMLStreamer, we added RMLLooseGenerator as a first step for those experiments. RMLLooseGenerator is a proof-of-concept implementation that simulates the effect of using reference conditions in RML mapping rules. In previous work we showed that using reference conditions in the GTFS-Madrid-Bench mapping file yields exactly the same output graph as the original mapping. The challenge results show that RMLStreamer scales well in execution time and CPU usage while maintaining constant memory usage. It therefore received the Scalability Award in the KGCW 2023 Challenge. The challenge also highlighted some weaknesses of RMLStreamer: no support for relational databases, an inefficient implementation of join operations, and longer execution times when handling nested sources such as JSON and XML files. After the challenge, RMLStreamer was extended with support for relational databases. In future work we will investigate how to further optimize the handling of joins and of nested sources.
