Data Pipelines
The Helmholtz-KG operates on a fully automated weekly pipeline managed by Apache Airflow. The processes are deployed for
- Harvesting publicly available metadata from Helmholtz sources,
- Processing metadata within our internal system (map, validate & enrich source data),
- Injecting mapped semantics into our graph database in the form of RDF triples.
The following diagram shows how metadata flows through each transformation stage, from external sources to the RDF Graph:
Harvesting
Data is harvested through commonly used integration patterns rather than bespoke pipelines. Dedicated pipelines are only deployed and maintained where large quantities of data or multiple endpoints can be harvested through a stable pipeline. Harvesting methods are described below; a full list of data sources and the methods by which they are harvested can be found here. Metadata is always harvested in its original form, and transformations are only carried out after harvesting in order to preserve node-to-record provenance.
Harvesting Methods
OAI-PMH-API The pipeline harvests metadata via the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) protocol, a widely used standard for exposing structured metadata from digital repositories and library catalogues. OAI-PMH provides a uniform interface that allows harvesters to retrieve metadata records in bulk through standardized requests.
Details
Our OAI-PMH pipelines are currently primarily used to harvest metadata from Helmholtz library systems, which means that the majority of records collected through this method describe scholarly documents such as journal articles, books, theses, and other publication-related materials. At present, the KG harvests several metadata formats exposed through OAI-PMH, including OAI-DC (Dublin Core) and OAI-MARC21, both of which are commonly supported by library infrastructures.
In the future, support for OAI-DataCite is planned in order to integrate richer metadata for research outputs that are registered through DOI-based infrastructures.
DataCite-API DataCite records are retrieved directly through the DataCite REST API. DataCite provides rich metadata for research outputs associated with DOIs, including datasets, software, publications, and other research products. The data is provided in the DataCite schema.
Details
The DataCite harvester currently does not rely on DOI prefixes associated with specific Helmholtz infrastructures, as these prefixes are not comprehensively documented or centrally maintained. Instead, the pipeline identifies relevant records based on the presence of ROR (Research Organization Registry) identifiers corresponding to Helmholtz centers. The harvester queries DataCite for records that reference one of these curated ROR identifiers, ensuring that only assets explicitly linked to Helmholtz organizations are integrated into the graph. This approach provides a clear and reliable scope for harvesting while avoiding the risk of unintentionally ingesting unrelated records.
While accessing DataCite metadata allows the Helmholtz KG to integrate widely used, DOI-based metadata that is already curated and maintained by repositories and data providers, we recognize that our current methods are restrictive: many records may reference Helmholtz centers in textual affiliation fields without including a corresponding ROR identifier and are therefore currently not captured. Future development of the DataCite harvester will aim to refine this strategy by incorporating additional knowledge about Helmholtz-related infrastructures, repositories, and DOI prefixes. These improvements will allow the pipeline to identify relevant records more comprehensively while maintaining a well-defined scope for the Helmholtz KG.
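For illustration, a query URL for the DataCite REST API restricted to a given ROR identifier might be built as follows. The Elasticsearch-style field path, the page size, and the ROR identifier in the usage note are assumptions for this sketch, not our production query:

```python
from urllib.parse import urlencode

DATACITE_API = "https://api.datacite.org/dois"

def ror_query_url(ror_id: str, page_size: int = 100) -> str:
    """Build a query URL for DOIs whose creator affiliations carry a given
    ROR identifier. The field path follows the DataCite REST API's
    Elasticsearch-style query syntax and is illustrative only."""
    query = f'creators.affiliation.affiliationIdentifier:"{ror_id}"'
    params = {"query": query, "affiliation": "true", "page[size]": page_size}
    return f"{DATACITE_API}?{urlencode(params)}"
```

Calling `ror_query_url("https://ror.org/0example00")` yields a single request URL; a real harvester would additionally page through the result set.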
ROR-API The KG harvests organization metadata from the ROR API based on a curated list of Helmholtz ROR identifiers. Records are retrieved in the native ROR schema and include standardized information on institutions and their relationships. A custom mapping layer transforms this data into the internal KG model.
Indico-API Event metadata is harvested from Indico instances used across Helmholtz via their REST APIs. The retrieved records follow the native Indico data structure, describing events such as conferences, workshops, and seminars. This metadata is integrated into the Helmholtz KG through a custom mapping process, aligning event information with the internal data model and linking it to related entities.
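A toy version of this mapping, assuming a simplified Indico event export and illustrative target keys (neither side reproduces the exact production schemas), could be sketched as:

```python
def map_indico_event(event: dict) -> dict:
    """Map a (simplified) Indico event export onto a Schema.org-like Event
    node. Source keys mirror Indico's HTTP export format in reduced form."""
    start = event.get("startDate", {})
    return {
        "@type": "Event",
        "@id": event["url"],  # the event's landing page as identifier
        "name": event["title"],
        "startDate": f'{start.get("date", "")}T{start.get("time", "")}',
        "location": event.get("location"),
    }
```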
Sitemap crawling In addition to harvesting metadata from APIs and repository endpoints, the Helmholtz Knowledge Graph harvests metadata from the web by crawling website sitemaps. This approach enables the KG to harvest metadata from infrastructures that publish structured web metadata but do not expose dedicated APIs, while relying on widely adopted semantic web standards designed for interoperability and machine readability.
We recommend sitemap crawling as the designated way to connect repositories to the Helmholtz KG because: (A) exposing metadata embedded in HTML script headers increases individual visibility on the web, independently of integration and representation by the Helmholtz KG, and (B) while crawling is typically slower and less controlled, it is generally more stable and independent of changes and updates to the endpoint itself.
Details
How does sitemap crawling of web-metadata work? XML sitemaps provide structured lists of URLs that belong to a website and are typically intended for search engines to discover pages efficiently. The crawling pipeline retrieves these sitemap files, iterates over the listed URLs, and visits the corresponding pages — most often the landing pages of digital assets such as datasets, publications, or software records.
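The sitemap-iteration step can be sketched with the Python standard library; fetching the sitemap file itself over HTTP is omitted here, and this is an illustration rather than our crawler's actual code:

```python
import xml.etree.ElementTree as ET

# Standard namespace used by the Sitemaps XML protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract the page URLs listed in a standard XML sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
```

The crawler would then visit each returned URL and inspect the page for embedded metadata.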
During this process, the crawler scans the HTML of each page for embedded Schema.org metadata expressed as JSON-LD. This metadata is typically included in a `<script type="application/ld+json">` element within the page source. The JSON-LD block contains a structured description of the resource using Schema.org vocabularies, including properties such as identifiers, creators, licenses, and links to related entities. When such metadata is detected, the crawler extracts the JSON-LD content, which can then be validated and converted into RDF representations that are integrated into the Helmholtz KG.
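A minimal extractor for such embedded JSON-LD blocks, built on Python's standard-library HTMLParser, might look like this (our production crawler may handle edge cases differently):

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json">
    blocks found in an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        # Script contents arrive here as raw character data.
        if self._in_jsonld and data.strip():
            self.blocks.append(json.loads(data))
```

Feeding a page to the parser via `feed()` populates `blocks` with one dictionary per JSON-LD script element.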
Data Storage (Records)
Following the harvesting stage, all records are stored within a PostgreSQL database to preserve record provenance and allow future re-mapping (e.g. when extending the data model). Every record is assigned a PURL within our designated namespace purls.helmholtz-metadaten.de/helmholtzkg_api to ensure that it can be globally uniquely identified. The raw data is available to the public through the Data Storage API.
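One way to mint such record PURLs deterministically is to hash the source name together with the source-local record identifier. The scheme below, including the HTTPS prefix and the truncated digest, is an assumption for illustration and not our actual minting logic:

```python
import hashlib

# Namespace for raw harvested records (URL scheme assumed for this sketch).
RECORD_NS = "https://purls.helmholtz-metadaten.de/helmholtzkg_api/"

def record_purl(source: str, source_record_id: str) -> str:
    """Derive a stable, globally unique PURL for a harvested record from its
    source and source-local identifier (illustrative hashing scheme)."""
    digest = hashlib.sha256(f"{source}|{source_record_id}".encode()).hexdigest()[:16]
    return RECORD_NS + digest
```

Because the digest depends only on the input pair, re-harvesting the same record always yields the same PURL.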
Processing
During our internal processing the source metadata is transformed into the Helmholtz Knowledge Graph data model through a dedicated mapping and validation stage. The internal model is implemented using Pydantic models and JSON Schema, and is largely derived from Schema.org types. This ensures both structural consistency and interoperability with widely adopted standards. A custom Python-based mapping engine retrieves the harvested records from the data storage layer and converts them into this unified representation.
During this transformation, the system performs a series of validation, normalization, and semantic enhancement steps to improve data consistency and connectivity. Central to this process is the handling of globally unique identifiers: entities in the graph are consistently identified via normalized identifiers such as DOIs, RORs, and ORCIDs. When present, these identifiers are standardized (e.g., enforcing HTTPS, removing redundant prefixes, and harmonizing casing conventions) to ensure stability and uniqueness. If no suitable identifier is provided, identifiers are generated during mapping using type-specific strategies to maintain consistent referencing across the graph. For this, a hash is generated and stored within our graph namespace purls.helmholtz-metadaten.de/helmholtzkg/ to ensure that all our nodes are globally uniquely identifiable.
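A simplified normalization routine along these lines could look as follows; the exact rules in our mapping engine may differ:

```python
def normalize_identifier(value: str) -> str:
    """Normalize DOI / ROR / ORCID identifiers towards a canonical HTTPS URL
    form: enforce HTTPS, strip the redundant "doi:" prefix, turn bare DOIs
    into resolver URLs, and lowercase DOIs (which are case-insensitive)."""
    v = value.strip()
    v = v.replace("http://", "https://")
    if v.lower().startswith("doi:"):
        v = v[4:]
    if v.startswith("10."):
        v = "https://doi.org/" + v
    if "doi.org/" in v:
        v = v.lower()
    return v
```

Identifiers that already conform, such as an HTTPS ORCID URL, pass through unchanged.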
In addition to identifier normalization, the mapping process applies lightweight semantic enrichment. This includes automated type inference, where missing entity types are derived from the usage of specific attributes based on Schema.org domain and range definitions. For example, the presence of an attribute such as affiliation may lead to the classification of an entity as a Person, since the domain of that property is restricted to schema:Person. This improves the overall structure and queryability of the graph.
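A toy version of this inference, with a small hand-picked property-to-type table (our actual rules are derived from Schema.org domain definitions, not from this table), might be:

```python
# Properties whose Schema.org domain implies a single type; this table is an
# illustrative assumption, not the KG's actual rule set.
DOMAIN_HINTS = {
    "affiliation": "Person",
    "familyName": "Person",
    "programmingLanguage": "SoftwareSourceCode",
}

def infer_type(entity: dict) -> dict:
    """Fill in a missing @type from the first attribute with a known domain;
    entities that already carry a type are returned unchanged."""
    if "@type" not in entity:
        for prop, inferred in DOMAIN_HINTS.items():
            if prop in entity:
                return {**entity, "@type": inferred}
    return entity
```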
Injection
Once validated and transformed, the resulting records are dumped and loaded in defined batches into the graph database (OpenLink Virtuoso), where they become part of the integrated knowledge graph. The data at this stage is defined as our ground-truth data. Semantic integration ensures that metadata from heterogeneous sources is consistently structured, reliably identified, and semantically aligned, enabling meaningful connections between entities such as datasets, publications, organizations, and researchers.
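As an illustration of batched loading, validated triples serialized as N-Triples lines can be wrapped into SPARQL INSERT DATA statements against a named graph and posted to the Virtuoso SPARQL endpoint. The batch size and graph IRI below are placeholders, not our production settings:

```python
from itertools import islice
from typing import Iterable, Iterator

GRAPH_IRI = "https://purls.helmholtz-metadaten.de/helmholtzkg/"  # illustrative

def insert_batches(triples: Iterable[str], batch_size: int = 10_000) -> Iterator[str]:
    """Wrap N-Triples lines into SPARQL INSERT DATA statements of bounded
    size, one statement per batch, targeting a named graph."""
    it = iter(triples)
    while batch := list(islice(it, batch_size)):
        body = "\n".join(batch)
        yield f"INSERT DATA {{ GRAPH <{GRAPH_IRI}> {{\n{body}\n}} }}"
```

Keeping batches bounded avoids oversized requests while letting the loader stream arbitrarily large dumps into the graph database.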
