SPARQL: Querying the Helmholtz Knowledge Graph

Overview

The Helmholtz Knowledge Graph (Helmholtz-KG) represents a large web of research entities (datasets, publications, software, and organizations) interconnected as RDF (Resource Description Framework) triples. It currently maintains two SPARQL 1.1-compliant endpoints to serve different needs. Tabular results can be retrieved as HTML, XML, JSON, JavaScript, CSV, TSV, or Spreadsheet; graph results as RDF/XML, Turtle, N-Triples, JSON-LD, or N3. This supports workflows ranging from automated data harvesting to cross-domain relationship discovery.

By offering these two endpoints, the Helmholtz-KG lets researchers choose the engine best suited to their requirements: QLever for high-speed exploration and real-time autocompletion, or Virtuoso for reliable production-grade integration and extensive serialization support.

High-Performance Discovery: QLever

The primary interface for rapid exploration is powered by QLever, a high-performance, in-memory SPARQL engine. It is accessible at sparql.unhide.helmholtz-metadaten.de and qlever.unhide.helmholtz-metadaten.de.

Production-Grade Integration: OpenLink Virtuoso

The second endpoint is powered by OpenLink Virtuoso, a robust engine optimized for production-grade stability, broad data interoperability, and standard-compliant metadata serving.

It is accessible via the interactive web interface at virtuoso.unhide.helmholtz-metadaten.de.

Advantages of Each Endpoint

Endpoint	Advantages
QLever	Distinguishing Features: QLever offers fast query performance on very large datasets and context-sensitive autocompletion, which suggests only entities and properties that lead to non-empty results. It also integrates text and spatial search natively, so complex queries can span both structured data and full-text corpora. Access URLs: qlever.unhide.helmholtz-metadaten.de and sparql.unhide.helmholtz-metadaten.de. Best For: Fast, interactive discovery and exploratory data analysis.
Virtuoso	Extensive Export Formats: Virtuoso supports a wide range of serializations: tabular data (HTML, XML, JSON, JavaScript, CSV, TSV, and Spreadsheet) and graph data (RDF/XML, Turtle, JSON-LD, N-Triples, and N3). Interoperability: Fully SPARQL 1.1 compliant, supporting federated queries (SERVICE clauses) to bridge the Helmholtz-KG with external graphs like Wikidata or ROR. Access URL: virtuoso.unhide.helmholtz-metadaten.de. Best For: Automated harvesting, production dashboards, and long-term research citations.

Introduction to SPARQL

SPARQL (pronounced “sparkle”) is the standard query language for RDF (Resource Description Framework) data, designed by the World Wide Web Consortium (W3C) to retrieve information from graph-based data stores. The name is a recursive acronym for SPARQL Protocol and RDF Query Language. In the Helmholtz Knowledge Graph, it lets users run complex, machine-readable searches across millions of interconnected research entities.

By using graph pattern matching, it enables researchers to uncover deep relationships between datasets, software, and publications that traditional keyword searches might miss.

What is RDF?

Before diving into SPARQL, it helps to understand RDF, the data model SPARQL operates on. RDF data in the Helmholtz-KG is stored as triples, each consisting of three parts:

Subject: The entity you are describing (e.g., a specific Dataset).
Predicate: The relationship or property (e.g., schema:creator).
Object: The value or related entity (e.g., a Researcher's name).

Each triple expresses a fact or relation, such as a dataset having a title or an author being affiliated with an institution. A collection of these triples forms a graph. In the Helmholtz-KG, RDF triples represent metadata relationships, for example linking a dataset to its creators, distributions, related publications, or licensing information.
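As a minimal sketch of the triple model, the following Python snippet stores a few triples as plain tuples and collects the facts stated about one subject. The `ex:` identifiers and all values are hypothetical and are not taken from the Helmholtz-KG.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; not real Helmholtz-KG data.
TRIPLES = [
    ("ex:dataset1", "rdf:type", "schema:Dataset"),
    ("ex:dataset1", "schema:name", "Ocean temperature profiles"),
    ("ex:dataset1", "schema:creator", "ex:person1"),
    ("ex:person1", "schema:name", "A. Researcher"),
]


def describe(subject, triples):
    """Collect every predicate/object pair stated about one subject."""
    facts = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            facts[p].append(o)
    return dict(facts)


if __name__ == "__main__":
    for predicate, objects in describe("ex:dataset1", TRIPLES).items():
        print(predicate, objects)
```

The object of one triple (`ex:person1`) is the subject of another, which is what turns a flat list of triples into a graph.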

Anatomy of a SPARQL Query

A standard query to the Helmholtz-KG typically consists of four main blocks:

PREFIX: Shortcuts to long URIs (e.g., schema: instead of https://schema.org/).
SELECT: Defines which variables (marked with a ?) you want to see in your results.
WHERE: The "pattern" you are looking for in the graph, enclosed in curly braces {}.
LIMIT: Constrains the number of results returned.
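The four blocks can be sketched as a small Python helper that assembles a query string. The function name `build_select_query` and its parameters are illustrative assumptions, not part of any Helmholtz-KG tooling.

```python
def build_select_query(prefixes, variables, patterns, limit):
    """Assemble a SELECT query from its PREFIX, SELECT, WHERE and LIMIT blocks."""
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    lines.append("SELECT " + " ".join(f"?{v}" for v in variables))
    lines.append("WHERE {")
    lines.extend(f"  {pattern}" for pattern in patterns)
    lines.append("}")
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_select_query(
        prefixes={"schema": "https://schema.org/"},
        variables=["dataset", "title"],
        patterns=["?dataset a schema:Dataset ;", "         schema:name ?title ."],
        limit=10,
    ))
```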

Common SPARQL Query Forms

SPARQL supports several forms of queries, each serving different needs: (sparql.dev)

Query Type	Purpose
SELECT	Retrieves tabular result sets based on matched patterns.
ASK	Returns a Boolean indicating whether a pattern exists.
CONSTRUCT	Builds a new RDF graph based on matched patterns.
DESCRIBE	Returns a description of a resource as an RDF graph.

Variables and Patterns

In SPARQL, variables begin with a ?, such as ?dataset or ?title. The WHERE clause defines triple patterns where variables match parts of the graph. For example:

?dataset schema:creator ?person .
matches any triple whose predicate is schema:creator, binding the subject to the variable ?dataset and the creator to the variable ?person.
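To illustrate how a triple pattern binds variables, here is a simplified Python sketch that matches one pattern against an in-memory list of triples. It is a toy model of the idea, not how QLever or Virtuoso evaluate queries, and the `ex:` data is hypothetical.

```python
# Hypothetical triples; not real Helmholtz-KG data.
TRIPLES = [
    ("ex:dataset1", "schema:creator", "ex:person1"),
    ("ex:dataset2", "schema:creator", "ex:person2"),
    ("ex:dataset1", "schema:name", "Ocean temperature profiles"),
]


def match_pattern(pattern, triples):
    """Return one binding dict per triple that matches the pattern.

    Terms starting with "?" are variables; all other terms must match exactly.
    """
    solutions = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated within the pattern must bind to one value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No term failed to match, so this triple is a solution.
            solutions.append(binding)
    return solutions


if __name__ == "__main__":
    for solution in match_pattern(("?dataset", "schema:creator", "?person"), TRIPLES):
        print(solution)
```

Each solution plays the role of one row in a SELECT result.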

Prefixes and Namespaces

SPARQL uses PREFIX to simplify queries. Instead of writing full URIs, you can define a prefix once and reuse it:


PREFIX schema: <https://schema.org/>

This helps shorten queries and improves readability.
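A prefix is only an abbreviation: the engine expands every prefixed name back to a full IRI before matching. The Python sketch below mimics that expansion; the `expand` helper is a hypothetical name used for illustration.

```python
# Prefix declarations, mirroring PREFIX lines at the top of a query.
PREFIXES = {
    "schema": "https://schema.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'schema:creator' to its full IRI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix}")
    return prefixes[prefix] + local


if __name__ == "__main__":
    print(expand("schema:creator"))
    print(expand("rdf:type"))
```

Because the expansion is plain string concatenation, the namespace must match the stored data exactly; a mismatch in scheme or trailing slash yields a different IRI.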

Getting Started with SPARQL in Helmholtz-KG

Once you understand the basics above, you can start writing queries against the Helmholtz-KG using either the QLever or Virtuoso SPARQL endpoints.
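Queries can also be sent programmatically. The sketch below builds a GET request with a `query` parameter, following the SPARQL 1.1 Protocol, using only the Python standard library. The endpoint URL is a placeholder: the exact path on the QLever or Virtuoso hosts is not given here, so substitute the real one. The request is only constructed, not sent.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint (assumption); replace with the actual endpoint URL.
ENDPOINT = "https://example.org/sparql"


def build_request(endpoint, query, accept="application/sparql-results+json"):
    """Build a SPARQL Protocol GET request; the Accept header picks the result format."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": accept})


if __name__ == "__main__":
    request = build_request(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
    print(request.get_method(), request.full_url)
    # To execute it against a real endpoint:
    # with urllib.request.urlopen(request) as response:
    #     body = response.read().decode("utf-8")
```

`application/sparql-results+json` is the standard media type for SELECT and ASK results; CONSTRUCT and DESCRIBE return graphs, so an RDF media type such as `text/turtle` would be requested instead.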

Build your first query

Step 1: Define Your "Shorthand" (PREFIX)

URLs in the Semantic Web are very long. Instead of typing https://schema.org/name every time, we create a shortcut called a prefix.

PREFIX schema: <https://schema.org/>

The core metadata prefixes for the Helmholtz-KG are listed below. The Helmholtz-KG primarily uses Schema.org for high-level research objects, supplemented by standard W3C vocabularies for cataloging and provenance.

Prefix	Namespace	Use Case
schema	https://schema.org/	Datasets, Software, Persons, Organizations, Events
rdf	http://www.w3.org/1999/02/22-rdf-syntax-ns#	Basic RDF types and properties
rdfs	http://www.w3.org/2000/01/rdf-schema#	Labels and human-readable descriptions
prov	http://www.w3.org/ns/prov#	Provenance, tracking how data was created or modified

Step 2: Choose Your Variables (SELECT)

Decide what information you want the query to return. Variables are always preceded by a question mark (?).

SELECT ?dataset ?title

Step 3: Map the Pattern (WHERE)

This is the heart of the query. You describe the "Triple" (Subject → Predicate → Object) that you are looking for. End each triple pattern with a period (.), or use a semicolon (;) to continue with another property of the same subject.

WHERE {
  ?dataset a schema:Dataset ;           # Find things that are Datasets
           schema:name ?title ;         # Get their names
           schema:creator ?person .     # Find the creator entity
  ?person  schema:name ?authorName .    # Get the name of that creator
}

Step 4: Refine the Results (LIMIT)

Knowledge graphs can be massive. Always start with a LIMIT to avoid overwhelming your browser with thousands of results. The `LIMIT` keyword caps the number of results returned.

LIMIT 10

Putting it all Together: A Simple SELECT Query

PREFIX schema: <https://schema.org/>

SELECT ?title ?authorName
WHERE {
  ?dataset a schema:Dataset ;
           schema:name ?title ;
           schema:creator ?person .
  ?person  schema:name ?authorName .
}
LIMIT 5

This query returns up to 5 rows, each pairing a dataset title with the name of one of its creators. It uses a basic graph pattern to match datasets that have a name and a creator, where the creator in turn has a name.
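When JSON is requested, SELECT results follow the W3C SPARQL 1.1 Query Results JSON Format: a `head` listing the variables and a `results.bindings` list with one object per row. The sketch below flattens such a document into plain dictionaries. The sample payload is made up for illustration and is not real endpoint output.

```python
import json

# Hypothetical response body in the SPARQL JSON results format.
SAMPLE = """
{
  "head": {"vars": ["title", "authorName"]},
  "results": {"bindings": [
    {"title": {"type": "literal", "value": "Ocean temperature profiles"},
     "authorName": {"type": "literal", "value": "A. Researcher"}}
  ]}
}
"""


def rows(result_json):
    """Turn a SPARQL JSON result document into a list of {variable: value} dicts."""
    doc = json.loads(result_json)
    variables = doc["head"]["vars"]
    return [
        # Unbound variables are simply absent from a binding, so skip them.
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]


if __name__ == "__main__":
    for row in rows(SAMPLE):
        print(row["title"], "by", row["authorName"])
```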

Filters and Constraints

Filters and constraints narrow down your results. They let you search for specific text, compare dates, or exclude certain data.

Filtering by Text (FILTER)

If you want to find items where a title contains a specific word (e.g., "climate"), use the FILTER keyword with CONTAINS:

PREFIX schema: <https://schema.org/>

SELECT ?title WHERE {
  ?dataset a schema:Dataset ;
           schema:name ?title .
  # Only return titles containing the word "climate" (case-insensitive)
  FILTER(CONTAINS(LCASE(?title), "climate"))
}
LIMIT 10
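The filter above lowercases only the title, so the search term itself must be written in lowercase. A rough Python equivalent of the same test, with made-up titles, makes this visible:

```python
def contains_ci(title, needle):
    """Mirror FILTER(CONTAINS(LCASE(?title), needle)) for a single title."""
    # Only the title is lowercased, exactly as in the SPARQL filter.
    return needle in title.lower()


# Hypothetical titles for illustration.
TITLES = ["Climate model output 2020", "Arctic CLIMATE stations", "Soil moisture grid"]

if __name__ == "__main__":
    print([t for t in TITLES if contains_ci(t, "climate")])
    print([t for t in TITLES if contains_ci(t, "Climate")])
```

Searching for "Climate" with a capital C matches nothing, because no lowercased title can contain an uppercase letter.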

Numerical and Date Constraints

You can use standard comparison operators like >, <, and = to filter results based on dates or quantities. The following query filters the results to show only datasets published after January 1st, 2023.

PREFIX schema: <https://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?name ?date WHERE {
  ?dataset a schema:Dataset ;
           schema:name ?name ;
           schema:datePublished ?date .
  # Only show datasets published after January 1st, 2023
  FILTER(?date > "2023-01-01"^^xsd:date)
}

Filtering by Language

Filters can also check the language tag of a string.

PREFIX schema: <https://schema.org/>

SELECT ?name ?date WHERE {
  ?dataset a schema:Dataset ;
           schema:name ?name ;
           schema:datePublished ?date .
  # Only show datasets with English titles
  FILTER(LANG(?name) = "en")
}

Sample queries

This section provides several examples that demonstrate how to extract data from the graph.

List Datasets created by a specific organization.

In a Knowledge Graph, power comes from relationships. This query finds datasets and connects them to their creators, filtering the results to only show datasets created by a specific organization (e.g., "Helmholtz").

PREFIX schema: <https://schema.org/>

SELECT ?title ?authorName ?orgName WHERE {
  ?dataset a schema:Dataset ;
           schema:name ?title ;
           schema:creator ?person .
           
  ?person schema:name ?authorName ;
          schema:affiliation ?org .
          
  ?org schema:name ?orgName .
  
  # Filter for a specific organization keyword
  FILTER(CONTAINS(?orgName, "Helmholtz"))
}

List Software with an MIT or CC BY 4.0 (`legalcode`) license.

PREFIX schema: <https://schema.org/>
SELECT ?software ?softwareName ?license
WHERE {
  ?software a schema:SoftwareSourceCode ;
            schema:name ?softwareName ;
            schema:license ?license .
  FILTER(?license IN (<https://opensource.org/licenses/MIT>, <https://creativecommons.org/licenses/by/4.0/legalcode>))
}

List Top 3 organisations in the Helmholtz Association by total count of published digital assets (datasets, documents, and software combined).

PREFIX schema: <https://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?orgName (COUNT(DISTINCT ?asset) AS ?assetCount)
WHERE {
  ?asset schema:publisher ?org .
  ?org schema:name ?orgName .
  ?asset rdf:type ?assetType .
  FILTER (?assetType IN (schema:Dataset, schema:ScholarlyArticle, schema:SoftwareSourceCode))
}
GROUP BY ?orgName
ORDER BY DESC(?assetCount)
LIMIT 3
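The aggregation steps in this query (GROUP BY, COUNT(DISTINCT ...), ORDER BY DESC, LIMIT) can be pictured with a small Python sketch over hypothetical (asset, publisher name) rows. It illustrates the logic only; the organization names and counts are invented.

```python
from collections import defaultdict

# Hypothetical (asset, publisher name) rows, including one duplicate row.
ROWS = [
    ("ex:a1", "Org A"), ("ex:a2", "Org A"), ("ex:a2", "Org A"),
    ("ex:a3", "Org B"),
    ("ex:a4", "Org C"), ("ex:a5", "Org C"), ("ex:a6", "Org C"),
    ("ex:a7", "Org D"), ("ex:a8", "Org D"), ("ex:a9", "Org D"), ("ex:a10", "Org D"),
]


def top_publishers(rows, limit=3):
    """Count distinct assets per publisher and return the largest groups first."""
    assets_by_org = defaultdict(set)                 # GROUP BY ?orgName
    for asset, org in rows:
        assets_by_org[org].add(asset)                # a set gives COUNT(DISTINCT ?asset)
    counts = [(org, len(assets)) for org, assets in assets_by_org.items()]
    counts.sort(key=lambda item: item[1], reverse=True)  # ORDER BY DESC(?assetCount)
    return counts[:limit]                            # LIMIT


if __name__ == "__main__":
    for org, count in top_publishers(ROWS):
        print(org, count)
```

The duplicate row for `ex:a2` is counted once, which is the effect of DISTINCT inside COUNT.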

List contributions and their types that demonstrate collaboration by a person from Forschungszentrum Jülich (FZJ).

PREFIX schema: <https://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?outputName ?assetClass ?fzjAuthorName ?fzjOrgName 
WHERE {
  ?output rdf:type schema:Thing .
  ?output (schema:creator|schema:author|schema:contributor) ?fzjAuthor .
  ?fzjAuthor schema:affiliation ?fzjOrg .
  ?fzjOrg schema:name ?fzjOrgName .
  FILTER (REGEX(?fzjOrgName, "Forschungszentrum Jülich|FZJ", "i"))
  ?fzjAuthor schema:name ?fzjAuthorName .
  ?output schema:name ?outputName .
  OPTIONAL { ?output rdf:type ?assetClass . }
}

List Scholarly articles that directly cite a Dataset.

PREFIX schema: <https://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT *
WHERE {
    ?article rdf:type schema:ScholarlyArticle .
    ?article schema:citation ?object.
    ?object rdf:type schema:Dataset .
}

List all seminars since 2020 organized by Helmholtz AI about neural networks.

PREFIX schema: <https://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT *
WHERE {
  ?event rdf:type schema:Event; schema:name ?eventName .
  OPTIONAL { ?event schema:startDate ?startDate . }
  OPTIONAL { ?event schema:date ?dateFallback . }
  FILTER (xsd:dateTime(?startDate)>="2020-01-01"^^xsd:date)
  OPTIONAL { ?event schema:organizer/schema:name ?organizerName . }
  OPTIONAL { ?event schema:url ?eventURL . }
  OPTIONAL { ?event schema:description ?description . }
  FILTER ( (REGEX(?eventName, "seminar", "i")  ))
  FILTER ( (REGEX(?organizerName, "Helmholtz AI", "i")  ))
  FILTER ( (REGEX(?description, "neural networks", "i")  ))
}
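One subtlety in this query: a FILTER that tests a variable bound only inside an OPTIONAL block removes every row where that variable is unbound, because the filter expression raises an error and the row is dropped. The Python sketch below mimics this with hypothetical rows, where a missing key stands for an unbound variable. Python's `re` module stands in for SPARQL's REGEX, which is close enough for plain words.

```python
import re

# Hypothetical solution rows after the OPTIONAL patterns; a missing key means "unbound".
ROWS = [
    {"eventName": "AI Seminar", "organizerName": "Helmholtz AI",
     "description": "Intro to neural networks"},
    {"eventName": "AI Seminar II", "organizerName": "Helmholtz AI"},  # no description
    {"eventName": "Data Workshop", "organizerName": "Helmholtz AI",
     "description": "Neural networks in practice"},
]


def regex_filter(row, variable, pattern):
    """Mimic FILTER(REGEX(?variable, pattern, "i")); unbound variables fail the filter."""
    value = row.get(variable)
    return value is not None and re.search(pattern, value, re.IGNORECASE) is not None


def seminars(rows):
    """Apply the three text filters of the query to each row."""
    return [
        row["eventName"]
        for row in rows
        if regex_filter(row, "eventName", "seminar")
        and regex_filter(row, "organizerName", "Helmholtz AI")
        and regex_filter(row, "description", "neural networks")
    ]


if __name__ == "__main__":
    print(seminars(ROWS))
```

The second row is dropped even though its name and organizer match, because it has no description to test. In effect, filtering on an optional variable makes that variable required.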
