PROV4ITDaTa - Technical documentation

PROV4ITDaTa is available at

An online version of this report is available at

PROV4ITDaTa has been presented at

Data portability is one of the pillars to enable users to control their data on the Web. Initiatives such as the Data Transfer Project (DTP) take a step in this direction by providing an open-source, service-to-service data portability platform so that individuals can move their data whenever they want. However, being hard-coded, such efforts lack transparency and interoperability. On the one hand, when data is transferred between services, there is no trust in whether this transfer was of high quality. Assessing the transfer’s quality would require reviewing the source code, as data provenance is unavailable. On the other hand, complying with this hard-coded platform requires development effort to create custom data models. Existing interoperability standards are neglected.

In PROV4ITDaTa, we create a solution that is fully transparent and has a fine-grained configuration to improve interoperability with other data models. Furthermore, the intermediate dataset generated from the source data and its provenance are FAIR resources. To achieve this, we exploit and advance the existing open-source RML.io toolchain and Comunica, and show their extensibility by directly applying them to the Solid ecosystem.

By combining the RML.io toolchain and Comunica into a fully transparent transfer process, we improve on current data portability approaches. Also, assessing the provenance trail can add trust to a transfer process. As a fine-grained and declarative system, it supports detailed configuration and provenance generation. The transfer of personal data can be assessed before the data is accessed, and legal audits can be performed automatically based on the structured and semantically sound provenance trail. To show its applicability and extensibility for decentralized environments, we will connect it to the Solid ecosystem, giving users full control over their data.


graph LR
  linkStyle default interpolate basis

    subgraph PROV4ITDaTa
      WA[Web App]
      subgraph sgRML [RML.io]
        RM[RMLMapper]
        RMW[RMLMapper Web API]
      end
      subgraph sgComunica [Comunica]
        CE[Comunica engine]
      end
    end

    subgraph sgServices [Data Providers]
      IMGUR[Imgur]
      FLICKR[Flickr]
      OTHER[Other services]
    end

    subgraph sgSolid [Solid]
      SP[Solid pod]
    end

    RM --> RMW
    RMW --> WA
    WA ---|1. Authorization| x( )
    x --- IMGUR
    x --- FLICKR
    x --- OTHER
    x --- SP

    y( ) -->|2. Consume data from services| RM
    IMGUR --> y
    FLICKR --> y
    OTHER --> y

    WA -->|3. Store| SP

    %% interactions: Comunica
    WA -->|4. Query intermediate datasets| CE
    CE --> SP


The architecture comprises five main components: the Web App, the RML Mappings, the RMLMapper, the Solid pods, and the Comunica engine.

The main component is the PROV4ITDaTa Web App, which performs the different steps required for transferring data from services (e.g. Flickr, Imgur, etc.) to a Solid pod.

  1. During the first step, the Web App obtains the credentials for authorized access to the services and the Solid pod.
  2. In the second step, the toolset provides the necessary components for
    1. consuming (protected) resources from Web APIs, and
    2. transforming the fetched data to RDF.
  3. In the third step, the Web App stores the RDF and provenance data, generated by the RMLMapper-JAVA, to the Solid pod.
  4. In the fourth step, the newly generated RDF data can be integrated with internal and external data sources, transformed, and again stored in the Solid pod.
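The four steps above can be sketched as the following orchestration. This is a minimal illustration only: all helper names (`authorize`, `execute_mapping`, `store_in_pod`) are hypothetical stubs, not the actual PROV4ITDaTa API.

```python
def authorize(service: str) -> str:
    """Step 1: obtain a credential for a service or the Solid pod (stubbed)."""
    return f"token-for-{service}"

def execute_mapping(mapping: str, tokens: dict) -> tuple:
    """Step 2: consume the Web APIs and transform the fetched data to RDF (stubbed)."""
    return ("<generated RDF>", "<provenance trail>")

def store_in_pod(pod_token: str, *documents: str) -> list:
    """Step 3: store the generated documents in the Solid pod (stubbed)."""
    return list(documents)

def transfer(mapping: str, services: list) -> list:
    tokens = {s: authorize(s) for s in services}   # step 1: authorization
    pod_token = authorize("solid-pod")
    rdf, prov = execute_mapping(mapping, tokens)   # step 2: consume and map
    stored = store_in_pod(pod_token, rdf, prov)    # step 3: store in the pod
    # Step 4 (querying the intermediate datasets with Comunica) operates
    # later on the documents now stored in the pod.
    return stored

print(transfer("flickr-mapping.ttl", ["flickr"]))
```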

RML Mapping documents

An RML Mapping provides the means to create semantically enriched RDF data from heterogeneous and (semi-)structured sources, using a declarative set of rules (RML).

Describing the Web API as a source in an RML Mapping allows the RMLMapper to determine how requests should be made to consume the Web API. Furthermore, it requires a one-time effort and avoids hard-coded implementations for each service to be added. The mapping can easily be reused for — and extended to — similar Web APIs.

The RML Mappings are served statically and can be found in the /public/rml directory.

Documentation on how to create RML Mappings is found on


The RMLMapper, proxied by the RMLMapper Web API, processes the selected RML Mapping and yields both the generated RDF and provenance data.

In the background, the RMLMapper determines how to consume a Web API based on its description, without the need to integrate code that is tightly coupled to a specific Web API. Such hard-coded implementations are error-prone and time-consuming (consider, for example, writing tests and rebuilding the application every time a service is added).

Solid pods

Solid pods can be used to create a Personal Information Management System (PIMS), giving individuals more control over their personal data. Individuals can manage their personal data in secure online decentralized storage systems called pods: secure personal web servers for data. All data in a pod is accessible via the Solid Protocol.

When data is stored in someone’s pod, they control who and what can access it, and they share it when and with whom they choose. Providers of online services and advertisers need to interact with the Solid pod if they plan to process the individual’s data. This enables a human-centric approach to personal information and new business models.

Any kind of data can be stored in a Solid pod, however, the ability to store Linked Data is what makes Solid special. Linked Data gives Solid a common way to describe things and how they relate to each other, in a way that other people and machines can understand. This means that the data stored by Solid is portable and completely interoperable.

Anyone or anything that accesses data in a Solid pod uses a unique ID, authenticated by a decentralized extension of OpenID Connect. Solid’s access control system uses these IDs to determine whether a person or application has access to a resource in a pod.


Comunica provides a meta-query engine that is designed in a highly modular and configurable manner to deal with the heterogeneous nature of Linked Data on the Web, allowing the Comunica engine to be fine-tuned to completely suit the needs of the system. Furthermore, Comunica also supports executing SPARQL queries over one or more interfaces.

Incorporating Comunica, as depicted in the Architectural diagram above, allows us to select data from the intermediary datasets on the Solid Pod and transfer it to new services.


Specific parts of the intermediate RDF datasets, generated by the RMLMapper, can be filtered by applying queries, allowing for a fine-grained configuration of which parts of the data will be included in the final dataset.

The queries are served statically and can be found in the /public/sparql directory.
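The effect of such a query can be sketched in plain Python. A real deployment runs a SPARQL query with Comunica; here an in-memory triple list stands in for the intermediate dataset, and all subjects and values are invented for illustration.

```python
SCHEMA = "http://schema.org/"

# A toy intermediate dataset as (subject, predicate, object) triples.
intermediate = [
    ("ex:img1", SCHEMA + "name", "Holiday photo"),
    ("ex:img1", SCHEMA + "contentSize", "4 MB"),
    ("ex:img2", SCHEMA + "name", "Receipt scan"),
]

# Keep only the schema:name triples, analogous to a SPARQL query that
# projects just the fields the user agreed to transfer.
wanted = {SCHEMA + "name"}
final = [t for t in intermediate if t[1] in wanted]

for s, p, o in final:
    print(s, o)
```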


To achieve a high level of transparency, we have added provenance support to the Comunica architecture as well, based on the design we made in

We are now able to provide different levels of provenance to the query result:

The diagram below depicts the new actors added to the Comunica architecture:

graph TD

  linkStyle default interpolate basis

    A_CS(Comunica SPARQL)

    B_OQO[Optimize Query Operation]
    B_QO[Query Operation]

    O_QO((Query<br> Operation<br> Observer))
    A_CP(Collect Provenance)
    A_PW(Query Operation Provenance Wrapper)

    A_QP(Quad Pattern)

    B_RQP[RDF Resolve Quad Pattern]


    B_RDr[RDF Dereference]

    B_RM[RDF Metadata]

    B_RME[RDF Metadata Extract]

    A_HC(Hydra Count)
    A_SS(SPARQL Service)
    A_AP(Annotate Provenance)
    B_RHL[RDF Hypermedia Links]

    B_RRH[RDF Resolve Hypermedia]

       %% connections

    B_I  --> A_CS
    A_CS --> B_OQO
    A_CS --> B_QO

    B_OQO --> A_PW

    B_QO --> A_QP
    B_QO --> A_CP
    B_QO --- O_QO

    A_QP --> B_RQP

    B_RQP --> A_Hm
    B_RQP --> A_F
    A_F --> B_RQP

    A_Hm --> B_RDr
    A_Hm --> B_RM
    A_Hm --> B_RME
    A_Hm --> B_RHL
    A_Hm --> B_RRH

    B_RME --> A_HC
    B_RME --> A_SS
    B_RME --> A_AP

    style A_F fill:#f9f,stroke:#333,stroke-width:4px
    style B_RME fill:#f9f,stroke:#333,stroke-width:4px
    style A_AP fill:#9f9,stroke:#333,stroke-width:4px

    style A_PW fill:#9f9,stroke:#333,stroke-width:4px
    style A_CP fill:#9f9,stroke:#333,stroke-width:4px
    style O_QO fill:#9f9,stroke:#333,stroke-width:4px

Web App

The Web App is the main entry point, allowing the user to select and execute an RML Mapping describing which Data Provider to consume data from, and how that data will be transformed to RDF.

Upon selecting an RML Mapping, the user can view and download its contents through the corresponding “RML Rules”-card (see Demonstrator). Using this RML Mapping, the user can inspect how their data will be processed prior to execution, without needing to inspect the source code.

At this point, the raw RML description in RDF is shown to the user; however, since it is a structured semantic format, automatic visualizations and explanatory descriptions can be created.

The Web App guides the user through the necessary authorization steps, prior to execution. Given the sensitive information being exchanged, communication with the Web App occurs over HTTPS.

After successful execution, the generated RDF and provenance become available for inspection and download. The provenance information, structured using the W3C recommended standard PROV-O, allows further automatic processing to validate that all data is processed correctly according to the user’s expectations.

By providing the user not only with the generated RDF but also with the data provenance, we address the transparency requirement other solutions lack.

The resulting RDF is stored onto the user’s Solid pod, which can be verified through the “Solid”-card (see Demonstrator).

As a result, this web app makes it possible to unambiguously define the user’s data using an RML Mapping and to transparently transfer it between services. The automatically generated provenance allows for inspection and validation of the processing.


We configure our application using a single configuration file: configuration.json. This file has a top-level configurationRecords key in which the so-called configuration records reside.

Such a configuration record contains at least an "id" and a "type":

  {
    "id": "...",
    "type": "..."
  }
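As a sketch, this invariant can be checked programmatically; the record values below are invented for illustration, and only the mandatory "id" and "type" keys are validated.

```python
import json

# A stand-in for the contents of configuration.json.
raw = '''
{
  "configurationRecords": [
    { "id": "example-record", "type": "example-type" }
  ]
}
'''

config = json.loads(raw)
for record in config["configurationRecords"]:
    # Every configuration record must carry at least "id" and "type".
    missing = {"id", "type"} - record.keys()
    if missing:
        raise ValueError(f"record is missing required keys: {missing}")
print("all configuration records are valid")
```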

Depending on its type, the configuration record may contain additional properties. The following list provides the details for each type of configuration record:

NPM Modules

All components of the PROV4ITDaTa platform are published to NPM under the @prov4itdata organization. The packages are briefly described in the following list:

Relation to DTP

After reviewing the DTP repository, we concluded that, although utility functions could be reused at a later stage, we currently focus on an end-to-end system using solely the RML.io toolchain and Comunica: on the one hand, because these technologies allow more advanced data transfer processes than DTP (such as joining data from different services on the fly); on the other hand, because integration efforts would put too high a burden on the current development sprints.


We set out the following requirements and linked them to the specific sections describing the features we currently support.

In general, our system is comparable to DTP, as it supports (and is extensible to) multiple data sources, and allows data import into Solid pods (see this section).


The upper part of the landing page provides the means for quickly initiating the transfer from a service to a Solid pod. Once the user selects the desired RML Mapping, the transfer can be initiated by clicking the [Execute]-button. Initially, the user will be prompted to authorize with the Solid pod and the service defined as the source in the RML Mapping.

The lower part allows the user to review

Furthermore, the user can inspect and verify that the generated RDF was successfully stored onto the Solid pod.


The walkthrough above illustrates the flow of transferring data from a Data Provider (in this case, Flickr) to a Solid pod.

  1. First, the user selects an RML Mapping, which becomes available for inspection and download within the “RML Rules”-card.
  2. Once the user has decided on which RML Mapping to use, the transfer process is triggered by pressing [Execute].
  3. When transferring for the first time, the user will have to log in to the Solid pod.
  4. Once the user has authenticated with Solid, the user will have to authenticate with multiple Data Providers (in this case, Flickr, Imgur, and Google Contacts).
  5. At this point, the actual transfer takes place. Upon success, the user will be notified and the generated RDF, accompanied by its provenance, will be available for inspection and download in the “Generated RDF”-card and “Provenance”-card, respectively.
  6. Finally, the user can inspect the data being stored on the Solid pod through the “Solid”-card.

Use cases

Our toolchain is extensible to a wide variety of services. We opted to initially support Flickr and Imgur. Both services share a common purpose: uploading and sharing image content. However, despite this commonality, they differ in various aspects, such as the underlying data model and how the resources should be accessed. Additionally, we support the Google People API. This latter connector has a similar access method to Imgur (JSON Web API via OAuth 2.0), but contains data in a very different data model.

There are four different resources to model

Our current use cases already showcase the flexibility of our approach; more use cases will be supported in the following sprints.

Best-practice vocabularies


Images can be mapped to schema:ImageObject resources, which inherit properties from schema:MediaObject that can be used to describe the height, width, and more.

An Image Gallery can be represented using schema:ImageGallery. Furthermore, the images it contains can be linked using the schema:hasPart property.


A Collection can be modeled through a schema:Collection, which can be linked to its Image Gallery resources through the schema:hasPart property.


A Person can be modeled using a schema:Person.



We can model an Image as a dcat:Distribution.

dcat:Distribution is defined as “A specific representation of a dataset. A dataset might be available in multiple serializations that may differ in various ways, including natural language, media-type or format, schematic organization, temporal and spatial resolution, level of detail or profiles (which might specify any or all of the above).”

The dcat:Distribution class has the following properties

We can model an Image Gallery as a dcat:Dataset, which can point to zero or more dcat:Distributions.


DCAT contains a dcat:Catalog class, a curated collection of metadata about resources (e.g., datasets and data services in the context of a data catalog). A dcat:Catalog can be linked to zero or more dcat:Datasets.


Flickr is an online photo management and sharing application. Its resources are made available through the Flickr API. Flickr follows the OAuth 1.0a protocol, which requires that requests to protected resources are signed using the Consumer Secret and Token Secret. By specifying the protocol in the RML Mapping, the RMLMapper-JAVA takes care of the necessary steps for creating requests to protected resources. This also contributes to the extensibility of our solution: when a service decides to change to another protocol, only the RML Mapping must be changed, avoiding the need to rebuild code.
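The signing step can be sketched as follows. This is a minimal illustration of OAuth 1.0a HMAC-SHA1 request signing, not the RMLMapper’s actual implementation; all keys and parameter values are made up.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def sign_request(method, url, params, consumer_secret, token_secret):
    # 1. Percent-encode and sort the parameters into the normalized string.
    norm = "&".join(f"{quote(k, safe='')}={quote(v, safe='')}"
                    for k, v in sorted(params.items()))
    # 2. Build the signature base string: METHOD&url&params, each encoded.
    base = "&".join(quote(p, safe="") for p in (method.upper(), url, norm))
    # 3. The signing key is consumer secret + "&" + token secret.
    key = f"{quote(consumer_secret, safe='')}&{quote(token_secret, safe='')}"
    # 4. HMAC-SHA1, base64-encoded, becomes the oauth_signature value.
    digest = hmac.new(key.encode(), base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

sig = sign_request(
    "GET", "https://www.flickr.com/services/rest",
    {"method": "flickr.photosets.getList", "oauth_nonce": "abc123"},
    consumer_secret="consumer-secret", token_secret="token-secret")
```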


Flickr Photoset resource  →  schema:ImageGallery
  id                    →  schema:identifier
  title._content        →  schema:name
  description._content  →  schema:description

Flickr Collection resource  →  schema:Collection
  id           →  schema:identifier
  title        →  schema:name
  description  →  schema:description

Using DCAT


Imgur, an image hosting and sharing website, enables its users to quickly upload and share images and GIFs on social media platforms (e.g. Reddit, Twitter, etc.). Unlike the Flickr API, the Imgur API uses OAuth 2.0: when making requests for protected resources, it suffices to add a bearer token to the HTTP headers.
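For contrast, a minimal sketch of the OAuth 2.0 case: attaching a bearer token to the request headers is all that is needed, with no signing. The endpoint URL and token below are illustrative placeholders, and the request is only constructed here, never sent.

```python
from urllib.request import Request

token = "example-access-token"  # placeholder, not a real token

# For OAuth 2.0 APIs such as Imgur, a single Authorization header suffices.
req = Request(
    "https://api.imgur.com/3/account/me/images",
    headers={"Authorization": f"Bearer {token}"},
)

print(req.get_header("Authorization"))
```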

The data fields mapped from the Imgur image resources are


An Imgur image can be mapped to a schema:ImageObject, along with the following properties:

Imgur image resource  →  schema:ImageObject
  id           →  schema:identifier
  title        →  schema:name
  description  →  schema:description
  link         →  schema:url *
  link         →  schema:image *
  type         →  schema:encodingFormat
  height       →  schema:height
  width        →  schema:width
  views        →  schema:interactionStatistic

*Multiple properties are suitable for mapping the link property.

Using DCAT

Google People API

Using the Google People API we can transfer our Google contacts.

Please note that this use case can be extended easily to other Google Products.


A Google contact can be mapped to a schema:Person, along with the following properties:

Google contact resource  →  schema:Person
  givenName    →  schema:givenName
  familyName   →  schema:familyName
  displayName  →  schema:alternativeName

Adding data providers

This section walks you through the different steps of adding a data provider to the PROV4ITDaTa Web App, where the Google People API serves as an example.

Google People API

First, we discuss the configuration steps on the Google Cloud Platform. Second, we elaborate on how to define the Google People API as a logical source in an RML Mapping.

Google Cloud Platform configuration



  "defaults": {
    "origin": "https://localhost:3000",
    "transport": "session"
  },
  "imgur": { ... },
  "flickr": { ... },
  "google": {
    "key": "<Client ID goes here>",
    "secret": "<Client secret goes here>",
    "callback": "https://localhost:3000/google/callback",
    "scope": [ "" ]
  }

For more information on setting up a Google app that uses the Google People API, check out

Creating an RML Mapping that consumes and transforms the Google People API

When creating an RML Mapping, you always have to define exactly one rml:logicalSource for every triples map you define. This way, the RML Processor knows where and how to access the data to be mapped. Using the rml:logicalSource’s rml:source property, we can specify the URL of the Web API.

To access protected resources, the RML Processor needs to include the required credentials when consuming the API. The Google People API uses OAuth 2.0, so an authorization header should be added to the requests. Since these credentials are managed by our web app, the value of the authorization header is a template variable; the web app recognizes it and fills in the value.
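A minimal sketch of such a substitution: the variable name matches the excerpt from the mapping, but the helper below is hypothetical, not the web app’s actual code.

```python
import re

# A line from an RML Mapping containing a template variable.
mapping = 'ex:AuthorizationHeader "{{authorizationHeader}}" ;'

def fill_template(text: str, values: dict) -> str:
    # Replace every {{name}} occurrence with its value from `values`.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], text)

filled = fill_template(mapping, {"authorizationHeader": "Bearer example-token"})
print(filled)
```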

The following is an excerpt of the rml:source defined in rml/google/contact-transfer-using-schema-org.ttl:

    ex:AuthorizationHeader "{{authorizationHeader}}" ;
    schema:name "Google API" ;
    schema:url <> ;
    a schema:WebAPI .


Within this project, we envisioned the following features.

Use Open Standards and Open Source

By supporting existing standards where possible, we aim to minimize the required foundational work. Widespread adoption and understanding of existing standards make this possible.

PROV4ITDaTa is an Open Source initiative to transparently transfer data from Data Providers to your personal data vault.

It makes use of Open Standards such as:

It contributes to the following Open Source projects:

Mapping files to transfer data

In PROV4ITDaTa, the original Data Provider data is transformed into well-defined RDF knowledge graphs. As such, the resulting data have a clear context and meaning by using established semantic ontologies, and the original data is made interoperable.

Instead of relying on a hard-coded system to generate the RDF knowledge graph, we make use of the RDF Mapping Language: a language to declaratively describe how to generate RDF knowledge graphs from heterogeneous data sources. These mapping files are created manually by experts, to ensure that established ontologies are used, and high-quality knowledge graphs are generated.

As the RML Mappings can include data transformations, we can ensure data is cleansed during the data transfer process.

The resulting data transfer process adheres to the FAIR principles, as the RML Mappings are

As such, the data transfer process is fully transparent.

For each Data Provider, a mapping needs to be created manually (as is the case in DTP where a connector needs to be created for each service). The input of a mapping is defined by describing how a Data Provider’s Web API will be accessed (endpoint URL, authorization protocol, response format, etc.). This way, the RMLMapper knows how to consume the Web API. Hence, the actual input of a mapping is the response returned by the Web API. Furthermore, the input will be mapped according to the rules in the mapping, resulting in semantically sound RDF which is stored on the user’s Solid pod.

Therefore, once the input of a mapping is described, it can easily be reused for creating other mappings for that Data Provider. Our advantage is that the mapping process is transparent and more easily adaptable when the data model changes.

Automatic Data Provenance Generation

Provenance and other metadata are essential for determining ownership and trust. However, defining such metadata typically stays independent of the generation process. In most cases, metadata is manually defined by the data publishers, rather than produced by the involved applications. In PROV4ITDaTa, we rely on the RML Mappings that specify how the RDF knowledge graphs are generated to automatically and incrementally generate complete provenance and metadata information of the data transfer process. This way, it is assured that the metadata information is accurate, consistent, and complete. The resulting provenance information can be inspected by the user once the transfer process is completed.

The provenance information is described using the W3C recommended PROV-O standard. It covers the RDF knowledge graph generation, including metadata for the mapping rules definition and the data descriptions. By automating the provenance and metadata generation relying on the machine-interpretable descriptions in the RML Mappings, metadata is generated in a systematic way and the generated provenance and metadata information becomes more accurate, consistent, and complete.
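To illustrate the flavor of such provenance, the following sketch hand-builds a few PROV-O-style statements. The actual provenance emitted by the RMLMapper is far richer, and all ex: identifiers below are made up.

```python
PROV = "http://www.w3.org/ns/prov#"

def triple(s, p, o):
    # Quote literals; leave bracketed resources as-is.
    obj = o if o.startswith("<") else f'"{o}"'
    return f"<{s}> <{p}> {obj} ."

# A mapping execution (a prov:Activity) used a source and generated a dataset.
provenance = [
    triple("ex:mappingExecution", PROV + "used", "<ex:flickrAPIResponse>"),
    triple("ex:generatedDataset", PROV + "wasGeneratedBy", "<ex:mappingExecution>"),
    triple("ex:generatedDataset", PROV + "generatedAtTime", "2021-04-01T12:00:00Z"),
]

print("\n".join(provenance))
```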

Because of this provenance information, the data transfer process is not only fully transparent before, but also after processing the data.

Output RDF

By relying on Semantic Web technologies, we achieve interoperability on multiple levels.

Syntactic Interoperability

Our solution provides multiple levels of Syntactic Interoperability by leveraging the standardized Turtle syntax for:

  1. Defining the RML Mappings, which includes describing the Data Provider’s Web APIs.
  2. Representing the RDF and provenance generated by the RML Mapper.
  3. Storing the resulting RDF on the Solid pods.

Semantic Interoperability

Semantic Interoperability is achieved when multiple systems are able to develop the same understanding of the data. To this end, the following choices were made

Structural Interoperability

Our solution solely produces RDF data, which, by itself, offers the means to accommodate Structural Interoperability.

Data Compatibility

Storing the transformed data, provenance, and other metadata using the RDF model enables us to correctly integrate the resulting data sets from different Data Providers.

Data Portability

By transferring data from Data Providers to the user’s Solid pod we leverage Data Portability, which is one of Solid’s primary features.

Security and Privacy

As there are multiple parties involved in the data transfer (the user, Data Providers, Solid pods, and the PROV4ITDaTa Web App), no single person or entity can fully ensure the security and privacy of the entire system. Instead, responsibility is shared among all the participants. The following subsections list responsibilities and leading practices that contribute to the security and privacy of the system, and describe how PROV4ITDaTa tackles them.

Data Minimization

When transferring data between providers, data minimization should be practiced. Practically this means that all parties should only process and retain the minimum set of data that is needed to provide their service. In the PROV4ITDaTa components, no data is stored, only processed. The user can inspect exactly which data fields are processed by inspecting the RML Mapping, and how these data fields are processed by inspecting the provenance data. All generated data is sent to the Solid pod, which is under the full control of the user.

User Control

All PROV4ITDaTa transfer processes are initiated by the user, never automatically. For each transfer process, the user needs to (re-)authenticate the PROV4ITDaTa process, for both the Data Providers and the Solid pod, using standardized authentication mechanisms such as OAuth where possible. Also, no authentication tokens are stored after the process completes. This guarantees that no unwanted data transfer processes are initiated.

The user further has full control over which Solid pod their data is stored in, and, as multiple alternative RML Mappings are provided, they can personalize how the data is processed.

Currently, a single Solid pod provider is supported. Our roadmap includes the possibility to choose your own Solid pod provider.

Minimal Scopes for Auth Tokens

Where possible, only minimal (read-only) scopes are requested for Auth Tokens at the different Data Providers. This further increases transparency into exactly what data will be moved, and increases security so that if tokens are somehow leaked they have the minimal possible privilege.

For example, for the Flickr service we only request the read scope, as this is possible in the Flickr API. For the Imgur API, however, there is no option to set a specific scope, so the default scope is used.

PROV4ITDaTa does not delete data from the Data Providers as part of the transfer. This functionality is left to the Data Providers.

Data retention

PROV4ITDaTa stores data only for the duration of the transfer process, and all data transfer uses transport layer security over secure HTTPS connections. The provenance information is available for the remainder of the user session, and storage of that provenance information needs to be initiated by the user if needed.

All authentication tokens are stored solely on the client-side for the duration of the transfer process, and are thus ephemeral: these tokens are automatically removed when the user’s browser session ends.


The Data Providers should have strong abuse protections built into their APIs. Since PROV4ITDaTa retains no user data beyond the life of a single transfer, the Data Providers have the best tools to be able to detect and respond to abusive behavior, e.g., using standard protocols such as OAuth to obtain API keys.


In PROV4ITDaTa, RML Mappings are used to configure the data transfer process. Multiple RML Mappings are available to the user, allowing for personalization: the user can inspect the different RML Mappings, see which data fields are being processed, and, based on their context, choose which RML Mapping to execute.


In PROV4ITDaTa, we ensure quality, both on the data level and on the software level.

Data Quality

The RML Mappings allow transforming data from heterogeneous data sources into RDF knowledge graphs. The toolchain has been used in a variety of existing projects and use cases, and allows cleaning the original data by means of FnO functions.

The mappings are created manually by experts to ensure best practices, and use established vocabularies.

Software Quality

All processing tools have been tested and applied in production-like environments (TRL7), and consist of unit and integration tests for

At, you can compare our RMLMapper-JAVA processor with other RML processors.

All PROV4ITDaTa components are published on NPM and on GitHub, and have gone through integration tests.


As defined by Article 4 of the GDPR, PROV4ITDaTa acts as a data processor for the end-user (who becomes their own data controller, in line with the EU digital sovereignty vision), as it allows individuals to easily transfer their files and data directly from Data Providers to their Solid pods.

This does not influence the GDPR roles of its complementary modules. Each Data Provider will maintain full control over determining who has access to the data stored on their systems. Access tokens need to be requested from each Data Provider the user would like to transfer data from, and from each Solid pod it would like to transfer data to. PROV4ITDaTa will not mediate data access rights between Data Providers or Solid pods. This ensures that API quotas continue to be managed by the Data Providers, thereby helping to mitigate traffic spikes and negative impacts across Data Providers.

The security requirements are in line with Article 32 of the GDPR:

At the time of writing, consent of the user is given via the authorization flows in the different Data Providers and Solid pods. Our privacy policy is available at


Currently, PROV4ITDaTa showcases how users can transparently transfer their data from different Data Providers to their Solid pod, using the RML.io toolchain. We showcase transparency (by providing the RML Mappings and the provenance data) and flexibility (by providing multiple Data Providers).

In our roadmap, we envision a general improvement of our demonstrator, the inclusion of a provenance-aware query-based importer using Comunica, and integration alternatives with, e.g., DTP.




If you are using or extending PROV4ITDaTa as part of a scientific publication, we would appreciate a citation of our article (PDF|HTML).

@inproceedings{DeMulder2021PROV4ITDaTa,
  author = {De Mulder, Gertjan and De Meester, Ben and Heyvaert, Pieter and Taelman, Ruben and Verborgh, Ruben and Dimou, Anastasia},
  title = {{PROV4ITDaTa:} Transparent and direct transfer of personal data to personal stores},
  booktitle = {Companion Proceedings of the The Web Conference},
  year = 2021,
  month = apr,
  url = {},
}