Data Provenance and Blockchain

Let us start with a quick look at each of these terms: data provenance and blockchain.

Data Provenance is the practice of recording the history of data, from its inception through every stage of the data lifecycle. It provides a detailed picture of how the data was collected, where it was stored, and how it was used. This record essentially forms an audit trail for the data itself.

Blockchain technology is a distributed, decentralized, immutable ledger. It is widely used for cryptocurrencies, but is not limited to them. It can be viewed as a shared, semi-shared, or private ledger (i.e. permissionless or permissioned, in blockchain parlance) for recording a sequence of events or a history of transactions. Because entries cannot be altered once recorded, it can be deployed to provide a high level of trust, accountability, and transparency.

Recommended blog post – Potential of Blockchain

With this background, we will now look at the following two use cases for Data Provenance and how Blockchain can help with each:

  1. Data Provenance in Data Lakes
  2. Data Provenance in Supply Chain Management

Data Provenance in Data Lakes

Data lakes have been gaining a lot of attention lately, as organizations become aware of the differences between a data warehouse and a data lake. (Recommended blog post – The inevitable shift from Data Warehouses to Data Lakes) Data lakes hold data from multiple sources in the organization, and retain it for a longer duration. A means of recording the history of data, from its origin to the data lake and beyond, is therefore of utmost importance. In technical parlance, this is known as “Data Provenance”.

Let us look at how blockchain technology can help with the data provenance problem for data lakes. Data lakes ingest data from multiple sources in the organization. Each source of data can be treated as an account in the blockchain network. Every time the data lake ingests a chunk of data from a source, that source records the raw data, or its hash, in the blockchain network as part of the audit trail. The record is added using the source's account in the blockchain, and hence it is signed by the system of origin. This allows someone in the future to confirm that the corresponding data was indeed obtained from the trusted source by validating the block in the blockchain. The validation is performed by re-computing the hash of the data and comparing it with the hash recorded in the block. Because the record containing the hash was signed by the system of origin, a matching hash establishes that the data was indeed received from the trusted, authentic source.
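The record-and-verify step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real blockchain client: the names `SOURCE_KEYS`, `make_record`, and `verify_chunk` are hypothetical, and an HMAC with a per-source secret stands in for the asymmetric signature that a blockchain account would produce with its private key.

```python
import hashlib
import hmac
import json

# Hypothetical per-source secret. A real blockchain account would sign with a
# private key; the shared-key HMAC here is only a stand-in for that signature.
SOURCE_KEYS = {"crm-system": b"demo-secret-key"}


def _payload(source_id: str, data_hash: str) -> bytes:
    """Canonical byte form of the fields that get signed."""
    return json.dumps({"source": source_id, "data_hash": data_hash}, sort_keys=True).encode()


def make_record(source_id: str, chunk: bytes) -> dict:
    """Build the audit record a source would submit when a chunk is ingested."""
    data_hash = hashlib.sha256(chunk).hexdigest()
    signature = hmac.new(
        SOURCE_KEYS[source_id], _payload(source_id, data_hash), hashlib.sha256
    ).hexdigest()
    return {"source": source_id, "data_hash": data_hash, "signature": signature}


def verify_chunk(record: dict, chunk: bytes) -> bool:
    """Check the record's signature, then re-compute the hash and compare."""
    expected = hmac.new(
        SOURCE_KEYS[record["source"]],
        _payload(record["source"], record["data_hash"]),
        hashlib.sha256,
    ).hexdigest()
    if not hmac.compare_digest(expected, record["signature"]):
        return False
    return hashlib.sha256(chunk).hexdigest() == record["data_hash"]


chunk = b"customer_id,amount\n42,19.99\n"
record = make_record("crm-system", chunk)
print(verify_chunk(record, chunk))                # original data
print(verify_chunk(record, chunk + b"tampered"))  # modified data
```

The first check succeeds and the second fails, because any change to the data changes its SHA-256 hash and it no longer matches the signed record.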

As this data gets cleaned and sanitized, the outcome of each operation is again recorded in the blockchain. Similarly, when the data gets rolled up or aggregated, the result is recorded in the blockchain. This creates a comprehensive, immutable audit record of the data in the data lake, which can be independently verified to ascertain the authenticity of the data.
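To make the idea of an audit record spanning ingestion, cleaning, and aggregation more concrete, here is a minimal single-process sketch of a hash-chained log in Python. It is an assumption-laden toy, with no network, consensus, or signatures, and the names `append_block`, `block_hash`, and `chain_is_valid` are hypothetical. It shows only why altering an earlier entry is detectable.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(block: dict) -> str:
    """Hash the block's contents, including the link to the previous block."""
    body = {k: block[k] for k in ("index", "operation", "data_hash", "prev_hash")}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_block(chain: list, operation: str, data: bytes) -> None:
    """Record the outcome of one operation (ingest, clean, aggregate, ...)."""
    block = {
        "index": len(chain),
        "operation": operation,
        "data_hash": hashlib.sha256(data).hexdigest(),
        "prev_hash": chain[-1]["hash"] if chain else GENESIS_PREV,
    }
    block["hash"] = block_hash(block)
    chain.append(block)


def chain_is_valid(chain: list) -> bool:
    """Re-compute every hash and check every link back to the first block."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i > 0 else GENESIS_PREV
        if block["prev_hash"] != expected_prev or block["hash"] != block_hash(block):
            return False
    return True


chain = []
append_block(chain, "ingest", b"42, 19.99 \n43,5.00\n")
append_block(chain, "clean", b"42,19.99\n43,5.00\n")
append_block(chain, "aggregate", b"total,24.99\n")
print(chain_is_valid(chain))  # untouched chain

# Rewriting an earlier step without re-computing the hashes breaks validation.
chain[1]["data_hash"] = hashlib.sha256(b"forged").hexdigest()
print(chain_is_valid(chain))
```

Each block stores the hash of its predecessor, so a change to any recorded step invalidates that block's own hash and, if re-computed, every link after it.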

Note that one can either record the complete raw data (unencrypted or encrypted) in the blockchain or simply record the hash of the raw data. The former ensures that the complete data is available in the blockchain for future reference, but implies additional, redundant storage. The latter keeps the blockchain more compact while giving the same level of verification, provided the raw data itself is retained elsewhere so that its hash can be re-computed.

Data Provenance in Supply Chain Management

Supply Chain Management requires the ability to track the origin and movement of high-value goods, such as luxury items, pharmaceuticals, and electronics, across a supply chain. Let us consider the journey of an expensive watch (say a Rolex) from the manufacturer to the buyer. This journey can go through multiple hops, and any supply chain journey stretching over time and distance can potentially suffer from counterfeiting and theft. In such a scenario, how can the buyer be assured of the authenticity of the purchased item?

Blockchains, along with digital tokens, can be used in these scenarios as follows. The manufacturer assigns a digital token to each high-value item in the supply chain. Whenever the physical item changes hands in the real world, the corresponding digital token is re-assigned in the blockchain. This ensures that the blockchain tracks the journey of the high-value item in the real world. On receipt of the item, the buyer can then backtrack and verify the chain back to the origin, i.e. all the way to the manufacturer.
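The token hand-off described above can be illustrated with a small in-memory Python sketch. Everything here is hypothetical: `TokenLedger` and its methods are illustrative names and not the API of any blockchain platform, and a real deployment would authorize each transfer with the sender's signature rather than a plain string comparison.

```python
class TokenLedger:
    """In-memory stand-in for an on-chain token registry (hypothetical)."""

    def __init__(self, manufacturer: str) -> None:
        self.manufacturer = manufacturer
        self.history = {}  # token_id -> ordered list of (sender, receiver)

    def mint(self, token_id: str) -> None:
        """The manufacturer assigns a new token to a physical item."""
        if token_id in self.history:
            raise ValueError(f"token {token_id} already exists")
        self.history[token_id] = [(None, self.manufacturer)]

    def owner(self, token_id: str) -> str:
        """Current holder is the receiver of the most recent transfer."""
        return self.history[token_id][-1][1]

    def transfer(self, token_id: str, sender: str, receiver: str) -> None:
        """Re-assign the token when the physical item changes hands."""
        if self.owner(token_id) != sender:
            raise PermissionError(f"{sender} does not hold token {token_id}")
        self.history[token_id].append((sender, receiver))

    def custody_path(self, token_id: str) -> list:
        """Every holder of the token, in order, starting at the origin."""
        return [receiver for _, receiver in self.history[token_id]]

    def is_authentic(self, token_id: str, holder: str) -> bool:
        """True if the token exists, began at the manufacturer, and ends at holder."""
        if token_id not in self.history:
            return False
        path = self.custody_path(token_id)
        return path[0] == self.manufacturer and path[-1] == holder


ledger = TokenLedger("manufacturer")
ledger.mint("watch-0001")
ledger.transfer("watch-0001", "manufacturer", "distributor")
ledger.transfer("watch-0001", "distributor", "retailer")
ledger.transfer("watch-0001", "retailer", "buyer")

print(ledger.custody_path("watch-0001"))
print(ledger.is_authentic("watch-0001", "buyer"))
print(ledger.is_authentic("watch-9999", "buyer"))  # token that was never minted
```

Because a transfer is accepted only from the current holder, the recorded path forms an unbroken chain of custody, and an item with no token or with a path that does not start at the manufacturer fails the check.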

In the fashion industry, Provenance Consulting has already partnered with fashion designer Martine Jarlgaard to build a blockchain solution that aids transparency and helps substantiate authenticity claims. The solution tracks and traces a garment from the first mile, i.e. from the sourcing of the materials for the apparel itself.

IBM has already started testing its blockchain technology with Walmart to track a product from the farm all the way to the store shelf. This helps from a food safety perspective and is a clear example of the value that blockchain technology provides to supply chain management (SCM).

Summary

Data Provenance needs are quite similar across industry verticals: they require tracking the history of data or transactions in a way that simplifies verification and validation while preserving the integrity of the recorded data. Being an immutable distributed ledger, blockchain lends itself nicely to these needs.
