From Torrents to Git: How Merkle Trees Protect Your Files

Learn how Merkle trees are used in Bitcoin, Torrenting, Git, and DNSSEC to maintain data integrity and prevent tampering of information. Discover how Merkle

Written by

Todd Price

Published On

13 Mar 2023

As explained in a previous article, Merkle trees are a powerful and efficient data structure that provide a robust mechanism for ensuring data integrity in the Bitcoin system.

Merkle trees in Bitcoin and other systems

While Bitcoin has certainly popularised Merkle trees, there are a range of other systems which use a Merkle tree to satisfy the same integrity requirements.

Just as including the Merkle root in the block header ensures that any alteration of a transaction in the block will result in a different Merkle root, which will in turn invalidate the published header for that block other systems publish a Merkle root to a similar end.

Ensuring data integrity with Merkle trees

Distributing Merkle roots for other data sets can ensure the integrity of the set being shared or be useful in identifying if a single data object, that has been deconstructed into smaller parts, is missing any pieces or is ordered in the correct fashion.

Merkle trees in Torrenting and Git

It is for this reason that torrenting systems also use Merkle trees to ensure the integrity of downloaded data.

In a Torrent, the data is divided into fixed-size pieces, and each piece is hashed. These hashes are then combined in a Merkle Tree, with each leaf node representing a piece hash. The root of the tree is included in the Torrent file, allowing the downloader to verify the integrity of the downloaded data.

This is interesting as it is not necessarily only a data set that is being used to construct a Merkle tree from, but even a single file can be ‘Merklized’ by dividing it into smaller chunks which can then be provably all reconstructed to the specification of the original file.

In the case of torrents, this is done by the torrent client when creating a torrent out of a file where the chunk size can be selected.

Merkle trees for file integrity and version control

The popular version control system Git, invented by Linus Torvalds is used by probably millions of software developers worldwide to manage changes to their codebase in a collaborative environment.

It achieves this functionality through its use of Merkle trees. Through the lifecycle of software development, various versions of code snippets or documentation are iterated upon and without a version control system, this could become extremely messy.

Rather than saving files as config_version1.js, config_version2.2.js etc etc, git allows changes to a file to be committed to a code repository’s working tree in their iterated versions without having to change the filename to reflect that.

By leveraging the fact that a change in the content of a file will be reflected in a change in the hash of the file, this property can be used to create a kind of content addressability where rather than file names reflecting the change, its hash does instead.

Each commit in Git represents a snapshot of the repository at a specific point in time. When a developer makes changes to the codebase, Git creates a new commit that includes the changes and metadata such as the author and commit message.

To store this information, Git uses a directed acyclic graph (DAG) data structure, with each commit represented as a node in the graph. Each commit node in the DAG contains a reference to the parent commit(s), allowing Git to track the history of changes to the codebase over time.

This is similar to Bitcoin in that the chain of block headers is also a DAG, as is the ledger of unspent transaction outputs that will be used as inputs in future transactions.

Merkle trees are a type of DAG themselves and the Git data structure is sometimes referred to as a Merkle DAG.

Git creates its Merkle DAG for each commit by recursively hashing the contents of all files and directories in the commit.

Each file and directory is represented as a leaf node in the tree, with the hash of the file or directory contents as the leaf value. The parent node of each pair of leaf nodes is calculated by hashing their combined hash values, and this process continues recursively until a single hash, known as the Merkle Root, is produced.

The Merkle Root of the tree is stored in the commit object, along with other metadata such as the commit message and author.

By storing the Merkle Root in the commit object, Git ensures that any changes to the contents of the commit will result in a different Merkle Root, making it easy to detect any tampering attempts.

When a developer checks out a particular commit, Git verifies the integrity of the commit by recalculating the Merkle Root and comparing it to the stored value in the commit object.

If the calculated Merkle Root matches the stored value, the commit is considered valid, and the developer can be sure that the contents of the commit have not been tampered with.

This can allow for easy comparison between files or directories when working on branches of the main tree. Before a change is to be merged into the main tree, warnings can be presented that notify a user they will be replacing an old version with a newer one, or where other such conflicts arise.

It can also provide a mechanism for inspecting earlier commits, and algorithms which allow one to traverse the ‘difference of’ two particular versions of a file can allow a developer to navigate back to earlier states of the repository in the event of new work introducing breaking changes.

Merkle trees in DNSSEC

Other systems which use Merkle trees are the Domain Name System Security Extensions (DNSSEC) to distribute and validate DNS records. DNSSEC uses a Merkle tree to store the signatures of each DNS record in a zone file.

Securing DNS records with Merkle trees

By using Merkle trees, DNSSEC ensures that a malicious attacker cannot modify or alter the DNS records in a zone without being detected.

Merkle trees in Certificate Authorities

Certificate Authorities responsible for managing https security certificates use Merkle trees to create Certificate Transparency logs, which are used to publicly track issued SSL/TLS certificates. Each SSL/TSL certificate is represented as a leaf node in the Merkle tree, with the root hash included in a publicly available log.

While traditional Merkle Trees are built in a chronological manner, with each node representing a hash of its children, multidimensional Merkle Trees combine both chronological and lexicological trees to create a more efficient structure for traversing and querying.

By taking advantage of both dimensions, multidimensional Merkle Trees can significantly reduce the size of the tree and improve its performance. For instance, multidimensional Merkle Trees can be used in Git to group commits by date and time, as well as by the content of the commits.

Systems such as Merkle2 have developed multidimensional Merkle trees which allow for quick look up in Certificate Transparency logs.

Merkle trees for information security

It’s likely that one of the biggest BSV transaction generators CertiHash uses something like this in their Sentinel Node product as well to detect penetrations into their client’s systems.

Deduplicating data with Merkle trees

ForkBase uses the Merkle tree to provide a version control system with granularity down to the row of a particular CSV file, so that branchable applications can provide data deduplication.

This means that only the changes in a dataset need to be updated when a version is shared or uploaded to a single repository. For example, uploading a 300 KB CSV file to the platform, then offline adding a few extra bytes of data in the file, saving a new version of it and then uploading that to the platform only results in 300.001 KB of storage required rather than 600.001 KB.

Merkle trees for identity

Another amazing application of Merkle trees that was introduced by Dr Craig S Wright during his recent Bitcoin Masterclass series was for deconstructing the individual data elements of an identity document and creating a Merkle tree from those.

In that respect, if all the data fields of your passport or driver’s licence are individually leaf node values on a Merkle tree, then this allows someone the ability to selectively expose the required information only.

When a driver’s licence or passport is issued, the issuer could create a public registry of the Merkle roots from every data field value of the identity document that they have issued.

When someone then goes to a venue where they are required to show their proof of age to enter the venue, they can only expose their date of birth, photo and the complimentary values for those leaf nodes Merkle path back to the Merkle root.

The bouncer at the door can then recalculate the Merkle root that is listed on the public registry from the Passport Office or the body that issues driver’s licences and it can then be demonstrated that the photo matches and the guest is in fact of age.

Zero-knowledge proofs

Zero-knowledge proofs could even allow for the provision of an age provably over 18, yet with no specific details on their date of birth.

Other similar checks could also be issued where a local club might require guests from within a certain radius to become members and their home’s address could be provably verified to lie outside the particular zone without actually providing the guests address.

Websites that need to restrict access to their site to only those over the age of 18 could require the provision of a similar data envelope when authenticating the session and we can then limit the exposure of minors to gambling, pornography or R rated movies.

100 point identity checks for mundane services such as libraries, or real estate applications can be achieved without having to leak all your details to the service provider.

In the event that a tenant damaged a house, then the police could be notified that client identified by a particular Merkle root value hasn’t honoured their contract, and the relevant details could then be provided from the Passport Office with a hash based representation of any supplementary material such as a warrant, court order or infringement notice being logged on chain and correlating to the particular incident.

Merkle proofs create a win-win for honest actors

Such solutions significantly enhance the privacy of the clients when compared to the conventional methods of providing identity while at the same time they also enhance the capabilities of a party to achieve remedy for any damages incurred by a patron.

This is a win-win situation for all but those who seek to operate dishonestly. That in itself is then a huge win for the trust in our various social apparatuses.

News

Latest News

Get the latest updates on the Blockchain industry and BSV ecosystem, including new projects, developments, and events.

All News