Version: 3.1.1 (latest)

Checksum Action Policy

The ability to check that data is consistent between Source and Target Filesystems is business-critical. The Migration Action Policy “Skip if file checksums are consistent” was introduced to provide this check and improve data integrity.

Checksums from the Source Filesystem are stored as metadata on Target files, objects or blobs. In some cases (HDFS Targets), the Target checksum is also stored as part of the metadata. When deciding whether a file needs to be migrated, if the file already exists on the Target with the correct size, the checksums are compared. If they are consistent, the file is not migrated.
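The decision described above can be sketched as follows. This is only an illustration of the logic, not Data Migrator's implementation: the `FileInfo` class and `needs_migration` function are hypothetical names, and the handling of a Target file with no stored checksum is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    """Hypothetical view of a file on the Source or Target."""
    size: int
    # Checksum of the Source file. On the Target, this is the value
    # stored as metadata when the file was migrated.
    source_checksum: Optional[str] = None


def needs_migration(source: FileInfo, target: Optional[FileInfo]) -> bool:
    """Return True if the file should be (re)migrated."""
    if target is None:
        return True  # the file does not exist on the Target yet
    if target.size != source.size:
        return True  # sizes differ, so the content cannot match
    if target.source_checksum is None:
        return True  # assumption: no stored checksum means migrate
    # Sizes match, so the checksums decide.
    return target.source_checksum != source.source_checksum
```

A file with a matching size but a different checksum is migrated, which is the case a size-only comparison would miss.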

This is a stronger check than the Skip If Size Match Action Policy alone.

Checksum Action Policy

The Checksum Action Policy can only be applied to HDFS, S3, GCS and ADLS Filesystems. Both the Source and the Target must be one of these supported Filesystems.
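As a minimal sketch of this rule, the check below returns true only when both ends of a Migration are supported. The function name and the string labels for Filesystem types are hypothetical and do not come from Data Migrator's API.

```python
# Filesystem types listed as supported for the Checksum Action Policy.
SUPPORTED_FILESYSTEMS = frozenset({"HDFS", "S3", "GCS", "ADLS"})


def checksum_policy_supported(source_type: str, target_type: str) -> bool:
    """Both the Source and the Target must be supported Filesystems."""
    return (
        source_type.upper() in SUPPORTED_FILESYSTEMS
        and target_type.upper() in SUPPORTED_FILESYSTEMS
    )
```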

caution

There is a performance cost to supporting this Action Policy due to extra RPC calls. HDFS Filesystems require the largest increase in RPC calls, so they have a comparatively larger performance cost.

note

For S3-compatible Sources, Data Migrator requires that the storage implements ETags on objects.

Updating an existing Migration to use the Checksum Action Policy

The Action Policy of a Migration can be updated whilst the Migration is being Reset. As long as both the Source and Target Filesystems support the Checksum Action Policy, there should be an option available to select when Resetting the Migration.

Switching to this Action Policy is an expensive operation, as all files are retransferred with the new checksum metadata. This affects licence usage.

However, there is an alternative that does not require files to be remigrated: using the CRC32C checksums that were introduced in Apache Hadoop 3.1.1. It applies to the following Source -> Target combinations:

  • HDFS -> GCS
  • HDFS -> HDFS

If these respective Filesystems support CRC32C and are configured to use this checksum type, then the files that are consistent will not be re-transferred when the Migration is Reset.
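To illustrate why a CRC-based checksum can be compared across Filesystems that store data in different chunk layouts, the sketch below combines per-chunk CRC32C values into one whole-file value and shows that the result does not depend on the chunk size. It is a simplified, dependency-free illustration, not Hadoop's or Data Migrator's code. All function names are hypothetical, and `md5_of_md5s` is only a stand-in for a layout-dependent checksum (it hashes chunk data directly rather than per-chunk CRCs).

```python
import hashlib

POLY = 0x82F63B78  # reflected CRC-32C (Castagnoli) polynomial


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C; slow but needs no third-party library."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def _times(mat, vec):
    """Multiply a 32x32 GF(2) matrix (a list of columns) by a vector."""
    out, i = 0, 0
    while vec:
        if vec & 1:
            out ^= mat[i]
        vec >>= 1
        i += 1
    return out


def _square(mat):
    """Square a GF(2) matrix."""
    return [_times(mat, col) for col in mat]


def crc32c_combine(crc_a: int, crc_b: int, len_b: int) -> int:
    """CRC of A+B from crc(A), crc(B) and len(B), without re-reading data."""
    if len_b <= 0:
        return crc_a
    # Operator that advances the CRC register over one zero bit.
    op = [POLY] + [1 << n for n in range(31)]
    op = _square(_square(_square(op)))  # 1 bit -> 8 bits (one zero byte)
    while len_b:
        if len_b & 1:
            crc_a = _times(op, crc_a)
        op = _square(op)
        len_b >>= 1
    return crc_a ^ crc_b


def composite_crc(data: bytes, chunk_size: int) -> int:
    """Combine per-chunk CRCs into one whole-file CRC."""
    total = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        total = crc32c_combine(total, crc32c(chunk), len(chunk))
    return total


def md5_of_md5s(data: bytes, chunk_size: int) -> str:
    """Simplified stand-in for a checksum that depends on chunk layout."""
    digests = b"".join(
        hashlib.md5(data[s:s + chunk_size]).digest()
        for s in range(0, len(data), chunk_size)
    )
    return hashlib.md5(digests).hexdigest()


data = bytes(range(256)) * 8
# The combined CRC is the same for any chunk size and equals the whole-file CRC.
print(composite_crc(data, 64) == composite_crc(data, 500) == crc32c(data))
# The nested MD5 changes when the chunk size changes.
print(md5_of_md5s(data, 64) == md5_of_md5s(data, 500))
```

The first comparison prints `True` and the second prints `False`: only the combined CRC gives a value that two differently laid-out copies of the same file can share.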

To enable this functionality, the following properties need to be set on both the Source and Target Filesystems.

  • For HDFS Filesystems, the property dfs.checksum.combine.mode=COMPOSITE_CRC tells HDFS to calculate a combined CRC of the individual CRCs instead of an MD5-of-MD5-of-CRCs. This can be set on the CLI using the --properties flag with the filesystem add hdfs command or the filesystem update hdfs command.

  • For Google Cloud Storage, set the properties fs.gs.checksum.type=CRC32C and dfs.checksum.combine.mode=COMPOSITE_CRC. These properties otherwise default to NONE, meaning the Checksum Action Policy will rely on MD5 checksums. These properties must be set through the CLI or the API.
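The property requirements listed above can be captured in a small lookup, which may be useful when checking a Filesystem's configuration before Resetting a Migration. This is a hedged sketch: the `REQUIRED_PROPERTIES` table and `missing_properties` helper are hypothetical and do not call Data Migrator's CLI or API. Only the property names and values are taken from the list above.

```python
# Properties required for the CRC32C-based comparison, per Filesystem type.
REQUIRED_PROPERTIES = {
    "hdfs": {"dfs.checksum.combine.mode": "COMPOSITE_CRC"},
    "gcs": {
        "fs.gs.checksum.type": "CRC32C",
        "dfs.checksum.combine.mode": "COMPOSITE_CRC",
    },
}


def missing_properties(fs_type: str, configured: dict) -> dict:
    """Return the required properties that are absent or set differently."""
    required = REQUIRED_PROPERTIES.get(fs_type.lower(), {})
    return {
        name: value
        for name, value in required.items()
        if configured.get(name) != value
    }


# Example: a GCS Filesystem with only one of the two properties set.
print(missing_properties("gcs", {"fs.gs.checksum.type": "CRC32C"}))
```

An empty result means the Filesystem already has the values described above.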

note

After existing Filesystems have their properties updated, LDM needs to be restarted to allow these changes to take effect. More information here.