Auto source cleanup
Auto source cleanup is a feature that removes data from the source filesystem after the data is migrated successfully to a target.
To use this feature in production, please contact WANdisco Support.
Don't enable auto source cleanup for migrations with migration verifications.
Using auto source cleanup and migration verification on the same migration will cause the verification report to list files intentionally deleted during auto source cleanup as discrepancies.
Prerequisites
You have the Admin or Migration Manager role assigned.
Learn more about user roles.
You have an ingest process set up to copy folders you want to migrate and clean up to a temporary space on your source filesystem.
Info: To protect production data from being deleted, copy your data to a temporary folder.
Add the temporary paths to your migration when you're creating it.
Don't create a migration with auto source cleanup enabled on a production dataset path (for example, /data1). Instead, create a new folder (for example, /mig1) and make periodic copies of the data from /data1 to /mig1. When the data arrives in /mig1, Data Migrator transfers it and then removes it. This reduces the risk of inadvertently removing production data.
For local and NAS filesystems, Data Migrator needs write access to source files to perform auto source cleanup. Without write access, auto source cleanup won't work and files won't be removed from the source.
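The copy-then-move pattern described above can be sketched for a local source filesystem as follows. This is a minimal illustration, not part of Data Migrator: the function name, the scratch folder, and the batch name are hypothetical, and it assumes the scratch folder and the migration path are on the same filesystem so that the final rename happens in one step.

```python
import os
import shutil
from pathlib import Path


def stage_and_publish(production_dir: str, scratch_dir: str,
                      migration_dir: str, batch_name: str) -> Path:
    """Copy production data to a scratch folder, then move the finished
    copy into the migration path.

    All arguments are hypothetical example paths. The scratch folder must
    sit outside the migration path and on the same filesystem.
    """
    staged = Path(scratch_dir) / batch_name
    # The slow part: copying. Production data is only read, never moved.
    shutil.copytree(production_dir, staged)

    destination = Path(migration_dir) / batch_name
    # A rename within one filesystem is atomic on POSIX systems, so the
    # batch appears in the migration path complete, not half-written.
    os.rename(staged, destination)
    return destination
```

For example, `stage_and_publish("/data1", "/scratch", "/mig1", "batch-001")` would leave /data1 untouched and place a full copy at /mig1/batch-001. HDFS sources need the equivalent HDFS tooling instead of these local filesystem calls.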
Use cases
If you ingest large volumes of data, we migrate the data to your cloud target and clean it from your source Hadoop Distributed File System (HDFS), local filesystem, or network-attached storage (NAS) to free up space, keeping your buffer size and costs to a minimum.
Data Migrator does the following:
- Checks that source files exist on the target before removing them from the source.
- Ignores files you’ve specified to exclude from the migration to your target so these aren't removed from your source. See Configure exclusions.
We support this feature for the following use cases:
- HDFS to HDFS
- HDFS to Amazon Simple Storage Service (Amazon S3)
- HDFS to Google Cloud Platform (GCP)
- HDFS to Azure Data Lake Storage Gen2 (ADLS Gen2)
- Local filesystem and NAS to your chosen target
Manage auto source cleanup with the UI
Enable auto source cleanup
You can enable auto source cleanup when you create a migration or at any time afterward.
- In the WANdisco® UI, create a migration with HDFS, NAS, or local filesystem as a source, and a cloud target.
- Under Migration Type, select Live, Recurring, or One-time.
- Under Advanced Options, select Enable source cleanup.
- Select a deletion mode:
- Immediately - Delete files from the source after verifying they’re on the target.
- After a file has not changed for - Delete files from the source after a selected period of no activity.
- Enter the minimum number of hours or days you want files to have existed on the target before being deleted from source. Select hour(s) or day(s) from the dropdown.
Info: Review the ingest contract for your selected deletion mode for guidance on interacting with your migration paths while auto source cleanup is enabled.
- Select the acknowledgement checkbox(es) to enable the auto source cleanup feature.
- Continue creating your migration.
Source files that are being updated after you’ve enabled auto source cleanup and started migrating data from the source won't be migrated or removed. You will receive notifications to this effect in Data Migrator.
Reenabling auto source cleanup will require a rescan of source data.
If you reenable auto source cleanup, it defaults to the previous deletion mode (Immediately or After a file has not changed for a specified amount of time).
Changing (enabling, disabling, or reenabling) auto source cleanup settings will reset a non-recurring migration. If the migration was in a Running, Live, Scheduled, or Completed state before the change, it will restart.
Disable auto source cleanup
You can disable auto source cleanup at any time after enabling it.
- In the WANdisco® UI, go to the migration for which you want to disable auto source cleanup.
- Select the Auto Source Cleanup tab.
- Uncheck the Enable source cleanup checkbox.
- Select Save.
This will return the migration to the state it was in before auto source cleanup was enabled. When auto source cleanup is disabled, data will not be removed from the source filesystem after the data is migrated successfully to a target.
Check if auto source cleanup is enabled
Select the migration from your dashboard and go to the Auto Source Cleanup page. If the Enable source cleanup checkbox is selected, auto source cleanup is enabled.
Monitor the cleanup
Monitoring applies only to live migrations. Recurring and one-time migrations don't show unsupported events.
On the Notifications page, you can view notifications for “unsupported events” on the source.
Unsupported events include changes made to files and directories that you added to the migration for which cleanup is enabled. Because we can’t remove source files or directories that are changing, we notify you of these events, such as file or path renames or new files added to paths.
Manage auto source cleanup with the CLI
You can use auto source cleanup with the CLI using the migration add, migration auto source cleanup, and migration update configuration commands.
You can enable auto source cleanup when you create a migration or at any time afterward.
Create a new migration with auto source cleanup
Create a migration with auto source cleanup enabled using the migration add command with the --deletion-mode and --delayed-deletion-period parameters:
migration add --migration-id migration1 --path /examplePath --target exampleTargetFS --deletion-mode DELAYED_DELETION --delayed-deletion-period 12H
Enable auto source cleanup for an existing migration
Enable auto source cleanup with the CLI using the migration auto source cleanup command:
migration auto source cleanup [--migration-id] string
[--deletion-mode] string
[--delayed-deletion-period] string
[--action-policy] string
Mandatory parameters
--migration-id
The ID of the migration you want to update.
Optional parameters
--deletion-mode
The deletion mode for the migration. There are three options available:
- IMMEDIATE - Delete files from the source after verifying they’re on the target.
- DELAYED_DELETION - Delete files from the source after a selected period of no activity. If you use DELAYED_DELETION, you need to specify a time period using the --delayed-deletion-period parameter.
- NO_DELETION - Don't delete files from the source. Use this option to disable auto source cleanup.
Info: Review the ingest contract for your selected deletion mode for guidance on interacting with your migration paths while auto source cleanup is enabled.
--delayed-deletion-period
The minimum number of hours (H) or days (D) you want files to have existed on the target before being deleted from source. For example, 6H (six hours).
--action-policy
This parameter determines what happens if the migration encounters content in the target path with the same name and size. In the UI, this is called Skip Or Overwrite Settings.
There are two options available:
- com.wandisco.livemigrator2.migration.OverwriteActionPolicy (default policy) - Every file is replaced, even if file size is identical on the target storage. In the UI, this is called Overwrite.
- com.wandisco.livemigrator2.migration.SkipIfSizeMatchActionPolicy - If the file size is identical between the source and target, the file is skipped. If it’s a different size, the whole file is replaced. In the UI, this is called Skip if Size Match.
migration auto source cleanup --migration-id migration1 --deletion-mode IMMEDIATE
migration auto source cleanup --migration-id migration1 --deletion-mode DELAYED_DELETION --delayed-deletion-period 6H
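If you generate these commands from a script, the rule that DELAYED_DELETION needs a --delayed-deletion-period can be checked before the command is sent. The following is a minimal sketch: the helper name is hypothetical, it uses only the parameters documented above, and it assumes a period is a whole number followed by H or D (the only form shown in the examples). Rejecting a period for other modes is this helper's own choice, not documented CLI behavior.

```python
import re
from typing import Optional

DELETION_MODES = {"IMMEDIATE", "DELAYED_DELETION", "NO_DELETION"}
# Assumption: a period is a positive whole number plus H (hours) or D (days).
PERIOD_PATTERN = re.compile(r"[1-9]\d*[HD]")


def cleanup_command(migration_id: str, deletion_mode: str,
                    delayed_deletion_period: Optional[str] = None) -> str:
    """Build a 'migration auto source cleanup' command line as a string."""
    if deletion_mode not in DELETION_MODES:
        raise ValueError(f"unknown deletion mode: {deletion_mode}")

    parts = ["migration", "auto", "source", "cleanup",
             "--migration-id", migration_id,
             "--deletion-mode", deletion_mode]

    if deletion_mode == "DELAYED_DELETION":
        if delayed_deletion_period is None:
            raise ValueError("DELAYED_DELETION needs a delayed deletion period")
        if not PERIOD_PATTERN.fullmatch(delayed_deletion_period):
            raise ValueError(f"expected a period such as 6H or 2D, "
                             f"got {delayed_deletion_period!r}")
        parts += ["--delayed-deletion-period", delayed_deletion_period]
    elif delayed_deletion_period is not None:
        # Helper-level strictness only; the CLI's own handling is not covered here.
        raise ValueError("a period only applies to DELAYED_DELETION")

    return " ".join(parts)
```

Calling `cleanup_command("migration1", "DELAYED_DELETION", "6H")` returns the second example command shown above.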
You can't change the deletion mode of a migration after you've configured it.
Disable auto source cleanup
You can disable auto source cleanup at any time after enabling it.
Disable auto source cleanup with the CLI using the migration auto source cleanup command with the --deletion-mode parameter set to NO_DELETION:
migration auto source cleanup --migration-id migration1 --deletion-mode NO_DELETION
The deletionMode value output by the migration show command doesn't change when you disable auto source cleanup.
Check if auto source cleanup is enabled
Check if auto source cleanup is enabled using the migration show command with the --detailed flag:
migration show --migration-id migration1 --detailed
The command output contains the following values:
"deletionMode": "DELAYED_DELETION",
"delayedDeletionPeriodSeconds": 86400,
"autoSourceCleanupEnabled": true
autoSourceCleanupEnabled displays true if auto source cleanup is enabled and false if disabled.
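Because deletionMode keeps its previous value after cleanup is disabled, a script that reads this output should test autoSourceCleanupEnabled first. The sketch below assumes the three values have already been parsed into a dictionary (for example, from JSON); the function name and the wording of the returned text are hypothetical.

```python
import json


def cleanup_status(show_output: dict) -> str:
    """Summarize the auto source cleanup state from parsed output values."""
    # Check the enabled flag first: deletionMode alone is not reliable,
    # because it does not change when cleanup is disabled.
    if not show_output.get("autoSourceCleanupEnabled", False):
        return "disabled"

    mode = show_output["deletionMode"]
    if mode == "DELAYED_DELETION":
        hours = show_output["delayedDeletionPeriodSeconds"] / 3600
        return f"enabled, delayed deletion after {hours:g} hours"
    return f"enabled, mode {mode}"


sample = json.loads(
    '{"deletionMode": "DELAYED_DELETION",'
    ' "delayedDeletionPeriodSeconds": 86400,'
    ' "autoSourceCleanupEnabled": true}'
)
print(cleanup_status(sample))  # enabled, delayed deletion after 24 hours
```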
Update delayed deletion period without disabling
Update existing auto source cleanup settings using the migration update configuration command:
migration auto source cleanup --migration-id migration1 --delayed-deletion-period 12H
Reporting
To check the correct files have been removed from your source and to ensure you have accurate information for auditing purposes, you can access reports which you can download and share.
Reports are:
- Created every four hours automatically. The reporting period for the current date is four hours. The first report runs for 00:00 - 03:59, the next for 04:00 - 07:59, 08:00 - 11:59, and so on.
- Placed in a folder whose name is derived from the migration ID. The location of the folder is /opt/wandisco/livedata-migrator/db/sourcecleanup.
- A record of all the files that have been removed from the source during cleanup.
- A record of what has been deleted successfully.
- Available for immediate and delayed deletes.
- Available for download in the following file formats:
  - .jsonl (uncompressed)
  - tar.gz (compressed)
The four-hour reports are compressed into a daily report.
You can view and download a report while a migration is still in progress.
The reporting period for archived reports is 24 hours, for example, from 00:00 to 23:59.
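To find which four-hour report covers a given deletion time, you can compute the window from the timestamp. This is a small sketch based only on the schedule described above (windows starting at 00:00, 04:00, 08:00, and so on); the function name is hypothetical and it makes no assumption about how report files are named.

```python
from datetime import datetime, timedelta

WINDOW_HOURS = 4  # reports are created every four hours


def report_window(moment: datetime) -> tuple:
    """Return the (start, end) of the four-hour reporting window for a time."""
    start_hour = (moment.hour // WINDOW_HOURS) * WINDOW_HOURS
    start = moment.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    # Windows are written as 00:00 - 03:59, so the end is one minute
    # before the next window starts.
    end = start + timedelta(hours=WINDOW_HOURS) - timedelta(minutes=1)
    return start, end


start, end = report_window(datetime(2023, 2, 21, 9, 30))
print(start.strftime("%H:%M"), "-", end.strftime("%H:%M"))  # 08:00 - 11:59
```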
If a migration is reset, the reporting still captures files that were removed from the source before the migration was reset. All cleanup operations after the reset are captured in the same report. The cleanup report is simply added to a directory that contains the new name of the reset migration.
Reporting with the UI
- Select a data migration for which auto source cleanup is enabled.
- Select Auto Source Cleanup and go to the Source Cleanup History panel.
If files were removed from the source, you can see the report files generated. Download the files to view them:
- In the last 4 hours under Latest Reports. For example, 21.02.2023-08:00:00.jsonl, 21.02.2023-12:00:00.jsonl.
- In the last 24 hours under Archived Reports. For example, 20.02.2023.jsonl.gz.
- To download reports to check which files were removed from your source filesystem and compare the results with your target filesystem, select the download icon for the report that matches your needs.
You can delete archived reports only.
Reporting with the CLI
Use the following commands for source cleanup reporting:
migration deletion-report list
View a list of recent cleanup reports.
migration deletion-report list [--migration-id] string
[--date] string
Mandatory parameters
--migration-id
Specify the ID of the migration for which you want to view a list of cleanup reports.
Optional parameters
--date
Enter the date if you want to view a list of all the cleanup reports after this date. Use one of the following date formats:
- DD.MM.YYYY
- DD-MM-YYYY
- DD/MM/YYYY
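If you build the --date value in a script, you can check it against the three accepted formats before calling the CLI. A minimal sketch follows; the function name is hypothetical, and the check is slightly more lenient than the formats listed because Python also accepts single-digit days and months.

```python
from datetime import date, datetime

# The three documented formats: DD.MM.YYYY, DD-MM-YYYY, DD/MM/YYYY.
DATE_FORMATS = ("%d.%m.%Y", "%d-%m-%Y", "%d/%m/%Y")


def parse_report_date(text: str) -> date:
    """Parse a date written in one of the documented formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"{text!r} is not in DD.MM.YYYY, DD-MM-YYYY, or DD/MM/YYYY form")


print(parse_report_date("21.02.2023"))  # 2023-02-21
```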
migration deletion-report download
Download source cleanup reports.
migration deletion-report download [--migration-id] string
[--report-name] string
[--out-dir] string
Mandatory parameters
--migration-id
Specify the ID of the migration for which you want to download a cleanup report.
--report-name
Specify the name of the report you want to download for a migration.
--out-dir
Specify the directory to which you want to download the report.
Example: Download a cleanup report
migration deletion-report download --migration-id ab123c03-697b-48a5-93cc-abc23838d37d-1668593022565 --report-name <example_22.02.2023-12:00:00.jsonl> --out-dir /user/ExampleCleanupDirectory
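After downloading a .jsonl report, you can compare its entries with a listing of your target filesystem. The sketch below is an illustration under a loud assumption: the record layout of the report isn't described here, so the field that holds the removed file's path is passed in as a parameter, and the default name "path" is hypothetical. The function names are hypothetical too.

```python
import json


def removed_paths(report_file: str, path_field: str = "path") -> set:
    """Collect the paths listed in a .jsonl report (one JSON object per line).

    path_field is a hypothetical field name; set it to whatever key your
    report records actually use.
    """
    paths = set()
    with open(report_file, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            paths.add(record[path_field])
    return paths


def missing_on_target(report_file: str, target_paths, path_field: str = "path") -> list:
    """Return report paths that do not appear in the target listing."""
    return sorted(removed_paths(report_file, path_field) - set(target_paths))
```

An empty result from `missing_on_target` means every path in the report also appears in the target listing you supplied.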
migration deletion-report delete
Delete source cleanup reports.
migration deletion-report delete [--migration-id] string
[--report-names] string
Mandatory parameters
--migration-id
Specify the ID of the migration for which you want to delete cleanup reports.
--report-names
Specify comma-separated report names of the reports you want to delete for a migration.
View reports for deleted migrations
You can view reports for deleted migrations. After a migration is deleted in the UI, you can view the report in the directory /opt/wandisco/livedata-migrator/db/sourcecleanup. The sub-directory names for the cleanup reports are derived from the migration IDs.
Download reports for deleted migrations
You can download reports for deleted migrations using the CLI command migration deletion-report download. For more information, see the Command reference.
Ingest contracts
Immediately
Data Migrator can delete a file after it is made available for migration and successfully migrated to the target.
You can't interact with or modify paths within a migration with immediate deletion configured.
The only supported source filesystem operation for a migration with immediate deletion configured is moving content into the migration path atomically (using the mv command) from outside the migration.
If you replace existing content on the source for a migration with immediate deletion configured, there is no guarantee that the new content will be migrated. The old version of the file may be migrated and the new version deleted.
Depending on the Skip or Overwrite Settings, you can replace content on the target for a migration with immediate deletion configured if you verify that the path on the source is empty before writing to it.
Note: If Data Migrator has deleted a path after successfully migrating it to the target, it is possible to rewrite the source content and expect that the new changes will be replicated to the target.
Confirm that Data Migrator deleted a path by checking it doesn't exist on the source or checking the audit log to see if it's registered as a deleted path.
New content written to the source path can be replicated safely by then adding a rescan directory to the path. For recurring migrations, the change will be picked up automatically in the future scan iterations.
Info: If the target action policy for the migration is SKIP_IF_SIZE_MATCH, the new changes will only be replicated if the file size has changed.
In migrations with an event stream that have immediate deletion configured, Data Migrator ignores all events except for moving data into the migration from outside the migration.
After a file has not changed for x days/hours
- Data Migrator can delete each individual file after it meets the following criteria:
- The age of the file on the source is at least equal to the delay period
- The path is a file
- The file on the source is older than the file on the target
- The file exists on the target and the source, and is consistent
- The file on the target is older than the delay period
- A file that can be deleted by Data Migrator is not guaranteed to be deleted immediately.
- Interaction with a file (reading/appending/replacing) ready for deletion is not safe or recommended.
- Delete operations are not supported while auto source cleanup is enabled. This is to prevent deletions made by Data Migrator being replicated to the target.
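The criteria for delayed deletion can be read as a single check that must pass in full. The sketch below restates them in code to make the conditions easier to reason about; it is not Data Migrator's implementation. It assumes that a file's age is measured from its modification time and that "consistent" means the sizes match, and all names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FileView:
    """Hypothetical snapshot of one path on the source or the target."""
    is_file: bool
    modified: datetime
    size: int


def eligible_for_delayed_deletion(source: Optional[FileView],
                                  target: Optional[FileView],
                                  delay: timedelta,
                                  now: datetime) -> bool:
    """Return True only if every listed criterion holds."""
    if source is None or target is None:
        return False  # must exist on both the source and the target
    if not source.is_file:
        return False  # only files qualify
    if source.size != target.size:
        return False  # assumption: consistency is judged by size
    if source.modified >= target.modified:
        return False  # the source copy must be older than the target copy
    if now - source.modified < delay:
        return False  # source age must be at least the delay period
    if now - target.modified <= delay:
        return False  # target copy must be older than the delay period
    return True
```

Even when this check would pass, the file is not guaranteed to be deleted immediately, as noted above.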