Configure an ADLS Gen2 source
Migrate data from an Azure Data Lake Storage Gen2 (ADLS Gen2) file system by configuring it as your source file system for Data Migrator.
You can add your ADLS Gen2 file system as either a live or static source.
To add your ADLS Gen2 source as live, the change feed option must be enabled on the storage account.
When adding an ADLS Gen2 source as a static source, use the One-time Migration option in the UI or the --scan-only flag when adding with the CLI.
Migrations added to a live ADLS Gen2 source support both Live and One-time Migration types.
Find all migration types and target file systems supported here.
If you add an ADLS Gen2 file system as a live source, all live migrations created with this file system have Target Match enabled. See the Target Match documentation for more information.
Limitations for live ADLS Gen2 source
- The change feed events can be delayed by approximately 1 minute before being visible to Data Migrator.
- For a live ADLS Gen2 source, the change feed option must be enabled on the storage account.
- Events are not strictly ordered in ADLS Gen2, so Data Migrator takes a cautious approach to directory renames. A directory rename sometimes triggers a rescan of the paths involved. This creates additional work for Data Migrator but is necessary to keep the source and target consistent.
- Contact Support for further information and any questions about live ADLS Gen2 as a source.
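One way to meet the change feed requirement listed above is with the Azure CLI. The Python sketch below only assembles the argument list for az storage account blob-service-properties update and does not run it. The resource group and account names are placeholders, and the flags should be confirmed against your installed Azure CLI version before use.

```python
import shlex


def change_feed_enable_args(resource_group: str, account_name: str) -> list:
    """Build Azure CLI arguments that turn on the blob change feed.

    Nothing is executed here; pass the list to subprocess.run() only
    after checking it against your own environment.
    """
    return [
        "az", "storage", "account", "blob-service-properties", "update",
        "--resource-group", resource_group,
        "--account-name", account_name,
        "--enable-change-feed", "true",
    ]


# Placeholder names, not values from a real subscription.
args = change_feed_enable_args("my-resource-group", "myadls2")
print(shlex.join(args))
```

Building the arguments as a list keeps each value separate, so no shell quoting is needed if the list is later handed to subprocess.run().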
There is a known limitation of the az storage fs file append Azure CLI command: its current implementation doesn't produce events in the file system's event stream.
Use alternative commands or methods if you want to test the live replication functionality with Data Migrator.
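When testing live replication, any check on the target should allow for the change feed delay of roughly one minute noted in the limitations. A minimal polling sketch in Python follows. The target_has_file function is a hypothetical stand-in for whatever check you use against the target, and the timeout and interval values are assumptions, not Data Migrator settings.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool],
             timeout_s: float = 180.0,
             interval_s: float = 10.0,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll condition() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical stand-in for a real target check: succeeds on the third poll.
attempts = {"count": 0}


def target_has_file() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


found = wait_for(target_has_file, timeout_s=5.0, interval_s=0.0)
```

In a real test, a timeout of a few minutes leaves room for both the event delay and the transfer itself.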
Add ADLS Gen2 source with the UI
From the Dashboard, select an instance under Instances.
In the Filesystems & Agents menu, select Filesystems.
Select Add source file system.
Select Azure Data Lake Storage (ADLS) Gen2.
Enter the following details:
- Display Name - Enter a name for your file system.
- Data Lake Storage Endpoint - The storage endpoint to connect to. You can override the default value (dfs.core.windows.net) by replacing it with a custom or private endpoint.
- Authentication Type - The authentication type to use when connecting to your file system. Select either Shared Key or Service Principal (OAuth2).
If you use Shared Key as the Authentication Type, enter the following details:
- Account Name - The Microsoft Azure account name that owns the data lake storage.
- Access Key - The access key associated with the Microsoft Azure account.
- Container Name - The ADLS Gen2 container you want to migrate data from.
If you use Service Principal (OAuth2) as the Authentication Type, enter the following details:
- Account Name - The name of your ADLS Gen2 storage account.
- Container Name - The ADLS Gen2 container you want to migrate data from.
- Client ID - The client ID (also known as application ID) for your Azure service principal.
- Secret - The client secret (also known as application secret) for the Azure service principal.
- OAuth2 Endpoint - The client endpoint for the Azure service principal. Use the format https://login.microsoftonline.com/<tenant>/oauth2/v2.0/token, where <tenant> is the directory ID for the Azure service principal.
Select Use Secure Protocol to use TLS when connecting to Azure Data Lake Storage. This option is enabled by default.
Under Filesystem Options, select either Live Migration or One-time Migration. Selecting One-time Migration limits the available migration types to one-time and recurring. See migration types to learn more about each type.
Select Save to add the file system.
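The fields required by each authentication type in the steps above can be summarized as a small lookup. This Python sketch is only a checklist for gathering values before opening the UI. The field names mirror the UI labels, and the function is illustrative, not part of Data Migrator.

```python
REQUIRED_FIELDS = {
    "Shared Key": ["Account Name", "Access Key", "Container Name"],
    "Service Principal (OAuth2)": [
        "Account Name", "Container Name", "Client ID", "Secret",
        "OAuth2 Endpoint",
    ],
}


def missing_fields(auth_type: str, details: dict) -> list:
    """Return the required fields that are absent or empty for auth_type."""
    if auth_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown authentication type: {auth_type!r}")
    return [name for name in REQUIRED_FIELDS[auth_type]
            if not details.get(name)]


# Example values only.
example = {"Account Name": "myadls2", "Container Name": "myContainer"}
still_needed = missing_fields("Shared Key", example)
print(still_needed)
```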
Add ADLS Gen2 source with the CLI
Use either the filesystem add adls2 oauth or filesystem add adls2 sharedKey command, depending on your file system's authentication type.
Specify the --scan-only option to configure the source file system as non-live.
OAuth
Add a live ADLS Gen2 source file system using the filesystem add adls2 oauth CLI command, which requires a service principal and OAuth 2.0 credentials.
See the official Microsoft documentation to find out more about OAuth and Azure.
Live OAuth example
filesystem add adls2 oauth --file-system-id myLiveSource
--storage-account-name myadls2
--oauth2-client-id b67f67ex-ampl-e2eb-bd6d-client9385id
--oauth2-client-secret 2IPO8__Secret__-9OPs8n*TexampleHJ=
--oauth2-client-endpoint https://login.microsoftonline.com/something/oauth2/v2.0/token
--container-name myContainer
--source
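The --oauth2-client-endpoint value in the example above follows the pattern https://login.microsoftonline.com/<tenant>/oauth2/v2.0/token, where <tenant> is the directory ID of the service principal. A small Python sketch that builds this value is shown below. The function name and the placeholder tenant ID are illustrative assumptions.

```python
def oauth2_token_endpoint(tenant_id: str) -> str:
    """Return the token endpoint for an Azure directory (tenant) ID."""
    tenant_id = tenant_id.strip()
    if not tenant_id or "/" in tenant_id:
        raise ValueError("tenant_id must be a non-empty directory ID")
    return f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"


# Placeholder tenant ID for illustration only.
endpoint = oauth2_token_endpoint("00000000-0000-0000-0000-000000000000")
print(endpoint)
```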
Add a non-live ADLS Gen2 source file system. Use the --scan-only option to configure the file system as non-live. Live changes from the event stream are not replicated.
Non-live OAuth example
filesystem add adls2 oauth --file-system-id mySource
--storage-account-name myadls2
--oauth2-client-id b67f67ex-ampl-e2eb-bd6d-client9385id
--oauth2-client-secret 2I____Secret____n*TexampleHJ=
--oauth2-client-endpoint https://login.microsoftonline.com/something/oauth2/v2.0/token
--container-name myContainer
--scan-only
--source
See the command reference for all options when using the filesystem add adls2 oauth command.
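The live and non-live OAuth examples differ only in the --scan-only flag. The Python sketch below assembles either form from the same inputs, using only the options shown in the examples above. The function is a hypothetical helper, not part of the Data Migrator CLI, and because CLI quoting rules are not covered here it rejects values that contain whitespace.

```python
def adls2_oauth_add_command(file_system_id: str, storage_account: str,
                            client_id: str, client_secret: str,
                            client_endpoint: str, container: str,
                            scan_only: bool = False) -> str:
    """Assemble a 'filesystem add adls2 oauth' command for a source."""
    options = [
        ("--file-system-id", file_system_id),
        ("--storage-account-name", storage_account),
        ("--oauth2-client-id", client_id),
        ("--oauth2-client-secret", client_secret),
        ("--oauth2-client-endpoint", client_endpoint),
        ("--container-name", container),
    ]
    parts = ["filesystem", "add", "adls2", "oauth"]
    for flag, value in options:
        # Assumption: values that would need quoting are out of scope.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{flag} needs a non-empty value without whitespace")
        parts += [flag, value]
    if scan_only:
        parts.append("--scan-only")  # non-live source
    parts.append("--source")
    return " ".join(parts)


# Placeholder credentials; substitute your own values.
common = ("myadls2", "<client-id>", "<client-secret>",
          "https://login.microsoftonline.com/<tenant>/oauth2/v2.0/token",
          "myContainer")
live = adls2_oauth_add_command("myLiveSource", *common)
non_live = adls2_oauth_add_command("mySource", *common, scan_only=True)
```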
Shared key
Add a live ADLS Gen2 source file system using the filesystem add adls2 sharedKey CLI command, which requires credentials in the form of an account key.
Live Shared key example
filesystem add adls2 sharedKey --file-system-id myLiveSource
--storage-account-name myadls2
--container-name myContainer
--shared-key Yi8NxHGqoQ79DBGLVn+COK__EXAMPLE_SHARED__vaS/NbzR5rtjEKEY31eIopUV
--source
Add a non-live ADLS Gen2 source file system using the filesystem add adls2 sharedKey command.
Use the --scan-only option to configure the file system as non-live. Live changes from the event stream are not replicated.
Non-live Shared key example
filesystem add adls2 sharedKey --file-system-id mySource
--storage-account-name myadls2
--container-name myContainer
--shared-key Yi8NxHGqoQ79DBGLVn+COK__EXAMPLE_SHARED__vaS/NbzR5rtjEKEY31eIopUV
--scan-only
--source
See the command reference for all options when using the filesystem add adls2 sharedKey command.
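If you script the shared key command, you may prefer not to keep the account key in the script itself. The Python sketch below reads it from an environment variable and assembles the command using only the options shown in the examples above. The variable name ADLS_SHARED_KEY and the helper function are assumptions made for illustration.

```python
import os


def adls2_shared_key_add_command(file_system_id: str, storage_account: str,
                                 container: str, scan_only: bool = False,
                                 key_env_var: str = "ADLS_SHARED_KEY") -> str:
    """Assemble a 'filesystem add adls2 sharedKey' command for a source."""
    shared_key = os.environ.get(key_env_var)
    if not shared_key:
        raise RuntimeError(f"Set {key_env_var} before building the command")
    parts = [
        "filesystem add adls2 sharedKey",
        "--file-system-id", file_system_id,
        "--storage-account-name", storage_account,
        "--container-name", container,
        "--shared-key", shared_key,
    ]
    if scan_only:
        parts.append("--scan-only")  # non-live source
    parts.append("--source")
    return " ".join(parts)


# Demo value only; in practice the variable is set outside the script.
os.environ.setdefault("ADLS_SHARED_KEY", "EXAMPLE_SHARED_KEY==")
command = adls2_shared_key_add_command("mySource", "myadls2", "myContainer",
                                       scan_only=True)
```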
Manage source filesystem
Remove or show the details of your source ADLS Gen2 file system using the CLI.
Additional information
Azure identity transformer superuser replacement
By default, when migrating from an ADLS Gen2 source, $superuser is replaced with the current Data Migrator user when it appears as the owner or owning group of a file or directory.
This behavior is controlled by the fs.azure.identity.transformer.skip.superuser.replacement property, which defaults to false.
To retain $superuser ownership, set this property to true when adding your ADLS Gen2 source file system.
This property applies to $superuser ownership only.
filesystem add adls2 sharedKey --file-system-id adlsSource --storage-account-name StorageAC1 --container-name container1 --shared-key M2oSHAREDKEY2pNL== --source --scan-only --properties fs.azure.identity.transformer.skip.superuser.replacement=true
The fs.azure.identity.transformer.skip.superuser.replacement property is defined in the official Hadoop documentation.
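The effect of this property can be pictured with a small model. The Python sketch below is not the Hadoop or Data Migrator implementation. It only mirrors the behavior described above, and the user name is a placeholder.

```python
SUPERUSER = "$superuser"


def effective_identity(identity: str, current_user: str,
                       skip_superuser_replacement: bool = False) -> str:
    """Model how an owner or owning group is reported after migration.

    With the default (False), "$superuser" is replaced by current_user.
    With True, "$superuser" is kept. Other identities are not affected
    by this setting.
    """
    if identity == SUPERUSER and not skip_superuser_replacement:
        return current_user
    return identity


# "dm_user" is a placeholder for the current Data Migrator user.
default_owner = effective_identity("$superuser", "dm_user")
retained_owner = effective_identity("$superuser", "dm_user",
                                    skip_superuser_replacement=True)
other_owner = effective_identity("alice", "dm_user")
```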