Configure source filesystems
Configure a source filesystem for each product you want to migrate data from. What you add depends on your environment:
- Hadoop Distributed File System (HDFS) - Add only one source filesystem for each product.
- S3 sources (IBM Cloud Object Storage, Amazon S3) - Add one or more source filesystems.
Data Migrator supports the following filesystems as a source:
- Hadoop Distributed File System (HDFS)
- Amazon S3
- S3
- IBM Cloud Object Storage
- A local filesystem
- ADLS Gen2
- Mounted Network-Attached Storage (NAS)
- Google Cloud Storage (GCS)
Configure source filesystems with the UI
The Filesystems option under the Filesystems & Agents menu shows the source and target filesystems Data Migrator can use for data migrations.
Select Filesystems to:
- View and configure source and target filesystems.
- Add or remove targets.
Add a source filesystem
To add a source filesystem from your dashboard:
- From the Dashboard, select a product under Products.
- In the Filesystems & Agents menu, select Filesystems.
- Select Add source filesystem.
For information about configuring filesystem health check notifications and email alerts, see Configure email notifications with the UI.
If you have HDFS in your environment, Data Migrator automatically detects it as your source filesystem. However, if Kerberos is enabled, or if your Hadoop configuration doesn't contain the configuration file information required for Data Migrator to connect to Hadoop, configure an HDFS source with additional Kerberos configuration.
If you want to configure a new source manually, delete any existing source first, and then manually add a new source.
If you deleted the HDFS source that Data Migrator detected automatically, and you want to redetect it, go to the CLI and run the command `filesystem auto-discover-source hdfs`.
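If you do this, you can confirm that the source is registered again with the `source show` command (described under Manage your source filesystem below). For example:

```
filesystem auto-discover-source hdfs
source show
```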
- HDFS
- Amazon S3
- S3
- IBM Cloud Object Storage (preview)
- Local Filesystem
- ADLS Gen2 (preview)
Configure HDFS as a source
If a file is removed from the source after a one-time migration has completed the initial scan, subsequent rescans will not remove the file from the target. Rescanning will only update existing files or add new ones; it will not remove anything from the target.
One exception to this is the removal of a partition. Removing a partition is an action taken in Hive, and the metadata change will be replicated live by Hive Migrator. The data under that partition will remain on the target regardless of whether it has been deleted on the source. However, since the partition was removed in the metadata, the data inside won't be visible to queries on the target.
Configure an S3 bucket as a source
- Select your Data Migrator product from the Products panel.
- Select Add Source Filesystem.
- Select S3 from the Filesystem Type dropdown list.
- Enter the following details:
- Display Name - The name you want to give your source filesystem.
- Bucket Name - The reference name of the S3 bucket you are using.
- Access Key - Enter the access key. For example, `RANDOMSTRINGACCESSKEY`.
- Secret Key - Enter the secret key that corresponds with your access key. For example, `RANDOMSTRINGPASSWORD`.
- S3A Properties - Data Migrator uses Hadoop's S3A library to connect to S3 filesystems. Enter key/value pairs to apply additional properties.
You need to define an S3 endpoint using the `fs.s3a.endpoint` parameter so that Data Migrator can connect to your source. For example, `fs.s3a.endpoint=https://example-s3-endpoint:80`.
- Select Save.
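You can also add an S3 source from the CLI with the `filesystem add s3a` command (see Configure source filesystems with the CLI below). The following is a minimal sketch using placeholder values; the `--bucket-name`, `--access-key`, and `--secret-key` option names are assumptions based on the UI fields above, so confirm them against the CLI's built-in help before relying on them:

```
# Sketch only - placeholder values; the bucket and key option names are assumed
filesystem add s3a --file-system-id mys3source --bucket-name example-bucket --access-key RANDOMSTRINGACCESSKEY --secret-key RANDOMSTRINGPASSWORD --source --properties fs.s3a.endpoint=https://example-s3-endpoint:80
```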
When migrating files from an S3 source to an HDFS target, the user that writes the files will be the file owner. In Data Migrator, this is the user that is mapped to the principal used to authenticate with the target.
Additionally, S3 object stores don't retain RWX permissions. Anything migrated from an S3 object store to an HDFS target will have 'rwxrwxrwx' permissions.
Configure IBM Cloud Object Storage as a source (preview)
Enter the following:
- Filesystem Type - The type of filesystem source. Select IBM Cloud Object Storage.
- Display Name - The name you want to give your IBM Cloud Object Storage.
- Access Key - The access key for your authentication credentials, associated with the fixed authentication credentials provider `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider`.
Although IBM Cloud Object Storage can use other providers (for example InstanceProfileCredentialsProvider, DefaultAWSCredentialsProviderChain), they're only available in the cloud, not for on-premises. As on-premises is currently the expected type of source, these other providers have not been tested and are not currently selectable.
- Secret Key - The secret key that corresponds with your access key, used by the `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` credentials provider.
- Bucket Name - The name of your Cloud Object Store bucket.
- Topic - The name of the Kafka topic to which the notifications will be sent.
- Endpoint - An endpoint for a Kafka broker, in a host/port format.
- Bootstrap Servers - A comma-separated list of host and port pairs that are addresses for Kafka brokers on a "bootstrap" Kafka cluster that Kafka clients use to bootstrap themselves.
- Port - The TCP port used for connection to the IBM Cloud Object Storage bucket. Default is 9092.
Migrations from IBM Cloud Object Storage use Amazon S3, along with its filesystem classes. The main difference between IBM Cloud Object Storage and Amazon S3 is in the messaging services: SQS Queue for Amazon S3, and Kafka for IBM Cloud Object Storage.
Configure notifications for migration
Migrating data from IBM Cloud Object Storage requires that filesystem events are fed into a Kafka-based notification service. Whenever an object is written, overwritten, or deleted using the S3 protocol, a notification is created and stored in a Kafka topic - a message category under which Kafka publishes the notifications stream.
Configure Kafka notifications
Enter the following information into the IBM Cloud Object Storage Manager web interface.
- Select the Administration tab.
- In the Notification Service section, select Configure.
- On the Notification Service Configuration page, select Add Configuration.
- In the General section, enter the following:
- Name: A name for the configuration, for example "IBM Cloud Object Storage Notifications".
- Topic: The name of the Kafka topic to which the notifications will be sent.
- Hostnames: A list of Kafka node endpoints in host:port format. Note that larger clusters may support multiple nodes.
- Type: Type of configuration.
- (Optional) In the Authentication section, select Enable authentication and enter your Kafka username and password.
- (Optional) In the Encryption section, select Enable TLS for Apache Kafka network connections.
- If the Kafka cluster is encrypted using a self-signed TLS certificate, paste the root CA key for your Kafka configuration in the Certificate PEM field.
- Select Save. A message appears confirming that the notification configuration was created successfully, and it's listed in the Notification Service Configurations table.
- Select the name of the configuration (the Name you entered in the General section) to assign vaults.
- In the Assignments section, select Change.
- In the Not Assigned tab, select vaults and select Assign to Configuration. Filter available vaults by selecting or typing a name into the Vault field.
Notification configurations can't be assigned to containers vaults, mirrored vaults, vault proxies, or vaults that are migrating data. Once a notification is assigned to configuration, an associated vault can't be used in a mirror, with a vault proxy, or for data migration.
Only new operations that occur after a vault is assigned to the configuration will trigger notifications.
- Select Update.
For more information, see the Apache Kafka documentation.
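Before you start a migration, you can check that bucket events are reaching the topic by reading a few messages with Kafka's console consumer. This is only an illustrative check; the broker address and topic name below are placeholders that must match the Bootstrap Servers and Topic values you configured, and any TLS or SASL client settings your cluster requires are omitted:

```
# Placeholder broker and topic - substitute the values from your notification configuration
kafka-console-consumer.sh --bootstrap-server kafka-broker.example.com:9092 --topic cos-notifications --from-beginning --max-messages 5
```

Writing or deleting a test object in an assigned vault over the S3 protocol should then produce a notification message on the topic.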
Configure a local filesystem as a source
Enter the following:
- Filesystem Type - The type of filesystem source. Select Local Filesystem.
- Display Name - Enter a name for your source filesystem.
- Mount Point - The local filesystem directory path to use as the source filesystem. You can migrate any data in the Mount Point directory.
Local filesystems don't provide change notifications, so Live Migration isn't enabled for local filesystem sources.
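The CLI equivalent is the `filesystem add local` command (see Configure source filesystems with the CLI below). A minimal sketch with placeholder values follows; the option that maps to the Mount Point field is assumed here to be `--fs-root`, so verify the exact option name with the CLI's built-in help:

```
# Sketch only - placeholder values; --fs-root is assumed to correspond to the Mount Point field
filesystem add local --file-system-id mylocalsource --fs-root /mnt/source-data --source
```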
Configure Azure Data Lake Storage (ADLS) Gen2 as a source (preview)
ADLS Gen2 as a source is currently a preview feature, and is subject to change.
You can use ADLS Gen2 for one-time migrations only - not for live migrations.
Enter the following:
- Filesystem Type - The type of filesystem source. Select Azure Data Lake Storage (ADLS) Gen2.
- Display Name - Enter a name for your source filesystem.
- Data Lake Storage Endpoint - This defaults to `dfs.core.windows.net`.
- Authentication Type - The authentication type to use when connecting to your filesystem. Select either Shared Key or Service Principal (OAuth2).
- You'll be asked to enter the security details of your Azure storage account. These will vary depending on which Authentication Type you select. See below.
- Use Secure Protocol - This checkbox determines whether to use TLS encryption in communication with ADLS Gen2. This is enabled by default.
The Azure storage account details necessary will vary depending on whether you selected Shared Key or Service Principal (OAuth2):
Shared key
- Account Name - The Microsoft Azure account name that owns the data lake storage.
- Access Key - The access key associated with the Microsoft Azure account.
- Container Name - The ADLS Gen2 container you want to migrate data from.
Service principal (OAuth2)
- Account Name - The Microsoft Azure account name that owns the data lake storage.
- Container Name - The ADLS Gen2 container you want to migrate data from.
- Client ID - The client ID (also known as application ID) for your Azure service principal.
- Secret - The client secret (also known as application secret) for the Azure service principal.
- Endpoint - The client endpoint for the Azure service principal. This will often take the form of https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token where {tenant} is the directory ID for the Azure service principal. You can enter a custom URL (such as a proxy endpoint that manually interfaces with Azure Active Directory).
Select Save.
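The CLI equivalent for a service principal source is `filesystem add adls2 oauth` (a shared key example appears in the next section, and both commands are listed in the CLI command table below). The sketch below uses placeholder values; the options shown in the shared key example (`--file-system-id`, `--storage-account-name`, `--container-name`, `--source`, `--scan-only`) are expected to carry over, but the OAuth2-specific option names are assumptions to verify against the CLI's built-in help:

```
# Sketch only - placeholder values; the oauth2 option names are assumed
filesystem add adls2 oauth --file-system-id adlsOauthSource --storage-account-name StorageAC1 --container-name container1 --oauth2-client-id exampleAppId --oauth2-client-secret exampleSecret --oauth2-client-endpoint https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token --source --scan-only
```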
Azure identity transformer superuser replacement
By default, when migrating from an ADLS Gen2 source, `$superuser` is replaced with the current Data Migrator user when it appears as the owner or owning group of a file or directory.
This behavior is controlled by the `fs.azure.identity.transformer.skip.superuser.replacement` property, which defaults to `false`. To adjust this behavior and retain the `$superuser` ownership, set this property to `true` when adding your ADLS Gen2 source filesystem.
This property applies to `$superuser` ownership only.
filesystem add adls2 sharedKey --file-system-id adlsSource --storage-account-name StorageAC1 --container-name container1 --shared-key M2oSHAREDKEY2pNL== --source --scan-only --properties fs.azure.identity.transformer.skip.superuser.replacement=true
You can find the `fs.azure.identity.transformer.skip.superuser.replacement` property defined in the official Hadoop documentation.
Configure source filesystems with the CLI
Data Migrator migrates data from a single source filesystem. Data Migrator automatically detects the Hadoop Distributed File System (HDFS) it's installed on and configures it as the source filesystem. If it doesn't detect the HDFS source automatically, you can validate the source. You can override auto-discovery of any HDFS source by manually adding a source filesystem.
At this time, Azure Data Lake Storage (ADLS) Gen2 source filesystems can only be used for one-time migrations.
Use the following CLI commands to add source filesystems:
Command | Action |
---|---|
filesystem add adls2 oauth | Add an ADLS Gen 2 filesystem resource using a service principal and oauth credentials |
filesystem add adls2 sharedKey | Add an ADLS Gen 2 filesystem resource using access key credentials |
filesystem add hdfs | Add an HDFS resource |
filesystem add s3a | Add an S3 filesystem resource. You can choose this when using Amazon S3, Oracle, and IBM Cloud Object Storage. To specify a particular S3 filesystem type, use the --s3type parameter. See s3a optional parameters. |
filesystem add local | Add a local or mounted NAS filesystem resource. |
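For example, a Kerberos-enabled HDFS source might be added as follows. This is a sketch only: the filesystem URI, keytab, and principal are placeholders, and the `--default-fs`, `--kerberos-keytab`, and `--kerberos-principal` option names are assumptions to check against the CLI's built-in help for `filesystem add hdfs`:

```
# Sketch only - placeholder values; verify option names with the CLI help for filesystem add hdfs
filesystem add hdfs --file-system-id mysource --default-fs hdfs://namenode.example.com:8020 --kerberos-keytab /etc/security/keytabs/hdfs.headless.keytab --kerberos-principal hdfs@REALM.EXAMPLE.COM --source
```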
Validate your source filesystem
Verify that the correct source filesystem is registered, or delete the existing one (you define a new source in the Add a source filesystem step).
If Kerberos is enabled or your Hadoop configuration does not contain the information needed to connect to the Hadoop filesystem, use the `filesystem auto-discover-source hdfs` command to enter your Kerberos credentials and auto-discover your source HDFS configuration.
If Kerberos is disabled, and Hadoop configuration is on the host, Data Migrator will detect the source filesystem automatically on startup.
Manage your source filesystem
Manage the source filesystem with the following commands:
Command | Action |
---|---|
source clear | Delete all sources |
source delete | Delete one source |
source show | View the source filesystem configuration |
filesystem auto-discover-source hdfs | Enter your Kerberos credentials to access your source HDFS configuration |
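For example, to review the currently registered source and then remove all sources before adding a replacement:

```
# Review the registered source, then delete all sources so a new one can be added
source show
source clear
```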
To update existing filesystems, first stop all migrations associated with them.
After saving updates to your configuration, you'll need to restart the Data Migrator service for your updates to take effect. In most supported Linux distributions, run the command `service livedata-migrator restart`.