Configure an HDFS source
You can migrate data from a Hadoop Distributed File System (HDFS) by configuring it as your source filesystem for Data Migrator. After you install Data Migrator on an HDFS cluster, Data Migrator automatically configures that cluster as a source for data migrations. However, if you have Kerberos enabled or if your Hadoop configuration isn't in a default location, you'll have to configure your HDFS source manually using the steps below.
You can also use these steps to set up a custom HDFS source instead of the default.
If you deleted the HDFS source that Data Migrator detected automatically, and you want to detect it again, run the command `filesystem auto-discover-source hdfs` in the CLI. For more information about this CLI command, see the command reference.
Prerequisites
You need the following:
- An HDFS cluster running Hadoop 2.6 or above.
- If Kerberos is enabled on your filesystem, a valid keytab containing a suitable principal for the HDFS superuser must be available on the Linux host.
- Oracle Big Data Services (BDS) - If you're running Oracle's Distribution of Apache Hadoop (ODH), Data Migrator must use fully qualified hostnames that can be resolved in DNS. Ensure that the following configuration property overrides are added for the source filesystem (configuration property overrides are an option under Additional Configuration on the Filesystem Configuration screen):
  - `dfs.client.use.datanode.hostname=true`
  - `dfs.datanode.use.datanode.hostname=true`
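If you add the source with the CLI instead of the UI, the same two overrides can be supplied as key/value pairs with the `--properties` parameter covered later on this page. A sketch, assuming key=value pairs and placeholder values for the other parameters:

```
filesystem add hdfs --file-system-id mysource
                    --default-fs hdfs://sourcenameservice
                    --properties dfs.client.use.datanode.hostname=true,dfs.datanode.use.datanode.hostname=true
                    --source
```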
Configure an HDFS source filesystem with the UI
From the Dashboard, select a product under Products.
In the Filesystems & Agents menu, select Filesystems.
Select Add source filesystem.
Select Hadoop Distributed File System (HDFS) from the Filesystem Type dropdown list.
Enter the following details:
- Display Name - Enter a name for your source filesystem.
- Default Filesystem - Enter the default filesystem property value (fs.defaultFS) from your HDFS configuration. For example, hdfs://examplenameservice1 or hdfs://node1.example.com:8020.
- Kerberos Configuration
  - Kerberos Principal - Enter a principal that will map to the HDFS superuser using `auth_to_local` rules, or add the Data Migrator user principal to the superuser group on the Hadoop cluster you're using.
    - For example: Create the Kerberos principal `ldmuser@realm.com`. Using `auth_to_local` rules, ensure the principal maps to the user `hdfs`, or that the user `ldmuser` is added to the superuser group.
  - Kerberos Keytab Location - Enter the path to the Kerberos keytab file containing the Kerberos Principal. The keytab file must be accessible to the local system user running the Data Migrator service (the default is `hdfs`) and must be accessible from the edge node where Data Migrator is installed (see the verification sketch at the end of this section).
    - For example: Copy the `ldmuser.keytab` file (where `ldmuser` is your intended user) containing the Kerberos principal into the `/etc/security/keytabs/` directory on the edge node running Data Migrator, make its permissions accessible to the HDFS user running Data Migrator, and enter the `/etc/security/keytabs/ldmuser.keytab` path during Kerberos configuration for the filesystem.
- Additional Configuration - Enter override properties or additional properties by adding key/value pairs.
  - Configuration Property File Paths - Enter the directory or directories containing your HDFS configuration (such as the `core-site.xml` and `hdfs-site.xml`) on your Data Migrator host's local filesystem. This is required if you have Kerberos or a High Availability (HA) HDFS.
    Note: Data Migrator reads `core-site.xml` and `hdfs-site.xml` once, during filesystem creation, applying any configuration within paths added under Configuration Property File Paths. After creation, these paths are no longer visible in the UI. You can see all filesystem properties using the API.
  - Configuration Property Overrides (Optional) - Enter override properties or additional properties for your HDFS filesystem by adding key/value pairs.
- Success File - Enter the file name or glob pattern that Data Migrator will use to recognize client application success files when they are created in migration directories. These files will be migrated last after all other data in the directory has been successfully migrated. You can use these files to confirm the directory they're in has finished migrating.
- Filesystem Options
- Select Live Migration to include Live as a migration type when creating migrations. Or select One-time Migration to limit migration types available to one-time and recurring. See migration types to learn more about each type.
Select Save.
You can now migrate data from the HDFS source.
For more information about configuring Kerberos, see the section below. If you have problems configuring Kerberos, see the troubleshooting section.
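As referenced in the Kerberos Keytab Location field above, the following is a minimal sketch for checking the keytab and principal on the Data Migrator host before relying on them. It uses the example principal `ldmuser@realm.com` and keytab path from above, and the `auth_to_local` rule in the final comment is only an illustration of mapping that principal to `hdfs`:

```
# List the principals in the keytab and confirm it contains the expected one
# (example path and principal from the steps above).
klist -kt /etc/security/keytabs/ldmuser.keytab

# Confirm the keytab can obtain a ticket for that principal.
kinit -kt /etc/security/keytabs/ldmuser.keytab ldmuser@realm.com
klist

# Example auth_to_local rule (added to hadoop.security.auth_to_local in core-site.xml)
# that maps ldmuser@realm.com to the hdfs superuser:
#   RULE:[1:$1@$0](ldmuser@realm\.com)s/.*/hdfs/
```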
Configure an HDFS source filesystem with the CLI
To create an HDFS source, run the `filesystem add hdfs` command in the WANdisco CLI with the `--source` parameter:
```
filesystem add hdfs [--file-system-id] string
                    [--default-fs] string
                    [--user] string
                    [--kerberos-principal] string
                    [--kerberos-keytab] string
                    [--source]
                    [--scan-only]
                    [--success-file] string
                    [--properties-files] list
                    [--properties] string
```
Mandatory parameters
- `--file-system-id` - The ID to give the new filesystem resource.
- `--default-fs` - A string that defines how Data Migrator accesses HDFS (see also the check after this list). You can enter it in the following ways:
  - As a single HDFS URI, such as `hdfs://192.168.1.10:8020` (using an IP address) or `hdfs://myhost.localdomain:8020` (using a hostname).
  - As an HDFS URI that references a nameservice if the NameNodes have high availability. For example, `hdfs://mynameservice`. For more information, see HDFS High Availability.
- `--properties-files` - Reference a list of existing properties files that contain Hadoop configuration properties in the format used by `core-site.xml` or `hdfs-site.xml`.
  Note: If you're using an HA HDFS filesystem, you must include this parameter. Define the absolute paths to the `core-site.xml` and `hdfs-site.xml` files. For example, `--properties-files /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml`.
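If you're not sure which value to pass to `--default-fs` and the Hadoop client tools are installed on the Data Migrator host, you can confirm it before creating the filesystem. A quick sketch, assuming a standard client configuration directory:

```
# Read fs.defaultFS from the active client configuration
hdfs getconf -confKey fs.defaultFS

# Or inspect it directly in core-site.xml
grep -A1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml
```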
Optional parameters
- `--user` - The name of the HDFS user to be used when performing operations against the filesystem. In environments where Kerberos is disabled, this user must be the HDFS superuser, such as `hdfs`.
- `--kerberos-principal` - The Kerberos principal to authenticate with and perform migrations as. This principal should map to the HDFS superuser using `auth_to_local` rules.
- `--kerberos-keytab` - The Kerberos keytab that contains the principal defined for the `--kerberos-principal` parameter. This must be accessible to the local system user running the Data Migrator service (the default is `hdfs`).
- `--source` - Enter this parameter to use the filesystem resource created as a source. This is referenced in the UI when configuring the Unknown source.
- `--scan-only` - Enter this parameter to create a static source filesystem for use in one-time migrations. Requires `--source`.
- `--properties` - Enter properties to use in a comma-separated key/value list.
- `--success-file` - Enter a file name or glob pattern for files that Data Migrator will migrate last from the directory they're contained in. For example, `--success-file /mypath/myfile.txt` or `--success-file /**_SUCCESS`. You can use these files to confirm the directory they're in has finished migrating.
Examples
```
filesystem add hdfs --file-system-id mysource
                    --default-fs hdfs://sourcenameservice
                    --properties-files /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
                    --source
```

```
filesystem add hdfs --file-system-id mysource
                    --default-fs hdfs://sourcenameservice
                    --properties-files /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
                    --kerberos-keytab /etc/security/keytabs/hdfs.headless.keytab
                    --kerberos-principal hdfs@SOURCEREALM.COM
                    --source
```
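As a further sketch, this example creates a static source for one-time migrations only and adds a success-file pattern; the filesystem ID and the pattern are placeholders:

```
filesystem add hdfs --file-system-id mystaticsource
                    --default-fs hdfs://sourcenameservice
                    --properties-files /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
                    --success-file /**_SUCCESS
                    --source
                    --scan-only
```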
For more CLI commands, see the command reference.
Configure Kerberos
The sections below cover the ways to set up Kerberos, depending on whether your source filesystem, your target filesystem, or both have it enabled.
Use Kerberos on the source filesystem only
To set up Kerberos on a source filesystem, enter the Kerberos details for your source filesystem in the Kerberos parameters above.
Use Kerberos on the target filesystem only
To migrate data from a source filesystem without Kerberos to a target filesystem with Kerberos:
- Copy the `krb5.conf` file with the configuration and keytabs for your target HDFS to your source HDFS.
- Open `/etc/wandisco/livedata-migrator/vars.env` and add the file path for your `krb5.conf` file to `LDM_EXTRA_JVM_ARGS`. For example: `LDM_EXTRA_JVM_ARGS="-Djava.security.krb5.conf=/etc/remote/krb5.conf"` (a command sketch follows these steps).
- Restart Data Migrator.
- Fill in the Kerberos parameters during target HDFS creation with the details of the Kerberos configuration you moved to your source HDFS.
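A minimal command sketch for the vars.env edit and the service restart above; the krb5.conf path is the example shown, and the systemd service name `livedata-migrator` is an assumption about how your installation manages the service:

```
# Append the JVM argument to vars.env (or edit the existing LDM_EXTRA_JVM_ARGS line).
# The krb5.conf path is the example from the steps above.
echo 'LDM_EXTRA_JVM_ARGS="-Djava.security.krb5.conf=/etc/remote/krb5.conf"' | sudo tee -a /etc/wandisco/livedata-migrator/vars.env

# Restart the service so the new JVM argument takes effect
# (assumes a systemd-managed service named livedata-migrator).
sudo systemctl restart livedata-migrator
```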
Use Kerberos on both filesystems with cross-realm trust
To use cross-realm trust to migrate data from a Kerberos-enabled source filesystem to a Kerberos-enabled target filesystem, fill in the parameters above with the details of a Kerberos configuration that has the correct cross-realm trust settings.
See the links below for Kerberos configuration guidance for common Hadoop distributions:
Use Kerberos on both filesystems without cross-realm trust
See Configure Kerberos.
Next steps
Configure a target filesystem to migrate data to. Then create a migration.