Configure your HDFS cluster
Data Migrator reads events from an HDFS cluster's NameNode to track changes to data on the filesystem. The NameNode properties on this page affect how quickly Data Migrator can process changes and recover from network or storage device failures during a migration.
To optimize migration performance, follow these steps:
- Navigate to your cluster manager.
- Add the properties from the table below, with their recommended values, to the location for your cluster manager:
  - Cloudera Manager - Go to an Override Snippet. For example, HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.
  - Ambari - Go to the Custom hdfs-site.xml section of the HDFS advanced configs.
- Restart the recommended services.
- Restart Data Migrator.
- Navigate to the UI, then remove and add the HDFS source again.
Select a property below for more information.
Property | Description | Default | Recommendation | Impact |
---|---|---|---|---|
dfs.namenode.num.extra.edits.retained | The number of transactions the NameNode retains. Data Migrator reads these transactions to track filesystem activity. | 1000000 | Increase this value to 25000000 to minimize the risk of losing necessary edit logs during a migration. | This setting doesn't impact cluster performance, but you need a few gigabytes of extra storage for the edits. |
dfs.namenode.inotify.max.events.per.rpc | The maximum number of events the NameNode can send to inotify clients (including Data Migrator) in one Remote Procedure Call (RPC) response. | 1000 | Increase this value to 100000 to let migrations process more events with every RPC, increasing your data migrations' maximum data transfer rate. | The increased number of RPC events increases NameNode memory consumption by 1 MB. |
dfs.namenode.max.extra.edits.segments.retained | The number of files containing logged edits that the NameNode retains on the filesystem at any given time. | 10000 | Use the default value. | No change. |
dfs.namenode.checkpoint.txns | The number of transactions (events) after which the NameNode creates a checkpoint, splitting the filesystem load by letting it read multiple, smaller checkpoints of events. | 1000000 | Use the default value. | No change. |
The UI warns you when the source filesystem isn't configured to handle enough events. To allow more events, change the HDFS service configuration as described above.
If your source is on Cloudera Manager, the UI warning will remain visible until you make the same changes to the client config in HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.
Make the same change to the service and client config to remove the UI warning. Any services that use an HDFS client need a restart after you make this change.
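The recommended values from the table can be rendered as hdfs-site.xml property elements for pasting into a safety valve or custom hdfs-site.xml section. The following is a minimal sketch, not an official tool: the helper name `render_properties` and the `RECOMMENDED` dictionary are hypothetical, and only the two properties whose recommended value differs from the default are included.

```python
from xml.sax.saxutils import escape

# Recommended values copied from the table on this page.
# Only properties that differ from their defaults are listed.
RECOMMENDED = {
    "dfs.namenode.num.extra.edits.retained": 25_000_000,
    "dfs.namenode.inotify.max.events.per.rpc": 100_000,
}


def render_properties(props):
    """Return one hdfs-site.xml <property> element per setting (hypothetical helper)."""
    lines = []
    for name, value in props.items():
        lines.append("<property>")
        lines.append(f"  <name>{escape(name)}</name>")
        lines.append(f"  <value>{escape(str(value))}</value>")
        lines.append("</property>")
    return "\n".join(lines)


snippet = render_properties(RECOMMENDED)
print(snippet)
```

On Cloudera Manager, the same output would apply to both the service and the client snippets, since both need the same values.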
Configuration properties explained
Number of extra edits retained
This is the number of additional edits (also called "events") on the system that the NameNode records in edit log files on the disk.
The NameNode logs every file edit and creates checkpoints periodically to prevent these records from stacking up indefinitely. After each checkpoint, the NameNode stashes its edits as a checkpoint file and deletes the original edit logs. However, it keeps the most recent edits in a log file, up to the number of edits specified in this property.
Data Migrator reads edits to keep track of filesystem activity for replication on the target filesystem during a migration. If Data Migrator can't access the expected edit logs past its current point in a migration, the migration fails and returns the exception org.apache.hadoop.hdfs.inotify.MissingEventsException on each of its org.apache.hadoop.hdfs.DFSInotifyEventInputStream calls.
If Data Migrator loses access to HDFS for a long time during a migration, the NameNode may delete edits that Data Migrator hasn't read yet. When Data Migrator then tries to resume from those deleted edits, the migration fails.
The recommended value is suitable for most large-scale use. If you expect extremely high data edit rates or lengthy outages during migrations, increase this property's value.
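To judge whether the recommended value covers the outages you expect, a rough estimate is the number of retained edits divided by the edit rate. The sketch below assumes a steady, hypothetical rate of 500 edits per second. Measure the real rate on your own cluster, and treat the result as an approximation, because checkpoint timing and segment limits also affect what the NameNode keeps.

```python
def retention_window_seconds(retained_edits, edits_per_second):
    """Rough time span covered by the retained edits at a steady edit rate."""
    if edits_per_second <= 0:
        raise ValueError("edits_per_second must be positive")
    return retained_edits / edits_per_second


# Hypothetical edit rate, for illustration only.
ASSUMED_EDITS_PER_SECOND = 500

default_window = retention_window_seconds(1_000_000, ASSUMED_EDITS_PER_SECOND)
recommended_window = retention_window_seconds(25_000_000, ASSUMED_EDITS_PER_SECOND)

print(f"default (1000000 edits): {default_window / 60:.1f} minutes")
print(f"recommended (25000000 edits): {recommended_window / 3600:.1f} hours")
```

Under this assumption, the default covers about half an hour of activity and the recommended value about 14 hours. A higher edit rate shortens both windows proportionally.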
Maximum inotify events from each RPC
This is the maximum number of events the NameNode can send to Data Migrator and other inotify clients in a single Remote Procedure Call (RPC) response.
Data Migrator sends RPCs to read events on the filesystem, which it uses to detect data changes that need to be migrated. Each RPC response contains at most the number of events set in this property. On filesystems with lots of activity, the default maximum of 1000 can mean the NameNode sends events more slowly than they occur on the filesystem, so migrations fall behind filesystem changes.
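The effect of this limit can be approximated as a throughput ceiling: events per RPC multiplied by RPCs per second. The sketch below is a simplified model, not a measurement. The figures of 20 RPCs per second and 50,000 filesystem events per second are hypothetical and chosen only to show the arithmetic.

```python
def max_events_per_second(max_events_per_rpc, rpcs_per_second):
    """Upper bound on events a client can read per second (simplified model)."""
    return max_events_per_rpc * rpcs_per_second


def falls_behind(max_events_per_rpc, rpcs_per_second, filesystem_events_per_second):
    """True when events occur faster than the client can read them."""
    ceiling = max_events_per_second(max_events_per_rpc, rpcs_per_second)
    return filesystem_events_per_second > ceiling


# Hypothetical figures, for illustration only.
ASSUMED_RPCS_PER_SECOND = 20
ASSUMED_FS_EVENTS_PER_SECOND = 50_000

default_ceiling = max_events_per_second(1_000, ASSUMED_RPCS_PER_SECOND)
recommended_ceiling = max_events_per_second(100_000, ASSUMED_RPCS_PER_SECOND)

print("default ceiling:", default_ceiling, "events/s")
print("recommended ceiling:", recommended_ceiling, "events/s")
print("default falls behind:",
      falls_behind(1_000, ASSUMED_RPCS_PER_SECOND, ASSUMED_FS_EVENTS_PER_SECOND))
print("recommended falls behind:",
      falls_behind(100_000, ASSUMED_RPCS_PER_SECOND, ASSUMED_FS_EVENTS_PER_SECOND))
```

In this model, the default value caps reads at 20,000 events per second, which is below the assumed activity, while the recommended value leaves ample headroom.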
Maximum extra edits segments retained
This is the number of files containing logged edits that the NameNode retains on the filesystem at any given time.
The edit log is rolled periodically. The interval is defined by dfs.ha.log-roll.period. The default interval is 120 seconds, which means each rolled edit log segment contains as many edits as occurred during a 120-second period.
To preserve the edit logs for Data Migrator, increase the number of extra edits retained instead, and keep this property at its default value.
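A quick calculation suggests why the default segment limit is usually sufficient. The sketch below assumes a steady, hypothetical rate of 500 edits per second and the default 120-second roll period, then estimates how many segments are needed to hold the recommended 25000000 retained edits. Real segment sizes vary with activity, so this is only an estimate.

```python
def edits_per_segment(edits_per_second, roll_period_seconds=120):
    """Approximate edits in one rolled segment at a steady edit rate."""
    return edits_per_second * roll_period_seconds


def segments_needed(retained_edits, per_segment):
    """Segments required to hold the retained edits (integer ceiling division)."""
    return -(-retained_edits // per_segment)


# Hypothetical edit rate, for illustration only.
ASSUMED_EDITS_PER_SECOND = 500
DEFAULT_MAX_SEGMENTS = 10_000  # default of dfs.namenode.max.extra.edits.segments.retained

per_segment = edits_per_segment(ASSUMED_EDITS_PER_SECOND)
needed = segments_needed(25_000_000, per_segment)

print("edits per segment:", per_segment)
print("segments needed:", needed)
print("fits within default limit:", needed <= DEFAULT_MAX_SEGMENTS)
```

Under these assumptions, about 417 segments hold the retained edits, well below the default limit of 10000. A much lower edit rate produces smaller segments and raises the count, so repeat the check with your own measured rate.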
NameNode checkpoint transactions
This is the number of transactions (events) after which the NameNode creates a checkpoint. Checkpointing at this interval splits the filesystem load across multiple, smaller checkpoints of events instead of a single, oversized checkpoint, which could negatively affect performance. In most cases, no modification is necessary.
Learn more
See the Hadoop documentation for more information about each of these NameNode configuration properties. See this Knowledge base article for additional information on events, actions and queues in Data Migrator.