View diagnostics
To get insights into a Data Migrator instance, you can use the UI or the CLI to view diagnostics.
We recommend that superusers work together with Customer Support when using the Diagnostics page to correlate metric values across the available histograms, graphs, and information panels.
View diagnostics with the UI
To view diagnostics with the UI:
- Go to your dashboard and select the product for which you want to view diagnostics.
- Select Diagnostics in the left navigation bar.
- Under Diagnostics Metrics, select the data duration from the dropdown list:
- 1 hour
- 1 day
- 1 week
- 30 days
Graphs in the UI
Graph | Description |
---|---|
Active file transfers/pull threads | Hover over this histogram to view the number of active file transfers at a specific point in time. A set of threads called pull threads services the queue of migration actions, such as transferring or deleting files. By default, there are 150 pull threads, so Data Migrator can execute up to 150 migration actions at a time. For more information on how files are transferred, see the section Data transfers and migration actions in the Knowledge base article How does Data Migrator work?. |
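As a rough illustration of the relationship this histogram shows, the following shell sketch computes the percentage of pull threads in use. The function name pull_thread_utilization is hypothetical and is not part of Data Migrator. The default of 150 threads comes from the description above.

```shell
# Hypothetical helper: percentage of pull threads currently in use.
# $1 = active file transfers
# $2 = number of pull threads (assumed default: 150, as described above)
pull_thread_utilization() {
  active="$1"
  threads="${2:-150}"
  # Integer arithmetic, rounded down.
  echo $(( active * 100 / threads ))
}

pull_thread_utilization 75        # prints 50
pull_thread_utilization 150 150   # prints 100
```

A value that stays near 100 suggests that all pull threads are busy. Pull threads also perform actions other than file transfers, so the value can be lower even on a busy system.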
Remote Procedure Call diagnostics
Monitor historic values for Remote Procedure Call (RPC) metrics. See RPC diagnostics.
View diagnostics with the CLI
You can view diagnostics from your terminal either as a one-time snapshot or as a continually updated view.
CLI
To get a diagnostics summary in text format, go to the CLI and run the command:
curl http://127.0.0.1:18080/diagnostics/summary.txt
To view diagnostics continually updated, go to the CLI and run the command:
watch curl -s http://127.0.0.1:18080/diagnostics/summary.txt
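If you want to compare diagnostics over time, for example when working with Customer Support, you can save snapshots to files. The following is a minimal sketch, not a Data Migrator feature: the function names, the directory argument, and the file naming scheme are assumptions made for this example. Only the URL comes from the commands above.

```shell
# Endpoint taken from the curl command above; override it if your host or port differs.
SUMMARY_URL="${SUMMARY_URL:-http://127.0.0.1:18080/diagnostics/summary.txt}"

# Hypothetical helper: build a snapshot file path.
# $1 = directory, $2 = timestamp string
snapshot_path() {
  printf '%s/diagnostics-%s.txt\n' "$1" "$2"
}

# Hypothetical helper: fetch one snapshot into a timestamped file.
# $1 = directory to store snapshots in
save_snapshot() {
  dir="$1"
  mkdir -p "$dir" || return 1
  out="$(snapshot_path "$dir" "$(date -u +%Y%m%dT%H%M%SZ)")"
  # -s hides the progress meter; -f makes curl return an error on HTTP failures.
  curl -sf "$SUMMARY_URL" -o "$out" && printf '%s\n' "$out"
}
```

For example, running save_snapshot ./dm-diagnostics writes one snapshot to that directory and prints the path of the file it created.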
Data Migrator CLI
To get a diagnostics summary with the Data Migrator CLI, run the command:
status --diagnostics
To view diagnostics continually updated with the Data Migrator CLI, run the command:
status --diagnostics --watch
Example diagnostics summary
The following is an example of a diagnostics summary output from the CLI.
Refer to the table below to understand the metrics.
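Because the summary is plain text, standard text tools can pick out a single metric. The following is a minimal sketch, assuming the metric names listed in the table below appear as text in the output. The exact layout of the summary is not assumed here, so adjust the search text to match what you see. The function name metric_lines is hypothetical.

```shell
# Hypothetical helper: print the lines of a diagnostics summary (read from stdin)
# that mention the given text.
# $1 = text to search for, such as a metric name
metric_lines() {
  # -i ignores case; -F treats the search text literally rather than as a regex.
  grep -i -F -- "$1"
}
```

For example, curl -s http://127.0.0.1:18080/diagnostics/summary.txt | metric_lines "Retransmit" prints only the lines that mention retransmits.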
Understand diagnostic metrics
The following table lists the metrics in the diagnostics summary and explains what they are for:
Metric | Description |
---|---|
Time | The current time. |
Uptime | The length of time Data Migrator has been running. If Data Migrator panics, it restarts automatically up to five times. An unexpectedly short uptime can therefore indicate a panic. |
System CPU Load | If the value of this metric approaches one (1), the system is CPU-bound. This is often due to Transport Layer Security (TLS) load. If you have configured S3 as a target, you can switch to the WildFly library. |
IO Wait | The percentage of time the system is stalled writing to disk. When the system is stalled, you see inconsistent transfer graphs. If you have configured S3 as a target, the SDK caches data to disk resulting in heavy disk usage. To improve performance, you can configure the system to use Virtual Memory. If you use a spinning disk and transfer small files, the IO wait is high as large numbers of small files require a lot of Data Migrator database updates. |
Linux Pressure | This metric includes the IO wait and CPU load, and appears only on systems running later kernels. |
JVM GcCount | The number of times Java has garbage collected and the total time spent collecting garbage. If the system has insufficient Java heap, it spends a lot of time collecting garbage and stalls. Garbage collection should take milliseconds. Check the garbage collection logs for time spent collecting garbage. |
OS Connections | The number of connections depends on the target and SDK configuration. This metric also covers the operating system (OS) receive and transmit buffers (RX/TX). If the OS has large transmit buffers (tens to hundreds of MB), Data Migrator is supplying the OS with sufficient data to transfer, so the bottleneck is not Data Migrator. |
Retransmit | The number of retransmits for the current connections. This value should remain close to zero (0). Retransmits indicate stalls on the connections and poor network performance. |
Transfer Bytes | The number of bytes transferred within a timeframe. The smallest timeframe is 10 seconds. The transfer bytes metric is measured at the application layer and there is buffering in the SDK and in the OS. |
Transfer Files Rate | The rate at which files are transferred. If you transfer large files, the rate is low. Data Migrator transfers multiple files in parallel. If the number of transfer bytes is low, check the transfer files rate metric value. If the transfer files rate is high, Data Migrator is probably migrating small files and can't fill the network pipe. |
Active Transfers/Pull Threads | The number of files Data Migrator is actively migrating and the number of files it can migrate in parallel. The count of active transfers comes from statsManager.getFileTrackerManager().getActiveFileTrackers().count(). The number of pull threads is the upper bound of that count and is set by configuration (the default is 150). In a busy system, the two values should be the same. However, pull threads also perform operations such as creating and deleting directories, so the values may not always match. The number of active transfers correlates with the number of connections. |
Migrations | A summary of migrations and their states. |
Actions | The total number of actions outstanding such as transferring or deleting files. The summary includes the migration with the largest number of actions and the historic peak for actions for a migration. If the largest and the peak values are the same, the system may be accumulating a backlog of actions. If the largest value is less than the peak value, the system may be less busy than it was previously. More than a couple of hundred thousand actions is a busy system. If this queue continues to grow, the system is overloaded. |
Pending Regions | The total number of pending regions. This is a key indicator of load on the system. A pending region is part of the source filesystem that Data Migrator needs to rescan. Up to a couple of hundred thousand pending regions is acceptable. The average value is the average number of pending regions for each migration. The largest value is the migration with the most pending regions. The maximum value is the number of pending regions that migration has. Once a maximum value reaches one million, the migration is stopped automatically. When the number of pending regions grows too large, a notification warns you to reset the migration before it stops automatically. |
Failed Paths | The total number of items Data Migrator hasn't been able to migrate. |
File Transfer Retries | The number of times Data Migrator has attempted to migrate a file. By default, if Data Migrator fails to transfer a file, it tries 180 times to transfer it before it marks the file as failed. This metric should be stable or increasing slowly. If it increases rapidly, it may be an indication of files being too large for the target filesystem. |
Total Excluded Scan | The total number of files and bytes excluded from the initial scan across migrations. |
Total Iterated Scan | The total number of files, directories, and bytes transferred as part of the initial scan across migrations. |
Events Behind | The number of events on the NameNode not yet retrieved. If this number exceeds the number the NameNode retains, then all migrations will stop automatically. Ensure the NameNode is tuned according to the Data Migrator recommendations. |
Events Queued | The queue of events that still need to be processed into pending regions and actions. |
Number of Recently Active Migrations | For the last 10,000 file transfers: how many migrations transferred files, which migration transferred the most files, and how many files it transferred. Use this metric to understand if some migrations are not being supplied with data. |
Transferred File Size Percentiles | File size for the last 10,000 files broken down into percentiles. In the example above, all the files are the same size. Note that small files use low network bandwidth. |
Transferred File Transfer Rates Percentiles per Second | The transfer rates (for the last 10,000 transferred files) broken down into percentiles. If there is a large distribution of file sizes with many small sizes, the distribution of transfer rates is similar showing the correlation between file sizes and bandwidth usage. |
Active Total Transferred Bytes/Total File Size | The total file size Data Migrator is currently trying to transfer and how much of that it has completed. This provides an indication of whether files are small or large. |
Active File Size Percentiles | File size percentiles of files Data Migrator is currently trying to transfer. In the example above, there is only one file being transferred so the percentiles are all the same. Small files show a low bandwidth usage. |
Active File Transfer Rates Percentiles per Second | Transfer rate percentiles for files that are actively being transferred. In the example above, there is only one file being transferred so they are all the same. Small files have low transfer rates. This metric can indicate if there are connection problems. If all files are the same size and, therefore, the spread in file size percentiles is small, then the transfer rate percentiles should be the same. |
Source Event Latency Percentiles (hours:minutes:seconds) | How long an event was in the system before Data Migrator transferred the file, broken down into percentiles for the last 10,000 files transferred. This indicates how long it currently takes for a file created on the source to be transferred to the target. |
Total number of currently requeuing actions | A count of all the paths requeuing across all migrations. In the UI, this is called Total Active Retries. To view the number of requeues for a single migration, see the API endpoint http://127.0.0.1:18080/diagnostics/. |
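The Pending Regions guidance in the table above can be expressed as a simple check. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the value 200000 is an assumed reading of "a couple of hundred thousand", and the check is applied to the pending-region count of a single migration. Only the one-million automatic stop limit is stated in the table.

```shell
# ACCEPTABLE_MAX is an assumption (one reading of "a couple of hundred thousand").
# AUTO_STOP is the one-million limit stated in the table above.
ACCEPTABLE_MAX=200000
AUTO_STOP=1000000

# Hypothetical helper: classify the pending-region count of one migration.
# $1 = number of pending regions
classify_pending_regions() {
  count="$1"
  if [ "$count" -ge "$AUTO_STOP" ]; then
    echo "stop-threshold"   # the migration is stopped automatically at this point
  elif [ "$count" -gt "$ACCEPTABLE_MAX" ]; then
    echo "high"             # above the assumed acceptable range
  else
    echo "acceptable"
  fi
}

classify_pending_regions 50000     # prints acceptable
classify_pending_regions 500000    # prints high
classify_pending_regions 1000000   # prints stop-threshold
```

Adjust ACCEPTABLE_MAX to whatever value you and Customer Support consider acceptable for your environment.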
For more information, see the Knowledge base article Monitoring and troubleshooting.