Version: 2.4.3

View diagnostics

To get insights into a Data Migrator instance, you can use the UI or the CLI to view diagnostics.

note

We recommend that superusers work with Customer Support when using the Diagnostics page to correlate metric values across the available histograms, graphs, and information panels.

View diagnostics with the UI

To view diagnostics with the UI:

1. Go to your dashboard and select the product for which you want to view diagnostics.
2. Select Diagnostics in the left navigation bar.
3. Under Diagnostics Metrics, select the data duration from the dropdown list:
   - 1 hour
   - 1 day
   - 1 week
   - 30 days

Graphs in the UI

Graph	Description
Active file transfers/pull threads	Hover over this histogram to view the number of active file transfers at a specific point in time. The queue of migration actions is serviced by a set of threads called pull threads. By default there are 150 threads, so Data Migrator can execute up to 150 migration actions at a time. Migration actions include, for example, transferring or deleting files. For more information on how files are transferred, see the section Data transfers and migration actions in the Knowledge base article How does Data Migrator work?.

Remote Procedure Call diagnostics

Monitor historic values for Remote Procedure Call (RPC) metrics. See RPC diagnostics.

View diagnostics with the CLI

You can view diagnostics from your terminal either as a one-off snapshot or as a continually updated view.

CLI

To get a diagnostics summary in text format, go to the CLI and run the command:

curl http://127.0.0.1:18080/diagnostics/summary.txt

To view diagnostics continually updated, go to the CLI and run the command:

watch curl -s http://127.0.0.1:18080/diagnostics/summary.txt
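For scripted collection, the same endpoint can be read from Python instead of curl. The following is a minimal sketch, assuming the endpoint returns plain UTF-8 text as the curl examples suggest. The function name `fetch_summary`, its `opener` parameter, and the 10-second timeout are illustrative choices, not part of Data Migrator.

```python
import urllib.request

# URL taken from the curl example above; adjust host and port for your instance.
SUMMARY_URL = "http://127.0.0.1:18080/diagnostics/summary.txt"


def fetch_summary(url=SUMMARY_URL, timeout=10, opener=urllib.request.urlopen):
    """Return the diagnostics summary as a string.

    `opener` is a hypothetical hook that lets the function be exercised
    without a running Data Migrator instance.
    """
    with opener(url, timeout=timeout) as response:
        # Assumption: the body is UTF-8 text; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")
```

Against a running instance, `print(fetch_summary())` should print the same text as the first curl command. Calling it in a loop with a delay approximates the `watch` variant.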

Data Migrator CLI

To get a diagnostics summary with the Data Migrator CLI, run the command:

status --diagnostics

To view diagnostics continually updated with the Data Migrator CLI, run the command:

status --diagnostics --watch

Example diagnostics summary

The following is an example of a diagnostics summary output from the CLI.

Refer to the table below to understand the metrics.

Time: 2022/02/03 13:51:48
Uptime: 0 Days 0 Hours 1 Minutes 44 Seconds
SystemCpuLoad: 0.3673 ProcessCpuLoad: 0.0691
Linux CPU IO wait time percentage since last collection: 0.2%
JVM GcCount: 11 GcPauseTime: 1 s (1282 ms)
OS Connections: 4, Tx: 946.7 KiB, Rx: 5.6 MiB, Retransmit: 0
Transfer Bytes (10/30/300s): 0.08 Gib/s, 0.06 Gib/s, 0.01 Gib/s
Transfer Files (10/30/300s): 0.00/s 0.37/s 0.07/s
Active Transfers/pull.threads: 1/150
Migrations: 0 RUNNING, 1 LIVE, 0 STOPPED
Actions Total: 1, Largest: "string" 1, Peak: "string" 25
PendingRegions Total: 0 Avg: 0, Largest: "string" Max: 0
FailedPaths Total: 0 Avg: 0, Largest: "string" 0
File Transfer Retries Total: 0 Avg: 0, Largest: "string" 0
Total Excluded Scan files/dirs/bytes: 0, 0, 0 B
Total Iterated Scan files/dirs/bytes: 23, 2, 550.1 MB
EventsBehind Current/Avg/Max: 0/0/0, RPC Time Avg/Max: 4/54
EventsQueued: 0, Total Events Added: 5
No. Recently Active Migrations: 1, Busiest Migration: "string" 23/23
Transferred File Size Percentiles:
10.5 MB, 10.5 MB, 10.5 MB, 10.5 MB, 10.5 MB, 10.5 MB, 10.5 MB, 10.5 MB, 10.5 MB, 13.6 MB
Transferred File Transfer Rates Percentiles per Second:
2.3 MB, 2.4 MB, 2.7 MB, 3.0 MB, 3.0 MB, 3.2 MB, 3.3 MB, 3.6 MB, 4.4 MB, 5.4 MB
Active Total Transferred Bytes/Total File Size: 260.0 MB/323.4 MB
Active File Size Percentiles:
323.4 MB, 323.4 MB, 323.4 MB, 323.4 MB, 323.4 MB, 323.4 MB, 323.4 MB, 323.4 MB, 323.4 MB, 323.4 MB
Active File Transfer Rates Percentiles per Second:
10.8 MB, 10.8 MB, 10.8 MB, 10.8 MB, 10.8 MB, 10.8 MB, 10.8 MB, 10.8 MB, 10.8 MB, 10.8 MB
Source Event Latency Percentiles (hours:minutes:seconds):
000:00:20, 000:00:21, 000:00:21, 000:00:21, 000:00:22, 000:00:22, 000:00:22, 000:00:22, 000:00:23, 000:00:23
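
To track a few of these values over time, the summary text can be parsed with regular expressions. This is a sketch only: it assumes the line formats match the example output above, which may differ between Data Migrator versions. The function name `parse_summary` and the dictionary keys are illustrative.

```python
import re

# A few lines copied from the example summary above.
SAMPLE = """\
SystemCpuLoad: 0.3673 ProcessCpuLoad: 0.0691
OS Connections: 4, Tx: 946.7 KiB, Rx: 5.6 MiB, Retransmit: 0
Active Transfers/pull.threads: 1/150
Migrations: 0 RUNNING, 1 LIVE, 0 STOPPED
"""


def parse_summary(text):
    """Extract a handful of headline metrics; lines that are absent are skipped."""
    metrics = {}

    m = re.search(r"^SystemCpuLoad:\s*([\d.]+)\s+ProcessCpuLoad:\s*([\d.]+)", text, re.MULTILINE)
    if m:
        metrics["system_cpu_load"] = float(m.group(1))
        metrics["process_cpu_load"] = float(m.group(2))

    m = re.search(r"Retransmit:\s*(\d+)", text)
    if m:
        metrics["retransmit"] = int(m.group(1))

    m = re.search(r"^Active Transfers/pull\.threads:\s*(\d+)/(\d+)", text, re.MULTILINE)
    if m:
        metrics["active_transfers"] = int(m.group(1))
        metrics["pull_threads"] = int(m.group(2))

    m = re.search(r"^Migrations:\s*(\d+) RUNNING,\s*(\d+) LIVE,\s*(\d+) STOPPED", text, re.MULTILINE)
    if m:
        metrics["running"] = int(m.group(1))
        metrics["live"] = int(m.group(2))
        metrics["stopped"] = int(m.group(3))

    return metrics
```

For the sample lines, `parse_summary(SAMPLE)` reports 1 active transfer out of 150 pull threads and one live migration.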

Understand diagnostic metrics

The following table lists the metrics in the diagnostics summary and explains what they are for:

Metric	Description
Time	The current time.
Uptime	The length of time Data Migrator has been running. If Data Migrator panics, it restarts automatically up to five times, so an unexpectedly short uptime can indicate that a panic occurred.
System CPU Load	If the value of this metric approaches one (1), the system is CPU-bound. This is often due to Transport Layer Security (TLS) load. If you have configured S3 as a target, you can switch to the WildFly library.
IO Wait	The percentage of time the system is stalled writing to disk. When the system is stalled, you see inconsistent transfer graphs. If you have configured S3 as a target, the SDK caches data to disk resulting in heavy disk usage. To improve performance, you can configure the system to use Virtual Memory. If you use a spinning disk and transfer small files, the IO wait is high as large numbers of small files require a lot of Data Migrator database updates.
Linux Pressure	This metric includes the IO wait and CPU load, and appears only on systems running later kernels.
JVM GcCount	The number of times Java has garbage collected and the total time spent collecting garbage. If the system has insufficient Java heap, it spends a lot of time collecting garbage and stalls. Garbage collection should take milliseconds. Check the garbage collection logs for the time spent collecting garbage.
OS Connections	The number of connections depends on the target and SDK configuration. The operating system (OS) maintains receive and transmit buffers (Rx/Tx). If the OS has large transmit buffers (tens to hundreds of MBs), Data Migrator is supplying the OS with sufficient data to transfer, so the issue is not with Data Migrator.
Retransmit	The number of retransmits for the current connections, which should remain close to zero (0). Retransmits indicate stalls on the connections and poor network performance.
Transfer Bytes	The number of bytes transferred within a timeframe. The smallest timeframe is 10 seconds. The transfer bytes metric is measured at the application layer and there is buffering in the SDK and in the OS.
Transfer Files Rate	The rate at which files are transferred. Data Migrator transfers multiple files in parallel. If you transfer large files, the rate is low. If the number of transfer bytes is low, check the transfer files rate metric value. If the transfer files rate is high, Data Migrator is probably migrating small files and can't fill the network pipe.
Active Transfers/Pull Threads	The number of files Data Migrator is migrating actively and the number of files it can migrate in parallel. The count of active transfers comes from `statsManager.getFileTrackerManager().getActiveFileTrackers().count()`. The value of pull threads is the upper bound of that number and is set by configuration (defaults to 150). These metrics should be the same in a busy system. Note that pull threads also perform operations such as creating and deleting directories, so these metrics may not always be the same. There is a correlation between active transfers and the number of connections.
Migrations	A summary of migrations and their states.
Actions	The total number of outstanding actions, such as transferring or deleting files. The summary includes the migration with the largest number of actions and the historic peak number of actions for a migration. If the largest and the peak values are the same, the system may be accumulating a backlog of actions. If the largest value is less than the peak value, the system may be less busy than it was previously. More than a couple of hundred thousand actions indicates a busy system. If this queue continues to grow, the system is overloaded.
Pending Regions	The total number of pending regions. This is a key indicator of load on the system. A pending region is part of the source filesystem that Data Migrator needs to rescan. Up to a couple of hundred thousand pending regions is acceptable. The average value is the average number of pending regions for each migration. The largest value is the migration with the most pending regions. The maximum value is the number of pending regions that migration has. Once the maximum value reaches one million, the migration is stopped automatically. Before that happens, notifications warn you to reset the migration when the number of pending regions grows too large.
Failed Paths	The total number of items Data Migrator hasn't been able to migrate.
File Transfer Retries	The number of times Data Migrator has attempted to migrate a file. By default, if Data Migrator fails to transfer a file, it tries 180 times to transfer it before it marks the file as failed. This metric should be stable or increasing slowly. If it increases rapidly, it may be an indication of files being too large for the target filesystem.
Total Excluded Scan	The total number of files, directories, and bytes excluded from the initial scan across migrations.
Total Iterated Scan	The total number of files, directories, and bytes Data Migrator transferred as part of the initial scan across migrations.
Events Behind	The number of events on the NameNode not yet retrieved. If this number exceeds the number the NameNode retains, then all migrations will stop automatically. Ensure the NameNode is tuned according to the Data Migrator recommendations.
Events Queued	The number of queued events that still need to be processed into pending regions and actions.
Number of Recently Active Migrations	How many migrations transferred files, which migration transferred the most files, and how many (in the last 10,000 file transfers). This can be used to understand if some migrations are not being supplied with data.
Transferred File Size Percentiles	File size for the last 10,000 files broken down into percentiles. In the example above, all the files are the same size. Note that small files use low network bandwidth.
Transferred File Transfer Rates Percentiles per Second	The transfer rates (for the last 10,000 transferred files) broken down into percentiles. If there is a large distribution of file sizes with many small sizes, the distribution of transfer rates is similar showing the correlation between file sizes and bandwidth usage.
Active Total Transferred Bytes/Total File Size	The total file size Data Migrator is currently trying to transfer and how much of that it has completed. This provides an indication of whether files are small or large.
Active File Size Percentiles	File size percentiles of files Data Migrator is currently trying to transfer. In the example above, there is only one file being transferred so the percentiles are all the same. Small files show a low bandwidth usage.
Active File Transfer Rates Percentiles per Second	Transfer rate percentiles for files that are actively being transferred. In the example above, there is only one file being transferred so they are all the same. Small files have low transfer rates. This metric can indicate if there are connection problems. If all files are the same size and, therefore, the spread in file size percentiles is small, then the transfer rate percentiles should be the same.
Source Event Latency Percentiles (hours:minutes:seconds)	How long the event was in the system before Data Migrator transferred the file, broken down into percentiles for the last 10,000 files transferred. This indicates how long it takes for a file created on the source to be transferred to the target.
Total number of currently requeuing actions	This is a count of all the paths requeuing across all migrations. In the UI, this is called Total Active Retries. To view the number of requeues for a single migration, see the API endpoint `http://127.0.0.1:18080/diagnostics/`.
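
Some of the guidance in the table can be turned into simple automated checks. The sketch below is illustrative only. The one million pending-region limit comes from the table, while the 0.9 CPU cutoff (for "approaches one") and the 200,000 cutoff (for "a couple of hundred thousand") are assumptions chosen for this example. The function name `diagnose` and its parameters are hypothetical.

```python
CPU_BOUND_CUTOFF = 0.9                  # assumed reading of "approaches one (1)"
PENDING_REGIONS_HIGH = 200_000          # assumed reading of "a couple of hundred thousand"
PENDING_REGIONS_AUTO_STOP = 1_000_000   # stated in the Pending Regions row


def diagnose(system_cpu_load, retransmits, actions_largest, actions_peak, pending_regions_max):
    """Return a list of findings based on the guidance in the metrics table."""
    findings = []
    if system_cpu_load >= CPU_BOUND_CUTOFF:
        findings.append("CPU-bound")
    if retransmits > 0:
        findings.append("network retransmits")
    # Largest equal to the historic peak may mean a backlog is building.
    if actions_largest > 0 and actions_largest == actions_peak:
        findings.append("possible action backlog")
    if pending_regions_max >= PENDING_REGIONS_AUTO_STOP:
        findings.append("migration at the automatic stop limit")
    elif pending_regions_max > PENDING_REGIONS_HIGH:
        findings.append("high pending regions")
    return findings
```

With the values from the example summary (CPU load 0.3673, no retransmits, largest 1 against a peak of 25, no pending regions), `diagnose(0.3673, 0, 1, 25, 0)` returns an empty list. Treat the findings as prompts for investigation rather than definitive diagnoses.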

For more information, see the Knowledge base article Monitoring and troubleshooting.
