Version: 3.3 (latest)

Configure Parallel Scan

note

If you use the Parallel Scan feature and have feedback to share, contact us.

Data Migrator allows you to enable Parallel Scan for Target Match migrations from the UI and CLI.

How it works

Parallel Scan is an improved approach to scanning that traverses the directory tree in a different way. Previously, the scan descended from each root node until it reached the lowest-level leaf node, an approach sometimes referred to as depth-first scanning. Parallel Scan takes a breadth-first approach: each level of the tree is scanned completely before the scan moves to the next level down.

The two strategies can be summarised as follows:

Depth-First Scan: when a subdirectory is encountered, descend. Do not continue the scan of the parent directory until you emerge from that descent
Parallel Scan: when a subdirectory is encountered, add a new pending region for it and continue the scan of the parent directory
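
The Parallel Scan rule above can be illustrated with a small shell sketch that walks a local directory tree. This is not Data Migrator's implementation: the function name `scan_breadth_first` is hypothetical, the queue of pending directories stands in for pending regions, and the walk is single-threaded for simplicity.

```shell
# Illustrative sketch only; not Data Migrator's implementation.
# Prints every entry under the directory given as $1 in breadth-first order.
# Note: the * glob does not match dotfiles, so hidden entries are skipped.
scan_breadth_first() {
  # The positional parameters act as the queue of pending directories.
  set -- "$1"
  while [ "$#" -gt 0 ]; do
    dir=$1
    shift
    for entry in "$dir"/*; do
      # An unmatched glob stays literal, so skip entries that do not exist.
      [ -e "$entry" ] || continue
      printf '%s\n' "$entry"
      # Queue the subdirectory as pending work and keep scanning the parent.
      # Symbolic links are not followed, to avoid loops.
      if [ -d "$entry" ] && [ ! -L "$entry" ]; then
        set -- "$@" "$entry"
      fi
    done
  done
}
```

For a tree containing `a/`, `a/c/`, `a/f` and `b/`, this lists `a` and `b` before `a/c` and `a/f`, whereas a depth-first walk would list everything under `a` before reaching `b`.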

Parallel Scan makes it possible to introduce parallel processing into scanning, and it uses a global scan pool to help manage system resources.

What are the benefits

Improvement in scanning performance; faster completion times for your migrations
This method of scanning is normally more efficient in terms of RPC calls on the source and target
Stronger system resource management; helps to manage the scanning load on your system by improving control of resources on the NameNode
With this approach the number of threads that will be scanning is limited, so you have control over the load on your filesystem
This scan type enables the introduction of parallel processing so we can scan across many threads, even for a single migration
Parallel processing should increase the rescan speed for migrations that are already mostly consistent, addressing a key challenge if you encounter any error that forces you to perform a reset
There are significant benefits realised by breaking the problem of scanning into more manageable pieces, one of which is avoiding the risk of pending region bloat

Configuration of Parallel Scan

The following properties are available for configuration of the Parallel Scan feature via the API:

Name	Details
`global.scan.limit`	Integer size of the global scan pool from which migrations request resources; default value is 100
`migration.parallel.scan.percentage`	Percentage of the scan pool that a single migration will request; overrides the priority percentages set by `migration.priority.pool.allocations`
`migration.priority.pool.allocations`	Percentages of the scan pool that different migration priorities will request; default values are 2%, 4%, 8% for low, normal and high priorities respectively
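
As a rough worked example of how these properties combine, the sketch below computes how much of the pool a migration would request for a given pool size and percentage. The helper name `scan_pool_request` is hypothetical, the pool units are assumed to be scan threads, and rounding down is an assumption; the product may round differently.

```shell
# Hypothetical helper: size of the request made to the global scan pool.
# $1 = global.scan.limit (integer), $2 = percentage as an integer (8 means 8%).
# Assumption: fractional results are rounded down.
scan_pool_request() {
  echo $(( $1 * $2 / 100 ))
}

# With the default pool size of 100 and the default priority percentages:
scan_pool_request 100 2   # low priority
scan_pool_request 100 4   # normal priority
scan_pool_request 100 8   # high priority
```

With the defaults, the three calls print 2, 4 and 8, so under these assumptions a high priority migration would request 8 of the 100 units in the pool.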

caution

In certain circumstances, where there are hundreds of concurrent migrations with extensive filesystem tree structures, ensure that sufficient system resources are allocated for Parallel Scan. In particular, increasing the maximum heap size available to the Java Virtual Machine (JVM) is key; see the property JVM_MAX_MEM.

note

The advantage of setting these properties via the API is that no services need to be restarted. The values also persist when the Data Migrator service is restarted.

Setting Configuration Properties for Parallel Scan

The properties above can be set as shown in the following examples. Guidance on how to configure Parallel Scan follows the examples.

`global.scan.limit`

When Parallel Scan is enabled, migrations request resources from the global scan pool. This global property sets the overall size of that pool.

Example of how to set global.scan.limit to 50
curl -X 'PUT' \
  'http://127.0.0.1:18080/configuration/property?key=global.scan.limit&value=50' \
  -H 'accept: */*'

`migration.parallel.scan.percentage`

Migrations request a percentage of resources from the global scan pool based on the priority of that migration. This property lets you override that percentage for an individual migration.

Example of how to set migration.parallel.scan.percentage to 13%
curl -X 'PATCH' \
  'http://127.0.0.1:18080/migrations/my-migration-id?migration.parallel.scan.percentage=13%25' \
  -H 'accept: */*' \
  -H 'Content-Type: application/json' \
  -d '{}'

`migration.priority.pool.allocations`

Migrations are allocated resources from the global scan pool based on their priority. Generally the higher the priority, the more resources are requested from the global scan pool. This global property sets the default percentages requested by low, normal and high priority migrations respectively.

Example of how to set migration.priority.pool.allocations to 7%,36%,49%
curl -X 'PUT' \
  'http://127.0.0.1:18080/configuration/property?key=migration.priority.pool.allocations&value=7%25%2C36%25%2C49%25' \
  -H 'accept: */*'
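
The values in the request URLs above are percent-encoded: `%` becomes `%25` and `,` becomes `%2C`. The following sketch shows one way to produce the encoded form in a POSIX shell. The function name `urlencode` is hypothetical, and the helper assumes ASCII input.

```shell
# Hypothetical helper: percent-encode an ASCII string for use in a query string.
# Unreserved characters (letters, digits, . ~ _ -) are passed through unchanged.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character
    case $c in
      [A-Za-z0-9.~_-]) out="$out$c" ;;
      # printf "'c" yields the character code, formatted here as %XX.
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}
```

For example, `urlencode '7%,36%,49%'` prints `7%25%2C36%25%2C49%25`, which is the `value` used in the example above.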

Guidance on the Configuration of Parallel Scan

The following key points come from our own testing. Being aware of them will help you configure the Parallel Scan feature for your environment:

The effectiveness of Parallel Scan depends on the size of the dataset and the latency between source and target, so the performance benefits should be assessed on a per-migration basis. No general statement about Parallel Scan holds across the board, because each migration has different characteristics that affect how this approach behaves. In our own comparative testing, all key performance metrics improved in most cases.
Allocation from the global scan pool is percentage based and, by default, weighted by migration priority. Exercise caution when changing global default values to avoid potential problems with scalability and concurrency. Make any such changes in a controlled, gradual way so that their effect can be determined.
Default settings have been assessed in terms of stability, scalability, and reliability, and should provide tangible net improvements to base completion times. When tuning default values for the size of the global scan pool, for migration priority or for an individual migration, a maximum of 50 scan threads per migration should be sufficient in even the most extreme, latency-dependent use cases.
Avoid over-allocation from the global scan pool. While possible, this is unnecessary and inefficient. No hard limit typically applies to the number of threads per process; however, the number of threads is indirectly limited by available system resources such as memory. For edge cases with near-zero latency, it is possible for the original scan type (two-way scan) to match or even out-perform Parallel Scan. Rather than dialing up the settings as far as possible, so that each migration requests a large allocation from the global scan pool, a lower-risk strategy is to make small changes to each setting and test after each change.
Since Parallel Scan does not control the migration lifecycle, consider limiting the number of Running Migrations for the sake of efficiency, quality, and resource optimization. Parallel Scan should now provide some mitigation in this area, which was previously constrained.
Increasing the LDM Heap size (JVM_MAX_MEM) for Parallel Scan may be necessary to support environments that create a high volume of pending regions. The number of pending regions should not exceed the total number of directories (per migration). To avoid excessive memory pressure and the risk of an out-of-memory error, we recommend limiting the number of Running Migrations.
Parallel Scan with ChecksumActionPolicy enabled is not recommended at this time.
Scanner Statistics have been introduced via the API (/stats/scannerPoolStats), providing a range of scanner-related performance metrics. There is currently no runtime logging, so using them requires a metric collection script and offline analysis.
The impact on the source filesystem was also evaluated. The conclusion was that parallelism delivered a better return, with consumption levels remaining relatively low and no indications of stress. Note that Parallel Scan is expected to reduce redundant RPC calls.
In summary, Parallel Scan provides the percentage-based tunables needed to manage scanning load on system resources, with efficiencies related to the distributed nature of the data structure. Any tuning for your environment is fundamentally empirical, so it will require experimentation, driven by the evidence from your own testing, to arrive at your optimal solution.
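
The Scanner Statistics point above mentions a metric collection script. A minimal sketch of one is shown below. It assumes that /stats/scannerPoolStats answers a plain GET request without authentication on the same host and port as the earlier examples; the function name `collect_scanner_stats` and the tab-separated output format are hypothetical choices made for illustration.

```shell
# Hypothetical collector: polls a stats URL and appends timestamped samples
# to a file for offline analysis.
# $1 = URL, $2 = output file, $3 = seconds between samples, $4 = sample count.
collect_scanner_stats() {
  url=$1
  out=$2
  interval=$3
  count=$4
  i=0
  while [ "$i" -lt "$count" ]; do
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    # Collapse the response onto one line so each sample is a single record.
    body=$(curl -s "$url" | tr -d '\n')
    printf '%s\t%s\n' "$ts" "$body" >> "$out"
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}
```

For example, `collect_scanner_stats 'http://127.0.0.1:18080/stats/scannerPoolStats' scanner-stats.tsv 30 120` would take a sample every 30 seconds for about an hour, assuming the endpoint behaves as described above.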
