On-premises Hadoop to Amazon S3 and AWS Glue
This is an outline of the steps needed to prepare your environment for the migration of data and metadata.
Time to complete: 1 hour (assuming all prerequisites are met).
Recommended technical knowledge
- Linux operating system
- Apache Hadoop administration
- Hadoop Distributed File System (HDFS)
- Apache Hive
- Amazon Web Services (AWS) service configuration and management
- Amazon Simple Storage Service (Amazon S3)
- AWS Glue
Prerequisites
On-premises Hadoop cluster
Ensure that all prerequisites for the source environment are met.
Network connectivity between your edge node and Amazon S3 and AWS Glue. These are some of the options available, depending on your use case:
- AWS Site-to-Site VPN - suitable for small/medium/test migrations.
- AWS Direct Connect - suitable for larger migrations (up to 100 Gbps).
Ensure that all security best practices are taken into consideration when setting up either AWS Site-to-Site VPN or AWS Direct Connect.
Amazon S3 and AWS Glue
For your target environment, make sure you have the following:
- An AWS account.
- An Amazon S3 bucket.
- An AWS Glue Data Catalog instance.
- Internal network configuration between AWS Glue and Amazon S3, including DNS configuration for your VPC.
- An AWS Glue connection.
- If applicable, an AWS Glue crawler configured to crawl your Amazon S3 bucket.
AWS security
All AWS services should be secured using best practices. This is a summary of those practices and which services they apply to.
Amazon S3
All Amazon S3 buckets should adhere to AWS best practices for Amazon S3. These include the following:
Use IAM to grant access to Amazon S3 buckets.
Follow IAM security best practices when creating policies.
- Create an individual IAM user/role for access to the bucket (don't use the AWS root account).
- Follow the policy of least privilege to grant read and write access to the bucket for Data Migrator. This includes limiting access through bucket policies and access control lists.
- Limit the IAM policy to the minimal rules required for Data Migrator operations on the bucket, as shown in the example after this list:
- List available buckets
- Obtain bucket location
- List bucket objects
- Put, delete or retrieve objects from the bucket.
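As an illustration of these minimal rules, the following sketch creates an IAM policy scoped to a single bucket. The bucket name (`my-migration-bucket`) and policy name (`DataMigratorS3Access`) are placeholders for your own values; attach the resulting policy to the IAM user or role used by Data Migrator.
```
# Least-privilege policy for Data Migrator access to a single bucket.
# "my-migration-bucket" and "DataMigratorS3Access" are placeholder names.
cat > dm-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListAvailableBuckets",
      "Effect": "Allow",
      "Action": ["s3:ListAllMyBuckets"],
      "Resource": "*"
    },
    {
      "Sid": "BucketLocationAndListing",
      "Effect": "Allow",
      "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-migration-bucket"
    },
    {
      "Sid": "ObjectReadWriteDelete",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-migration-bucket/*"
    }
  ]
}
EOF

# Register the policy so it can be attached to the Data Migrator IAM user or role.
aws iam create-policy --policy-name DataMigratorS3Access --policy-document file://dm-s3-policy.json
```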
IAM access and secret keys are supported if you are unable to use IAM roles.
Use `filesystem update s3a` when rotating access keys to update Data Migrator with the new key IDs. The access and secret keys can be stored or referenced in a location used by the Default AWS Credentials Provider Chain. This class can be defined when using the `--credentials-provider` option. If using the Simple AWS Credentials Provider class, the access and secret keys will be stored in the Data Migrator database.
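For example, a key rotation might look like the following from the Data Migrator CLI. The `filesystem update s3a` command is referenced above; the `--file-system-id`, `--access-key`, and `--secret-key` parameter names are assumptions, so confirm the exact options in the CLI reference for your release.
```
# Run from the Data Migrator CLI after rotating the IAM access key.
# --file-system-id, --access-key, and --secret-key are assumed parameter names;
# verify them against the CLI reference before use.
filesystem update s3a --file-system-id myAWSBucket --access-key <new-access-key-id> --secret-key <new-secret-key>
```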
Enable server-side encryption for your Amazon S3 bucket.
Block public access to the Amazon S3 bucket unless you explicitly require it.
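The AWS CLI can apply both of these bucket settings. The sketch below enables default SSE-S3 encryption (use `aws:kms` with a key ID if you encrypt with KMS) and blocks all public access; the bucket name is a placeholder.
```
# Enable default server-side encryption on the bucket (SSE-S3 shown).
aws s3api put-bucket-encryption \
  --bucket my-migration-bucket \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# Block all forms of public access to the bucket.
aws s3api put-public-access-block \
  --bucket my-migration-bucket \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```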
AWS Glue
All AWS Glue instances should be configured using AWS security best practices for AWS Glue. These include the following:
Set up IAM permissions for Data Migrator to access AWS Glue.
IAM access and secret keys are supported if you are unable to use IAM roles.
Use `hive agent configure glue` when rotating access keys to update the Data Migrator metadata service with the new key IDs. The access and secret keys can be stored or referenced in a location used by the Default AWS Credentials Provider Chain. This class is the default when not using the `--credentials-provider` option. If you enter the Static Credentials Provider Factory class, the access and secret keys will be stored in the Data Migrator metadata service database.
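For example, rotating the keys used by the AWS Glue agent might look like the following from the Data Migrator CLI. Only the `hive agent configure glue` command is referenced above; the `--name`, `--access-key`, and `--secret-key` parameter names are assumptions, so confirm them in the CLI reference.
```
# Run from the Data Migrator CLI after rotating the IAM access key used by the Glue agent.
# --name, --access-key, and --secret-key are assumed parameter names; verify before use.
hive agent configure glue --name glueAgent --access-key <new-access-key-id> --secret-key <new-secret-key>
```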
AWS deployment
These are some of the options to consider before creating your Amazon S3 bucket and AWS Glue instance:
AWS costs and quotas
The following table lists the required and optional AWS services that are applicable to this use case:
Service | Required? | Pricing | Quotas |
---|---|---|---|
Amazon S3 | Yes | Amazon S3 pricing | Amazon S3 quotas |
AWS Glue | Yes | AWS Glue pricing | AWS Glue quotas |
Site-to-Site VPN | Optional | Site-to-Site VPN pricing | Site-to-Site VPN quotas |
Direct Connect | Optional | Direct Connect pricing | Direct Connect quotas |
Key Management Service (KMS) | Optional | KMS pricing | KMS quotas |
See AWS pricing for more general guidance.
Install Data Migrator on your Hadoop edge node
Install Data Migrator on the edge node of your on-premises Hadoop cluster.
Configuration for data migrations
Add Hadoop Distributed File System (HDFS) as source filesystem
Configure your on-premises HDFS as the source filesystem.
(CLI only) Validate that your on-premises HDFS is now set as your source filesystem:
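A sketch of this check from the Data Migrator CLI. The `filesystem show` command appears later in this guide; `filesystem list` and the filesystem ID shown here are assumptions, so adjust them to your environment and the CLI reference.
```
# List the configured filesystems and confirm the on-premises HDFS appears as the source.
# `filesystem list` and the filesystem ID are assumptions; verify against the CLI reference.
filesystem list
filesystem show --file-system-id onPremHdfs
```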
If the filesystem shown is incorrect, delete it using `source delete` and configure the source manually. Ensure you include the `--source` parameter when adding the source filesystem manually.
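A sketch of that manual route from the Data Migrator CLI. `source delete` and the `--source` parameter come from the text above; the `filesystem add hdfs` command and its `--file-system-id` and `--default-fs` parameters are assumptions, so check the CLI reference for the exact syntax.
```
# Remove the incorrectly detected source filesystem.
source delete

# Re-add the on-premises HDFS manually and mark it as the source.
# `filesystem add hdfs`, --file-system-id, and --default-fs are assumed names/parameters.
filesystem add hdfs --file-system-id onPremHdfs --default-fs hdfs://<nameservice-or-namenode>:8020 --source
```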
Add Amazon S3 bucket as target filesystem
Configure your Amazon S3 bucket as your target filesystem:
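A sketch of this step from the Data Migrator CLI. The `--credentials-provider` option and the Simple AWS Credentials Provider class are referenced in the security section above; the `filesystem add s3a`, `--file-system-id`, `--bucket-name`, `--access-key`, and `--secret-key` names are assumptions, so confirm them in the CLI reference.
```
# Add the S3 bucket as a target filesystem using access and secret keys.
# Command and parameter names other than --credentials-provider are assumptions; verify before use.
filesystem add s3a --file-system-id myAWSBucket --bucket-name my-migration-bucket --credentials-provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider --access-key <access-key-id> --secret-key <secret-key>
```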
Test the S3 bucket target
Data Migrator automatically tests the connection to any target filesystem added to ensure the details are valid and a migration can be created and run.
To check that the configuration for the filesystem is correct:
UI - the target will show a healthy connection.
CLI - the `filesystem show` command will show only a target that was successfully added. Example: `filesystem show --file-system-id myAWSBucket`
To test a migration to the S3 bucket, create a migration and run it to transfer data, then check that the data has arrived in its intended destination.
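A sketch of such a test. The Data Migrator `migration` commands and their parameters are assumptions, so confirm the exact syntax in the CLI reference; the final check uses the standard AWS CLI to list objects in the bucket (path and bucket name are placeholders).
```
# From the Data Migrator CLI: create and run a small test migration.
# The migration commands and parameters are assumptions; verify against the CLI reference.
migration add --migration-id testMigration --path /data/test --target myAWSBucket
migration run --migration-id testMigration

# From a shell with the AWS CLI configured: confirm the objects arrived in the bucket.
aws s3 ls s3://my-migration-bucket/data/test/ --recursive
```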
Create path mappings (optional)
Create path mappings to ensure that data for managed Hive databases and tables are migrated to an appropriate folder location on your Amazon S3 bucket.
This lets you start using your source data and metadata immediately after migration, as it will be referenced correctly by your AWS Glue crawler and/or AWS Glue Studio.
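As a purely illustrative sketch, a path mapping might redirect the default Hive warehouse directory to a prefix in the bucket. The `path mapping add` command and every parameter shown here are hypothetical placeholders used only to show the idea; use the path mapping syntax documented in the CLI reference.
```
# Hypothetical example only: map the managed Hive warehouse path on HDFS to a bucket prefix
# so migrated tables land where the AWS Glue crawler expects them.
# The command name and all parameters are placeholders; consult the CLI reference for the real syntax.
path mapping add --path-mapping-id hiveWarehouse --source-location /apps/hive/warehouse --target myAWSBucket --target-location /warehouse
```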
Configuration for metadata migrations
Add Apache Hive as source hive agent
Configure the source hive agent to connect to the Hive metastore on the on-premises Hadoop cluster:
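A sketch of this step from the Data Migrator CLI. Only `hive agent check` and `hive agent configure glue` appear elsewhere in this guide; the `hive agent add hive` command and its parameters are assumptions, so confirm them in the CLI reference. The metastore URI is a placeholder (9083 is the default Hive metastore Thrift port).
```
# Add a source hive agent that connects to the on-premises Hive metastore.
# `hive agent add hive` and its parameters are assumed names; verify against the CLI reference.
hive agent add hive --name sourceHiveAgent --metastore-uri thrift://<metastore-host>:9083
```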
Test the Apache Hive source hive agent
Data Migrator automatically tests the connection to any hive agent added to ensure the details are valid and a metadata migration can be created and run.
To check that the configuration for the hive agent is correct:
UI - the agent will show a healthy connection.
CLI - example: `hive agent check --name hiveAgent`
To test a metadata migration from the Apache Hive agent, create a metadata migration and run it to transfer metadata, then check that the metadata has arrived in its intended destination.
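A sketch of such a test from the Data Migrator CLI. The `hive migration` commands and parameters shown are assumptions, so confirm the exact syntax in the CLI reference.
```
# Create and start a small test metadata migration between the two hive agents.
# The commands and parameters are assumptions; verify against the CLI reference.
hive migration add --name testHiveMigration --source sourceHiveAgent --target glueAgent
hive migration start --name testHiveMigration
```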
Add AWS Glue as target hive agent
Configure a hive agent to connect to AWS Glue:
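A sketch of this step from the Data Migrator CLI. The `hive agent add glue` command and its parameters are assumptions (only `hive agent configure glue` and `hive agent check` appear in this guide), so confirm them in the CLI reference; the region and keys are placeholders.
```
# Add a target hive agent for the AWS Glue Data Catalog.
# `hive agent add glue` and its parameters are assumed names; verify against the CLI reference.
hive agent add glue --name glueAgent --aws-region <region> --access-key <access-key-id> --secret-key <secret-key>
```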
Test the AWS Glue target hive agent
Data Migrator automatically tests the connection to any hive agent added to ensure the details are valid and a metadata migration can be created and run.
To check that the configuration for the hive agent is correct:
UI - the agent will show a healthy connection.
CLI - example: `hive agent check --name hiveAgent`
To test a metadata migration to the AWS Glue target agent, create a metadata migration and run it to transfer metadata, then check that the metadata has arrived in the AWS Glue Data Catalog.
Troubleshooting
If a filesystem or hive agent cannot be added, Data Migrator will in most cases give you error messages to help you identify the issue.
If no data appears to have been transferred in either a migration or a metadata migration, check Data Migrator's notifications for errors. In most cases, these will give you the information you need to diagnose any problems.
In the event of a problem you cannot diagnose, contact WANdisco support.
Network architecture
The diagram is an example of Data Migrator architecture spanning two environments: on-premises and AWS Cloud.
On-premises
All migration activity, both reads and writes, goes through the Data Migrator service. Data transfer to AWS is over port 443 (HTTPS). Metadata transfer through the Hive Migrator functionality is over ports 6780 and 6781 (HTTP/HTTPS).
Interaction with Data Migrator is handled either through the WANdisco UI component (port 8081) or CLI (using the Data Migrator API port 18080). The CLI does not open any ports itself and acts as a client.
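To sanity-check these ports from the edge node, a quick sketch with standard Linux tools (the S3 endpoint and region are placeholders):
```
# Confirm outbound HTTPS (443) from the edge node to the Amazon S3 endpoint for your bucket's region.
nc -zv s3.<region>.amazonaws.com 443

# Confirm the Data Migrator UI (8081), API (18080), and Hive Migrator (6780/6781) ports are listening locally.
ss -ltn | grep -E ':(8081|18080|6780|6781)\b'
```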
AWS Cloud
- The WAN connection to AWS from the source environment (see AWS Site-to-Site VPN and AWS Direct Connect).
- A VPC and subnet (see Working with VPCs and subnets) that are configured with access to the underlying storage and metastore, and the necessary external connectivity to the source environment.
- The IAM role and associated permissions for access to resources.
- The underlying storage (Amazon S3 bucket) and metastore (AWS Glue Data Catalog).
- The AWS Key Management Service configured to encrypt both the Amazon S3 bucket and AWS Glue instance.
Info: By default, S3 buckets are set as private to prevent unauthorized access. We strongly recommend that you read the following blog on the AWS support site for a good overview of this subject: Best practices for securing sensitive data in AWS data stores.
Next steps
Start defining exclusions and migrating data. You can also create metadata rules and start migrating metadata.