Configure an Amazon S3 source
You can migrate data from Amazon Simple Storage Service (Amazon S3) by configuring a bucket as a source filesystem.
Follow these steps to create an Amazon S3 bucket as a source using either the UI or CLI.
Prerequisites
You need the following:
An Amazon S3 bucket. See the Amazon S3 bucket documentation.
Authentication details for your bucket. See below for more information.
If you're configuring your own SQS queue in AWS for live replication with Data Migrator, the queue must be attached to the S3 bucket:
- Ensure the queue has an Access Policy applied allowing your bucket to send messages to the queue. See grant destinations permissions to s3
- Ensure the bucket has an Event notification to the destination queue with all event types required for live replication. See enable event notifications.
For live replication, enable the following event types:
- Object creation: Select All object create events, or select individually: Put, Post, Copy, and Multipart upload completed.
- Object removal: Select All object removal events, or select individually: Permanently delete and Delete marker created.
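If you manage the queue yourself, the event notification described above can also be attached with the AWS CLI. The following is a sketch only; the bucket name, queue ARN, and notification ID are placeholders, and `s3:ObjectCreated:*` / `s3:ObjectRemoved:*` cover all of the individual event types listed above.

```shell
# Sketch: attach an event notification for all object create and remove
# events to an existing SQS queue. Bucket name and queue ARN are placeholders.
aws s3api put-bucket-notification-configuration \
  --bucket my-source-bucket \
  --notification-configuration '{
    "QueueConfigurations": [
      {
        "Id": "data-migrator-live-events",
        "QueueArn": "arn:aws:sqs:us-east-1:111122223333:my-migration-queue",
        "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
      }
    ]
  }'
```

The queue's access policy must already allow the bucket to send messages, or the call fails with an InvalidArgument error.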
When migrating data with Amazon S3 as a source, data contained in paths with two or more consecutive forward slashes can't be replicated.
When using Amazon S3 as a source, don't include the SQS initialization path (sqs-init-path/) in any migration. Doing so prevents Data Migrator from progressing subsequent migrations to a Live status.
Configure Amazon S3 as a source with the UI
From the Dashboard, select an instance under Instances.
In the Filesystems & Agents menu, select Filesystems.
Select Add source filesystem.
Select Amazon S3 from the Filesystem Type dropdown list.
Enter the following details:
Display Name - The name you want to give your source filesystem.
Bucket Name - The reference name of the Amazon S3 bucket you are using.
Authentication Method - The Java class name of a credentials provider for authenticating with the S3 endpoint.
The Authentication Method options available include:
Access Key and Secret
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
Use this provider to enter credentials as an access key and secret access key with the following entries:
Access Key - Enter the AWS access key. For example, RANDOMSTRINGACCESSKEY. If you have configured a Vault for secrets storage, use a reference to the value stored in your secrets store.
Secret Key - Enter the secret key that corresponds with your access key. For example, RANDOMSTRINGPASSWORD. If you have configured a Vault for secrets storage, use a reference to the value stored in your secrets store.
AWS Identity and Access Management
com.amazonaws.auth.InstanceProfileCredentialsProvider
Use this provider if you're running Data Migrator on an EC2 instance that has been assigned an IAM role with policies that allow it to access the S3 bucket.
AWS Hierarchical Credential Chain
com.amazonaws.auth.DefaultAWSCredentialsProviderChain
A commonly used credentials provider chain that looks for credentials in this order:
1. Environment variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or AWS_ACCESS_KEY and AWS_SECRET_KEY.
2. Java system properties - aws.accessKeyId and aws.secretKey.
3. Web Identity Token credentials from the environment or container.
4. Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI.
5. Credentials delivered through the Amazon EC2 container service if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable.
6. Instance profile credentials delivered through the Amazon EC2 metadata service.
Environment Variables
com.amazonaws.auth.EnvironmentVariableCredentialsProvider
Use this provider to enter an access key and a secret access key as either AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or AWS_ACCESS_KEY and AWS_SECRET_KEY.
EC2 Instance Metadata Credentials
com.amazonaws.auth.InstanceProfileCredentialsProvider
Use this provider if you need instance profile credentials delivered through the Amazon EC2 metadata service.
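As a sketch of the Environment Variables provider, the credentials are read from the environment of the Data Migrator service process. The values below are placeholders, not working credentials:

```shell
# Placeholder credentials: export these in the environment of the Data
# Migrator service before selecting EnvironmentVariableCredentialsProvider.
export AWS_ACCESS_KEY_ID="RANDOMSTRINGACCESSKEY"
export AWS_SECRET_ACCESS_KEY="RANDOMSTRINGPASSWORD"

# The older variable names are also recognized:
#   export AWS_ACCESS_KEY=... ; export AWS_SECRET_KEY=...

# Confirm the variables are visible to child processes:
echo "$AWS_ACCESS_KEY_ID"
```

Remember that variables exported in an interactive shell are not automatically visible to a service started by systemd; set them in the service's environment configuration instead.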
Profile Credentials Provider
com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider
Use this provider to enter a custom profile configured to access Amazon S3 storage. You can find AWS credential information in a local file named credentials in a folder named .aws in your home directory.
Enter an AWS Named Profile and a Credentials File Path. For example, ~/.aws/credentials.
For more information, see Using the AWS Credentials File and Credential Profiles.
Custom Provider Class
Use this if you want to enter your own class for the credentials provider.
JCEKS Keystore
This authentication method uses an access key and a secret key for Amazon S3 contained in a Java Cryptography Extension KeyStore (JCEKS). The keystore must contain key/value pairs for the access key fs.s3a.access.key and the secret key fs.s3a.secret.key.
Info: You must configure HDFS as a target to be able to select JCEKS Keystore. The HDFS resource must exist on the same Data Migrator instance as the Amazon S3 filesystem you're adding. Because of this dependency, be aware of Backup and Restore limitations before performing a backup with this configuration.
JCEKS HDFS - Select the HDFS filesystem where your JCEKS file is located.
JCEKS Keystore Path - Enter the path containing the JCEKS keystore. For example, jceks://hdfs@nameservice01:8020/aws/credentials/s3.jceks.
JCEKS on HDFS with Kerberos - You must add the dfs.namenode.kerberos.principal.pattern configuration property. Include the following steps when you add an HDFS source or target with Kerberos:
1. Under Additional Configuration, select Configuration Property Overrides from the dropdown.
2. Select + Add Key/Value Pair and add the key dfs.namenode.kerberos.principal.pattern with the value *.
3. Select Save, then restart Data Migrator.
Note: If you remove filesystems configured with JCEKS authentication, remove any Amazon S3 filesystems before you remove an HDFS source.
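A keystore with the two required aliases can be created ahead of time with Hadoop's credential command. This is a sketch that assumes a configured Hadoop client and reuses the example keystore path from above; the key values are placeholders.

```shell
# Sketch: store the S3 access and secret keys under the aliases that the
# JCEKS Keystore authentication method expects. Values are placeholders.
hadoop credential create fs.s3a.access.key \
  -value RANDOMSTRINGACCESSKEY \
  -provider jceks://hdfs@nameservice01:8020/aws/credentials/s3.jceks

hadoop credential create fs.s3a.secret.key \
  -value RANDOMSTRINGPASSWORD \
  -provider jceks://hdfs@nameservice01:8020/aws/credentials/s3.jceks

# Verify both aliases are present in the keystore:
hadoop credential list \
  -provider jceks://hdfs@nameservice01:8020/aws/credentials/s3.jceks
```

Omitting -value makes the command prompt for the secret interactively, which keeps it out of your shell history.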
S3 Service Endpoint - The endpoint for the source AWS S3 bucket. See --endpoint in the S3A parameters.
Simple Queue Service (SQS) Endpoints (Optional)
Data Migrator listens to the event queue to continually migrate changes from source file paths to target filesystems.
If you add an S3 source, you have three options for the queue:
- Add the source without a queue. Data Migrator creates a queue automatically. If you want Data Migrator to create its own queue, ensure your account has the necessary permissions to create and manage SQS queues and attach them to S3 buckets.
- Add the source and enter a queue but no endpoint. This allows you to use a queue that exists in a public endpoint. If you define your own queue, the queue must be attached to the S3 bucket. For more information about adding queues to buckets, see the AWS documentation.
- Add the source and enter a queue and a service endpoint. The endpoint can be a public or a private endpoint. For more information about public endpoints, see the Amazon SQS endpoints documentation.
Queue - Enter the name of your SQS queue. This field is mandatory if you enter an SQS endpoint.
Endpoint - Enter the URL that you want Data Migrator to use. If you're using a Virtual Private Cloud (VPC), you must enter an endpoint.
Note: You can set an Amazon Simple Notification Service (Amazon SNS) topic as the delivery target of the S3 event. Ensure you enable raw message delivery when you subscribe the SQS queue to the SNS topic. For more information, see the Amazon SNS documentation.
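If you route S3 events through an SNS topic, raw message delivery can be enabled at subscription time with the AWS CLI. This is a sketch; the topic and queue ARNs are placeholders.

```shell
# Sketch: subscribe the SQS queue to the SNS topic with raw message
# delivery enabled, so the queue receives unwrapped S3 event JSON
# rather than SNS envelope messages.
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:111122223333:my-s3-events-topic \
  --protocol sqs \
  --notification-endpoint arn:aws:sqs:us-east-1:111122223333:my-migration-queue \
  --attributes '{"RawMessageDelivery": "true"}'
```

For an existing subscription, the same attribute can be changed with aws sns set-subscription-attributes.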
Migration events expire after 14 days
Data Migrator uses SQS messages to track changes to an S3 source filesystem. The maximum retention time for SQS messages is 14 days, which means events are lost after that time and can't be read by a migration. If you haven't used Data Migrator or have paused your S3 migrations for 14 days, we recommend you reset your S3 migrations.
Purge your SQS queue
Your SQS queue starts to capture events as soon as it's created and live. After queue creation, it may capture irrelevant events up to the time you start your first migration. Because Data Migrator must consume these events, we recommend you purge your SQS queue just before first use.
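Purging can be done from the SQS console or, as sketched below with a placeholder queue URL, with the AWS CLI:

```shell
# Sketch: discard any messages captured before the first migration starts.
# Purging is asynchronous and can take up to 60 seconds to complete.
aws sqs purge-queue \
  --queue-url https://sqs.us-east-1.amazonaws.com/111122223333/my-migration-queue
```

Only purge before the first migration starts; purging a queue that a live migration depends on discards unprocessed change events.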
S3A Properties (Optional) - Override properties or enter additional properties by adding key/value pairs.
Filesystem Options
Live Migration - After existing data is moved, changes made to the source filesystem are migrated in real time using an SQS queue.
One-Time Migration - Existing data is moved to the target, after which the migration is complete and no further changes are migrated.
If a file is removed from the source after a one-time migration has completed the initial scan, subsequent rescans will not remove the file from the target. Rescanning will only update existing files or add new ones; it will not remove anything from the target.
One exception to this is the removal of a partition. Removing a partition is an action taken in Hive, and the metadata change will be replicated live by Hive Migrator. The data under that partition will remain on the target regardless of whether it has been deleted on the source. However, since the partition was removed in the metadata, the data inside won't be visible to queries on the target.
Configure Amazon S3 as a source with the CLI
Add an Amazon S3 filesystem
Use the filesystem add s3a
command with the following parameters:
filesystem add s3a [--file-system-id] string
[--bucket-name] string
[--endpoint] string
[--access-key] string
[--secret-key] string
[--sqs-queue] string
[--sqs-endpoint] string
[--credentials-provider] string
[--source]
[--scan-only]
[--properties-files] list
[--properties] string
[--s3type] string
[--bootstrap.servers] string
[--topic] string
For guidance about access, permissions, and security when adding an Amazon S3 bucket as a source filesystem, see Security best practices in IAM.
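For example, a minimal live source using access key and secret authentication might look like the following sketch; the filesystem ID, bucket name, and keys are hypothetical placeholders.

```shell
# Sketch: add an S3 bucket as a live source with simple credentials.
filesystem add s3a --file-system-id mysource \
  --bucket-name my-source-bucket \
  --credentials-provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
  --access-key RANDOMSTRINGACCESSKEY \
  --secret-key RANDOMSTRINGPASSWORD \
  --s3type aws \
  --source
```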
S3A mandatory parameters
--file-system-id - The ID for the new filesystem resource.
--bucket-name - The name of your S3 bucket.
--credentials-provider - The Java class name of a credentials provider for authenticating with the Amazon S3 endpoint. In the UI, this is called Credentials Provider. The provider options include:
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
Use this provider to offer credentials as an access key and secret access key with the --access-key and --secret-key parameters.
com.amazonaws.auth.InstanceProfileCredentialsProvider
Use this provider when running Data Migrator on an Elastic Compute Cloud (EC2) instance that has an IAM role assigned with policies to access the Amazon S3 bucket.
com.amazonaws.auth.DefaultAWSCredentialsProviderChain
A commonly used credentials provider chain that looks for credentials in this sequence:
1. Environment variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or AWS_ACCESS_KEY and AWS_SECRET_KEY.
2. Java system properties - aws.accessKeyId and aws.secretKey.
3. Web Identity Token credentials from the environment or container.
4. Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI.
5. Credentials delivered through the Amazon EC2 container service if the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable is set and the security manager has permission to access the variable.
6. Instance profile credentials delivered through the Amazon EC2 metadata service.
com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider
This provider supports the use of multiple AWS credentials, which are stored in a credentials file.
When adding a source filesystem, use the following properties:
awsProfile - Name of the AWS profile.
awsCredentialsConfigFile - Path to the AWS credentials file. The default path is ~/.aws/credentials.
For example:
filesystem add s3a --file-system-id testProfile1Fs --bucket-name profile1-bucket --credentials-provider com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider --properties awsProfile=<profile-name>,awsCredentialsConfigFile=</path/to/the/aws/credentials/file>
In the CLI, you can also use --aws-profile and --aws-config-file. For example:
filesystem add s3a --file-system-id testProfile1Fs --bucket-name profile1-bucket --credentials-provider com.wandisco.livemigrator2.fs.ExtendedProfileCredentialsProvider --aws-profile <profile-name> --aws-config-file </path/to/the/aws/credentials/file>
Learn more about using AWS profiles: Configuration and credential file settings.
S3A optional parameters
--access-key - When using the org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider credentials provider, enter the access key with this parameter.
--secret-key - When using the org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider credentials provider, enter the secret key with this parameter.
--endpoint - Enter a specific endpoint to access the bucket, such as an AWS PrivateLink endpoint. If you don't enter a value, the filesystem defaults to AWS. Using --endpoint supersedes fs.s3a.endpoint.
--sqs-queue - [Amazon S3 as a source only] Enter an SQS queue name. This field is required if you enter an SQS endpoint. If you provide a value for this parameter and create a migration from the Amazon S3 source, the migration is a live migration by default.
--sqs-endpoint - [Amazon S3 as a source only] Enter an SQS endpoint.
--source - Enter this parameter to add the filesystem as a source.
--scan-only - Enter this parameter to create a static source filesystem for one-time migrations. Requires --source.
--properties-files - Reference a list of existing properties files, each containing Hadoop configuration properties in the format used by core-site.xml or hdfs-site.xml.
--properties - Enter properties to use in a comma-separated key/value list.
--s3type - Enter the value aws.
For information on properties that are added by default for new S3A filesystems, see the Command reference s3a default properties.
For information on properties that you can customize for new S3A filesystems, see the Command reference s3a custom properties.
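Putting the optional SQS parameters together, a source that uses an existing queue and an explicit service endpoint might be added as in the following sketch; the filesystem ID, queue name, and endpoint URL are hypothetical.

```shell
# Sketch: S3 source with a user-managed SQS queue. Providing --sqs-queue
# makes migrations from this source live by default; --sqs-endpoint is
# required when the queue is reached through a private (e.g. VPC) endpoint.
filesystem add s3a --file-system-id mysource \
  --bucket-name my-source-bucket \
  --credentials-provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
  --sqs-queue my-migration-queue \
  --sqs-endpoint https://sqs.us-east-1.amazonaws.com \
  --s3type aws \
  --source
```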
Update the Amazon S3 filesystem
To update an existing filesystem, first stop all migrations associated with it. After saving updates, restart the Data Migrator service for your changes to take effect. In most supported Linux distributions, run the command service livedata-migrator restart.
Update the source filesystem with the following commands:
| Command | Action |
|---|---|
| source clear | Delete all sources |
| source delete | Delete one source |
| source show | View the source filesystem configuration |
When migrating files from an S3 source to an HDFS target, the user who writes the files is the file owner. In Data Migrator, this is the user that is mapped to the principal used to authenticate with the target.
S3 object stores don't retain owner (RWX) permissions. Anything migrated from an S3 object store to an HDFS target has rwxrwxrwx permissions.
Next steps
Configure a target filesystem to migrate data to. Then create a migration.