On-premises Hadoop to Azure HDInsight
This is an outline of the steps needed to ready your environment for migrating data and metadata.
Time to complete: 1 hour (assuming all prerequisites are met).
Prerequisites
On-premises Hadoop cluster
Make sure all prerequisites are met for the source environment. These include:
- Network connectivity between your edge node and your Azure Data Lake Storage (ADLS) Gen2 storage container.
- If using an Azure SQL database with your HDInsight cluster, network connectivity between your edge node and this database.
Azure HDInsight cluster
For your target environment, make sure the following prerequisites are met:
- Your HDInsight cluster is using Azure Data Lake Storage (ADLS) Gen2 as its primary storage type.
- If using a default metastore, SSH access to an edge node on the HDInsight cluster. The edge node requires the following:
  - Hadoop Distributed File System (HDFS) and Hive client libraries installed.
  - A chosen port open for outbound connections (for example, 5552) to communicate with the Data Migrator service on the on-premises Hadoop edge node.
Install Data Migrator on your Hadoop edge node
Install Data Migrator on the edge node of your on-premises Hadoop cluster.
Configuration for data migrations
Add Hadoop Distributed File System (HDFS) as a source filesystem
If Kerberos is enabled on your Hadoop cluster, enter the Kerberos credentials for the HDFS superuser.
(CLI only) Check that HDFS on your on-premises Hadoop cluster is set as your source filesystem. If the filesystem shown is incorrect, delete it using `source delete` and configure the source manually. Ensure you include the `--source` parameter when adding the source filesystem manually.
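For example, the source could be reset in the Data Migrator CLI as follows. The filesystem ID and NameNode URI are illustrative placeholders, and parameter names can vary between Data Migrator versions, so confirm them with the CLI's built-in help:

```
source delete
filesystem add hdfs --file-system-id mySourceHdfs --default-fs hdfs://<namenode-host>:8020 --source
```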
Add Azure Data Lake Storage (ADLS) Gen2 storage as a target filesystem
Configure your ADLS Gen2 storage container as your target filesystem, using either the UI or the CLI. The CLI command depends on your authentication method:
- Using a service principal and OAuth 2 credentials: `filesystem add adls2 oauth`
- Using access key credentials: `filesystem add adls2 sharedKey`
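For example, in the Data Migrator CLI. The filesystem ID, storage account, container, and credential values are placeholders, and parameter names may differ between Data Migrator versions, so confirm them with the CLI's built-in help:

```
filesystem add adls2 oauth --file-system-id adls2Target --storage-account-name <account> --container-name <container> --oauth2-client-id <client-id> --oauth2-client-secret <client-secret> --oauth2-client-endpoint <token-endpoint>

filesystem add adls2 sharedKey --file-system-id adls2Target --storage-account-name <account> --container-name <container> --shared-key <access-key>
```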
Create path mapping for default Hive warehouse directory
Create a path mapping to ensure that data for managed Hive databases and tables is migrated to the default Hive warehouse directory for HDInsight clusters.
This lets you start using your source data and metadata on your HDInsight cluster immediately after migration, as it will be referenced correctly by your target metastore.
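As a sketch, assuming the common defaults of /apps/hive/warehouse on the source cluster and /hive/warehouse on HDInsight. The command and parameter names below are illustrative assumptions; confirm them with the CLI's built-in help:

```
path mapping create --path-mapping-id hiveWarehouse --source-path /apps/hive/warehouse --target adls2Target --target-path /hive/warehouse
```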
Configuration for metadata migrations
Add source hive agent
Configure the source hive agent to connect to the Hive metastore on the on-premises Hadoop cluster:
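For example, for a Kerberized cluster. The agent name, principal, and keytab path are placeholders, and parameter names may differ between Data Migrator versions, so confirm them with the CLI's built-in help:

```
hive agent add hive --name hiveAgent --kerberos-principal hive/_HOST@REALM.COM --kerberos-keytab /etc/security/keytabs/hive.service.keytab
```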
Check that the configuration for the hive agent is correct:
- UI: the agent will show a healthy connection.
- CLI example:

```
hive agent check --name hiveAgent
```
Add target hive agent
HDInsight can use either a default metastore or a custom metastore in the form of an Azure SQL database.
Choose one of the methods below depending on the type of metastore deployed in your HDInsight cluster.
Default metastore
Note: deploying a remote agent is only possible through the CLI.
Deploy and configure a remote hive agent:
Use the automated deployment parameters or follow the steps for manual deployment.
As mentioned in the prerequisites, enter a suitable edge node on your HDInsight cluster to deploy the hive agent service.
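An automated deployment from the CLI might look like the following. The host, SSH user, key path, and port are placeholders, and the parameter names are illustrative assumptions; confirm them with the CLI's built-in help:

```
hive agent add hive --name azureAgent --autodeploy --host <edge-node-host> --port 5552 --ssh-user sshuser --ssh-key /path/to/private-key
```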
Check that the configuration for the hive agent is correct:
- UI: the agent will show a healthy connection.
- CLI example:

```
hive agent check --name azureAgent
```
Custom metastore (Azure SQL database)
Configure a hive agent to connect to an Azure SQL database:
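For example (the server, database, and credential values are placeholders, and the parameter names are illustrative assumptions; confirm them with the CLI's built-in help):

```
hive agent add azure --name azureAgent --database-server-name <server>.database.windows.net --database-name <metastore-db> --database-user <user> --database-password <password>
```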
Check that the configuration for the hive agent is correct:
- UI: the agent will show a healthy connection.
- CLI example:

```
hive agent check --name azureAgent
```
Next steps
Start defining exclusions and migrating data. You can also create metadata rules and start migrating metadata.