Version: 2.5.4

On-premises Hadoop to Azure HDInsight

This is an outline of the steps needed to prepare your environment for migrating data and metadata.

Time to complete: 1 hour (assuming all prerequisites are met).

Prerequisites

On-premises Hadoop cluster

Make sure all prerequisites are met for the source environment. These also include:

  • Network connectivity between your edge node and your Azure Data Lake Storage (ADLS) Gen2 storage container.
  • If your HDInsight cluster uses an Azure SQL Database, network connectivity between your edge node and that database.
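
The connectivity prerequisites above can be spot-checked from the on-premises edge node before you start. The following is a minimal sketch, not part of Data Migrator. It assumes the public Azure cloud, where ADLS Gen2 endpoints take the form `<account>.dfs.core.windows.net` on port 443 and Azure SQL Database servers take the form `<server>.database.windows.net` on port 1433. The names `mystorageaccount` and `mysqlserver` are placeholders, and a successful TCP connection does not prove that authentication will succeed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def adls_gen2_endpoint(storage_account: str) -> str:
    """Build the ADLS Gen2 (dfs) host name for a storage account (public Azure cloud)."""
    return f"{storage_account}.dfs.core.windows.net"


def azure_sql_endpoint(server_name: str) -> str:
    """Build the Azure SQL Database host name for a logical server (public Azure cloud)."""
    return f"{server_name}.database.windows.net"


if __name__ == "__main__":
    # Placeholder names (assumptions): replace with your own account and server.
    targets = [
        (adls_gen2_endpoint("mystorageaccount"), 443),
        (azure_sql_endpoint("mysqlserver"), 1433),
    ]
    for host, port in targets:
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} {status}")
```

Run the script on the edge node itself, because firewall rules and proxies often differ between hosts on the same network.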

Azure HDInsight cluster

For your target environment, make sure the following prerequisites are met:

  • Your HDInsight cluster uses Azure Data Lake Storage (ADLS) Gen2 as its primary storage type.
  • If using a default metastore, you have SSH access to an edge node on the HDInsight cluster.
    The edge node requires the following:
    • Hadoop Distributed File System (HDFS) and Hive client libraries installed.
    • A chosen port open for outbound connections (for example: 5552) to communicate with the Data Migrator service on the on-premises Hadoop edge node.
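
As a rough check of the client requirement above, you can look for the client commands on the HDInsight edge node. This is a minimal sketch that assumes the clients are exposed as `hdfs` and `hive` executables on the `PATH`. Those command names are assumptions, and finding them is only a proxy for the libraries being installed and correctly configured.

```python
import shutil


def missing_tools(tools, path=None):
    """Return the tools that cannot be found on PATH (or on `path`, if given)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    # Assumed client command names; adjust to what your distribution installs.
    missing = missing_tools(["hdfs", "hive"])
    if missing:
        print("Missing client tools:", ", ".join(missing))
    else:
        print("HDFS and Hive client tools found on PATH")
```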

Install Data Migrator on your Hadoop edge node

Install Data Migrator on your Hadoop edge node.

Configuration for data migrations

Add Hadoop Distributed File System (HDFS) as a source filesystem

  1. If Kerberos is enabled on your Hadoop cluster, enter the Kerberos credentials for the HDFS superuser on your Hadoop cluster:

  2. (CLI only) Check that HDFS on your on-premises Hadoop cluster is set as your source filesystem:

    source show

    If the filesystem shown is incorrect, delete it using source delete and configure the source manually:

    filesystem add hdfs

    Make sure you include the --source parameter when using the command above.

Add Azure Data Lake Storage (ADLS) Gen2 storage as a target filesystem

Configure your ADLS Gen2 storage container as your target filesystem. The method you use depends on how you authenticate to the storage account.

Create path mapping for default Hive warehouse directory

Create a path mapping to ensure that data for managed Hive databases and tables is migrated to the default Hive warehouse directory for HDInsight clusters.

This lets you start using your source data and metadata on your HDInsight cluster immediately after migration, because your target metastore will reference the data at the correct location.
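
To illustrate the idea behind a path mapping, the sketch below rewrites a source path prefix to a target prefix. It is a conceptual illustration only, not how Data Migrator implements mappings. The function name `apply_path_mapping` and the example paths `/apps/hive/warehouse` and `/hive/warehouse` are assumptions; the actual warehouse directories depend on your Hadoop distribution and HDInsight version.

```python
from typing import Optional


def apply_path_mapping(path: str, source_prefix: str, target_prefix: str) -> Optional[str]:
    """Rewrite `path` from source_prefix to target_prefix.

    Returns None when the path is not under source_prefix, so the mapping
    does not apply and the path would be migrated unchanged.
    """
    src = source_prefix.rstrip("/")
    # Match the prefix on whole path components only, so that
    # /apps/hive/warehouse_old is not treated as being under /apps/hive/warehouse.
    if path == src or path.startswith(src + "/"):
        return target_prefix.rstrip("/") + path[len(src):]
    return None


if __name__ == "__main__":
    # Hypothetical source and target warehouse directories.
    source_dir = "/apps/hive/warehouse"
    target_dir = "/hive/warehouse"
    print(apply_path_mapping("/apps/hive/warehouse/sales.db/orders", source_dir, target_dir))
```

With a mapping like this in place, a managed table stored under the source warehouse directory lands under the target warehouse directory, which is where the target metastore expects to find it.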

Configuration for metadata migrations

Add source hive agent

  1. Configure the source hive agent to connect to the Hive metastore on the on-premises Hadoop cluster:

  2. Check that the configuration for the hive agent is correct:

    • UI - the agent will show a healthy connection.

    • CLI

      Example
      hive agent check --name hiveAgent

Add target hive agent

HDInsight can use either a default metastore or a custom metastore in the form of an Azure SQL Database.

Choose one of the methods below depending on the type of metastore deployed in your HDInsight cluster.

Default metastore

Note: For step 1, deploying a remote agent is only possible through the CLI.

  1. Deploy and configure a remote hive agent:

    hive agent add hive

    Use the automated deployment parameters or follow the steps for manual deployment.

    As mentioned in the prerequisites, specify a suitable edge node on your HDInsight cluster on which to deploy the hive agent service.

  2. Check that the configuration for the hive agent is correct:

    • UI - the agent will show a healthy connection.

    • CLI

      Example
      hive agent check --name azureAgent

Custom metastore (Azure SQL Database)

  1. Configure a hive agent to connect to an Azure SQL Database:

  2. Check that the configuration for the hive agent is correct:

    • UI - the agent will show a healthy connection.

    • CLI

      Example
      hive agent check --name azureAgent

Next steps

Start defining exclusions and migrating data. You can also create metadata rules and start migrating metadata.