Version: 2.2

Connect to source and target metastores

Ready to migrate metadata? Hive Migrator, which comes bundled with Data Migrator, lets you transfer metadata from a source metastore to any number of target metastores. Connect to metastores by creating local or remote metadata agents. Creating a remote metadata agent requires the WANdisco® CLI.

info

You must configure a source or target filesystem before you can connect to a metastore.

Migrating transactional tables

The currently supported use-cases for migration of transactional tables are: Hive 3 to Hive 3 or Hive 3 to Azure SQL.

Transactional tables may take longer to appear on the target cluster than expected. Hive Migrator uses a cautious approach to ensure data integrity. The following conditions must be met for table data to appear on the target:

  • All corresponding data files are migrated.
  • The table's transaction writeId is updated, confirming that all data files are on the target.

Hive Migrator uses migration gates to ensure data files are in place before the second condition is met. Planned improvements to migration gates will relax these conditions so that table migrations can proceed without a live data migration, reducing migration times.

Constraints

Migrations that include Hive constraints are not supported.

HBase Storage Handler

Hive Migrator does not currently support migrating tables with HBase Storage Handlers defined. However, you can migrate HBase data held within HDFS using data migrations, making it possible to re-create a target HBase environment that relies on that content. See the known issue regarding the HBase Storage Handler.

Hive Migrator internal database

See the current known issue regarding the size of the internal database used by Hive Migrator.

Connect to metastores with the UI

Remote agent

A Remote Metastore Agent is a separate service deployed on a remote host that connects to Data Migrator and the target metastore. To add a remote agent, add the agent with the CLI and deploy the remote server to the remote host. See Deploy remote agent with the CLI for instructions.

Note: A remote agent must be deployed on the target cluster if the source and target run different major Hive versions or with Google Dataproc as a target.

Apache Hive

Review the basic Prerequisites for Apache Hive before you begin.

CDP 7.1.8

If you’re connecting to a Hive metastore agent and using CDP 7.1.8 with PostgreSQL, you need to create a symlink to postgresql-jdbc.jar. See Missing PostgreSQL driver for more information.
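As a sketch, the symlink might look like the following. Both paths are assumptions for a typical CDP 7.1.8 install; confirm the exact locations in Missing PostgreSQL driver before running this.

```shell
# Sketch: link the PostgreSQL JDBC driver where Hive Migrator can load it.
# Both paths are assumptions; verify them for your environment.
ln -s /usr/share/java/postgresql-jdbc.jar \
      /opt/wandisco/hivemigrator/agent/hive/postgresql-jdbc.jar
service hivemigrator restart
```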

  1. From the Dashboard, select a product under Products.

    info

    Data Migrator will attempt to auto-discover Apache Hive and create a metadata agent for your Hadoop source filesystem. Check whether an existing agent is listed under the Agents panel.

    Auto-discovery will fail if Kerberos is enabled.

  2. Under the Filesystems & Agents menu, select Metastore Agents.

  3. Select Connect to Metastore.

  4. Select the Filesystem in which the data associated with the metadata is held. For Hive agents, this will likely be the Hadoop Distributed File System (HDFS) which contains the data for your tables.

  5. Enter a Display Name.

  6. (Optional, required when using a local agent for a target filesystem) - Enter a value for Configuration Path. The default path will be used if left blank.

    note

    Leave this empty for a local, source Hive metastore agent. When your local agent is on a Hive client node, Data Migrator autodetects the Hive configuration in /etc/hive/conf, so the parameter isn't required and shouldn't be configured.
    For a local agent for a target metastore, or when the Hive configuration isn't located in /etc/hive/conf, supply a path to a directory containing the hive-site.xml, core-site.xml, and hdfs-site.xml for that specific cluster.

  7. (Optional) - Enter Kerberos Configuration. Use the Hive service principal hive/hostname@REALM or a principal of similar permission. The keytab must be readable by the user running the Hive Migrator process and contain the appropriate principal.
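A quick sanity check for the keytab requirement above, as a sketch. The keytab path, service user, and realm are assumptions; substitute your own values.

```shell
# Sketch: confirm the keytab is readable by the Hive Migrator service user
# and lists the expected principal. Path, user, and realm are assumptions.
KEYTAB=/etc/security/keytabs/hive.service.keytab
SERVICE_USER=hive

sudo -u "$SERVICE_USER" test -r "$KEYTAB" && echo "keytab readable"
klist -kt "$KEYTAB"    # look for hive/<hostname>@REALM in the output
```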

  8. (Optional) - Select Override JDBC Connection Properties to override the JDBC properties used to connect to the Hive metastore database. You'll need to enable this option for migrating transactional, managed tables on Hive 3+ on CDP Hadoop clusters.

    Enter the following details for both source and target agents:

    • Connection URL: The JDBC URL for the database.

    • Connection Driver Name: The full class of the JDBC driver. For example, org.postgresql.Driver.

    • Connection Username: The username for your metastore database.

    • Connection Password: The password for your metastore database.

      info

      If you're using MariaDB or MySQL, you need to manually add the JDBC driver to the classpath. See Manual JDBC driver configuration for more information.

  9. (Optional) - Enter Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.

  10. Select Save.

Preferred Operation Mode

After creating your agent, select a preferred operation mode to manage how metadata changes are detected. Select your preferred operation mode before using the agent with any metadata migrations.

Azure SQL DB

The Azure SQL DB agent integrates directly with the external metastore of an HDInsight cluster. An HDInsight cluster can be spun up before or after the agent is created, and the metadata is made available to it through the Azure SQL DB acting as its external metastore.

  1. Add the IP address of the Data Migrator host as an Azure SQL Server firewall rule.
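The firewall rule in step 1 can be created with the Azure CLI, for example. The resource group, server name, and IP address below are placeholder assumptions.

```shell
# Sketch: allow the Data Migrator host through the Azure SQL Server firewall.
# Resource group, server name, and the host IP are assumptions.
az sql server firewall-rule create \
  --resource-group myResourceGroup \
  --server mysqlserver \
  --name AllowDataMigratorHost \
  --start-ip-address 203.0.113.10 \
  --end-ip-address 203.0.113.10
```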

  2. From the Dashboard, select a product under Products.

  3. Under the Filesystems & Agents menu, select Metastore Agents.

  4. Select Connect to Metastore.

  5. Select the Filesystem in which the data associated with the metadata is held. For Azure SQL agents, this will likely be an ADLS2 Container.

  6. Select Azure SQL DB as the Metastore Type.

  7. Enter a Display Name.

  8. Enter the Azure SQL Server Name.

  9. Enter the Azure SQL Database Name.

    note

    Hive Migrator doesn’t support Azure SQL database names containing blank spaces ( ) or hyphens (-).

  10. Enter the ADLS Gen2 Storage Account Name and Container Name.

  11. Select the Authentication Method.

    note

    If you're using the SQL Password authentication method, you’ll need to reenter the SQL database password when updating this agent.

    caution

    Quotation mark characters are not supported in the Database Password. See the Known issue for more information.

  12. (Optional) - Enter a Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.

  13. Select Save.

AWS Glue Data Catalog

info

AWS Glue doesn't support metadata migrations that include Hive constraints.

AWS Glue Data Catalog allows a maximum of 100 objects per request.
When you're using it as a target, make the following change to avoid metadata migration failures due to hitting this limit:

  1. Add the property hivemigrator.migrationBatchSize=100 to /etc/wandisco/hivemigrator/application.properties.
  2. Restart the Hive Migrator service using the command: service hivemigrator restart.
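The two steps above can be sketched as a shell snippet with an idempotent append. PROPS points at a scratch file here so the sketch is safe to run; in production, point it at /etc/wandisco/hivemigrator/application.properties.

```shell
# Append the Glue batch-size property if it isn't already present.
# PROPS uses a scratch file for illustration; in production, set it to
# /etc/wandisco/hivemigrator/application.properties.
PROPS="$(mktemp)"
grep -q '^hivemigrator.migrationBatchSize=' "$PROPS" \
  || echo 'hivemigrator.migrationBatchSize=100' >> "$PROPS"
grep '^hivemigrator.migrationBatchSize=' "$PROPS"
# service hivemigrator restart    # then restart to apply the change
```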
Then connect the agent:

  1. From the Dashboard, select a product under Products.

  2. Under the Filesystems & Agents menu, select Metastore Agents.

  3. Select Connect to Metastore.

  4. Select the Filesystem in which the data associated with the metadata is held. For AWS Glue agents, this will likely be an S3 object store.

  5. Select AWS Glue as the Metastore Type.

  6. Enter a Display Name.

  7. Select the AWS Catalog Credentials Provider.

    note

    If you're using the Access Key and Secret credentials provider, you’ll need to reenter the access and secret keys when updating this agent.

  8. Enter a Virtual Private Cloud or AWS Glue Service endpoint.

  9. Enter the AWS Region.

  10. (Optional) - Enter a Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.

  11. Select Save.

Connect to metastores with the CLI

To deploy remote metadata agents, you must connect to the CLI.

  • Apache Hive
  • Azure SQL DB
  • AWS Glue Data Catalog
info

It's not possible to adjust some TLS parameters for remote metastore agents after creation. Find more information in the following Knowledge base article.

Apache Hive

| Command | Action |
| --- | --- |
| hive agent add hive | Add a Hive agent for a local or remote Apache Hive Metastore |
| hive agent configure hive | Change the configuration of an existing Hive agent for the Apache Hive Metastore |
| hive agent check | Check whether the Hive agent can connect to the Metastore |
| hive agent delete | Delete a Hive agent |
| hive agent list | List all configured Hive agents |
| hive agent show | Show the configuration for a Hive agent |
| hive agent types | List supported Hive agent types |

Azure SQL DB

| Command | Action |
| --- | --- |
| hive agent add azure | Add a Hive agent for an Azure SQL connection |
| hive agent configure azure | Change the configuration of an existing Hive agent for the Azure SQL database server |
| hive agent check | Check whether the Hive agent can connect to the Metastore |
| hive agent delete | Delete a Hive agent |
| hive agent list | List all configured Hive agents |
| hive agent show | Show the configuration for a Hive agent |
| hive agent types | List supported Hive agent types |

AWS Glue Data Catalog

| Command | Action |
| --- | --- |
| hive agent add glue | Add a Hive agent for an AWS Glue Data Catalog |
| hive agent configure glue | Change the configuration of an existing Hive agent for the AWS Glue Data Catalog |
| hive agent check | Check whether the Hive agent can connect to the Metastore |
| hive agent delete | Delete a Hive agent |
| hive agent list | List all configured Hive agents |
| hive agent show | Show the configuration for a Hive agent |
| hive agent types | List supported Hive agent types |

Connect to metastores and deploy remote agents with the CLI

A Remote Metastore Agent is a separate service deployed on a remote host that connects to Data Migrator and the target metastore.

To add a remote agent, add the agent with the hive agent add hive command and install the remote server with the /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh bundled with Data Migrator.

Note: A remote agent must be deployed on the target cluster if the source and target run different major Hive versions or with Google Dataproc as a target.

When deploying a remote agent on an environment where Hive uses MySQL, the JDBC Driver for MySQL must be copied into /opt/wandisco/hivemigrator-remote-server/agent/hive and made executable on the remote server.
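As a sketch, copying the driver into place might look like the following. The jar name and its source location are assumptions; use the driver that matches your MySQL version.

```shell
# Sketch: place the MySQL JDBC driver on the remote agent host and make it
# executable, as required above. The jar name and source path are assumptions.
scp mysql-connector-java.jar myRemoteHost:/tmp/
ssh myRemoteHost '
  cp /tmp/mysql-connector-java.jar /opt/wandisco/hivemigrator-remote-server/agent/hive/
  chmod +x /opt/wandisco/hivemigrator-remote-server/agent/hive/mysql-connector-java.jar
'
```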

note

A Remote Metastore Agent is a stateful service designed primarily to ensure data consistency. Because of this statefulness, a remote agent works best without additional load balancing or custom failover mechanisms; using load balancing or failover mechanisms with Remote Metastore Agents is not supported.

Configure a secure Transport Layer Security (TLS)/Secure Sockets Layer (SSL) keystore connection to a remote agent

You can set up a keystore to enable TLS/SSL between Hive Migrator and your remote agents.

See Configure a secure TLS/SSL keystore connection to a remote agent for more information.

CDP 7.1.8

If you’re connecting to a Hive metastore agent and using CDP 7.1.8 with PostgreSQL, you need to create a symlink to postgresql-jdbc.jar. See Missing PostgreSQL driver for more information.

Apache Hive

info

If you enter Kerberos and configuration path information for remote agents, ensure that the directories and Kerberos principal are correct for your chosen remote host (not your local host).

Follow these steps to deploy a remote Hive agent for Apache Hive:

  1. Add the agent to Data Migrator.

    On your local host, run the hive agent add hive command with the following parameters to configure your remote Hive agent.

    • --name The ID for the new Hive agent.
    • --host The host where the remote Hive agent will be deployed.
    • --port The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
    • --no-ssl (Optional) TLS encryption and certificate authentication is enabled by default between Data Migrator and the remote agent. Use this parameter to disable it.
    • --file-system-id The name of the filesystem that will be associated with this agent.
    • --default-fs-override (Optional) Override for the default filesystem URI instead of a filesystem name.

    If the configuration files are not located on the default path, use one of the following parameters:

    • --config-path The path to the directory containing the Hive configuration files.
    • --config-file A file containing all the paths as a comma-delimited list.
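For illustration, a --config-file is a plain text file holding the comma-delimited paths; the /etc/targetcluster paths below are assumptions.

```shell
# Sketch: build a --config-file listing the target cluster's Hive config
# files as one comma-delimited line. The /etc/targetcluster paths are
# illustrative assumptions.
cat > /tmp/hive-agent-config-paths.txt <<'EOF'
/etc/targetcluster/conf/hive-site.xml,/etc/targetcluster/conf/core-site.xml,/etc/targetcluster/conf/hdfs-site.xml
EOF
cat /tmp/hive-agent-config-paths.txt
```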

    If Kerberos is enabled, use the following parameters:

    • --kerberos-principal Kerberos principal to access the remote Hive service.
    • --kerberos-keytab Path to the Kerberos keytab on the remote host containing the principal to access the remote Hive service.

    You will require JDBC details to migrate transactional, managed tables on Hive 3+ on a CDP Hadoop Cluster. Override JDBC connection properties with the following parameters:

    • --jdbc-url The JDBC URL for the database.
    • --jdbc-driver-name The full class name of the JDBC driver.
    • --jdbc-username The username for the database.
    • --jdbc-password The password to connect to the database.
    Example
    hive agent add hive --name targetmanualAgent --host myRemoteHost.example.com --port 5052 --kerberos-keytab /path/to/the/hive.keytab --kerberos-principal hive/_HOST@REMOTEREALM.COM --config-path /<example directory path> --file-system-id mytargethdfs
    info

    To deploy the remote agent automatically, run the hive agent add hive command with the automated deployment parameters and skip steps 2, 3, and 4.

    You must set up root SSH access from the host to the remote agent host.
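Setting up root SSH access can be sketched as follows; the hostname is a placeholder assumption.

```shell
# Sketch: passwordless root SSH from the Data Migrator host to the remote
# agent host, needed for automated deployment. Hostname is an assumption.
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519    # skip if a key exists
ssh-copy-id root@myRemoteHost.example.com
ssh root@myRemoteHost.example.com true              # verify passwordless login
```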

  2. Copy the installer file.

    Transfer the remote server installer to your remote host:

    Example
    scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
  3. Run the installer.

    On your remote host, run the installer as the root (or sudo) user in silent mode. You'll find the --config value string in the output of the hive agent add command in step 1, or in the installationCommand value in the output of the hive agent show --name <agent name> command.

    Example
    ./hivemigrator-remote-server-installer.sh -- --silent --config <example config string>
  4. Start the service.

    On your remote host, start the remote server service:

    Example
    service hivemigrator-remote-server start
  5. Confirm health.

    Check the agent health status with the hive agent check command in the CLI.

    hive agent check --name targetmanualAgent example output
    {
      "name": "targetmanualAgent",
      "location": "REMOTE",
      "config": {
        "agentType": "HIVE",
        "remoteAgentConfig": {
    ...

See the troubleshooting guide for more information.

Azure

Follow these steps to deploy a remote Hive agent for Azure:

  1. On your local host, run the hive agent add azure command with the following parameters to configure your remote Hive agent.

    • --name The ID for the new Hive agent.
    • --db-server-name The Azure SQL database server name.
    • --database-name The Azure SQL database name.
      note

      Hive Migrator doesn’t support Azure SQL database names containing blank spaces ( ) or hyphens (-).

    • --storage-account The name of the ADLS Gen2 storage account.
    • --container-name The name of the container in the ADLS Gen2 storage account.
    • --auth-method Azure SQL database connection authentication method.
    • --host The host where the remote Hive agent will be deployed.
    • --port The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
    • --no-ssl (Optional) TLS encryption and certificate authentication is enabled by default between Data Migrator and the remote agent. Use this parameter to disable it.
    • --file-system-id The name of the filesystem that will be associated with this agent.
    • --default-fs-override (Optional) Override for the default filesystem URI instead of a filesystem name.
    • --insecure (Optional) Define an insecure connection (TLS disabled) to the Azure SQL database server.

    If you're using the SQL_PASSWORD authentication method, use the following parameters:

    • --database-user The username to access the database.
    • --database-password The user password to access the database.

    If you're using the AD_MSI authentication method with user-assigned identity, use the following parameter:

    • --client-id The client ID of your Azure-managed identity.
    Example
    hive agent add azure --name azureRemoteAgent --db-server-name mysqlserver.database.windows.net --database-name mydb1 --auth-method AD_MSI --client-id b67f67ex-ampl-e2eb-bd6d-client9385id --storage-account myadls2 --container-name mycontainer --file-system-id myadls2storage --host myRemoteHost.example.com --port 5052
    info

    To deploy the remote agent automatically, run the hive agent add azure command with the automated deployment parameters and skip steps 2, 3, and 4.

    Set up root SSH access from the host to the remote agent host.

  2. Transfer the remote server installer to your remote host (Azure VM, HDI cluster node):

    Example of secure transfer from local to remote host
    scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
  3. On your remote host, run the installer as root (or sudo) user in silent mode:

    Example
    ./hivemigrator-remote-server-installer.sh -- --silent --config <example config string>

    Find the --config string in the output of the hive agent add command in step 1.

  4. On your remote host, start the remote server service:

    Example
    service hivemigrator-remote-server start
info

If you enter Kerberos and configuration path information for remote agents, ensure that the directories and Kerberos principal are correct for your chosen remote host (not your local host).

AWS Glue Data Catalog

Follow these steps to deploy a remote Hive agent for AWS Glue:

  1. On your local host, run the hive agent add glue command with the following parameters to configure your remote Hive agent.

    • --name The ID for the new Hive agent.
    • --glue-endpoint The AWS Glue service endpoint for connections to the data catalog.
    • --aws-region The AWS region that your data catalog is located in (default is us-east-1).
    • --host The host where the remote Hive agent will be deployed.
    • --port The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
    • --no-ssl (Optional) TLS encryption and certificate authentication is enabled by default between Data Migrator and the remote agent. Use this parameter to disable it.
    • --file-system-id The name of the filesystem that will be associated with this agent.
    • --default-fs-override (Optional) Override for the default filesystem URI instead of a filesystem name.

    Optionally, you can specify the AWS catalog credentials provider factory class. If you don't set this parameter, the default is DefaultAWSCredentialsProviderChain. If you enter the --access-key and --secret-key parameters, the credentials provider automatically defaults to StaticCredentialsProviderFactory.

    If you're using the Static Credentials Provider Factory class, use the following parameters:

    • --access-key The AWS access key. In the UI, this is called Access Key.
    • --secret-key The AWS secret key. In the UI, this is called Secret Key.

    To specify optional Glue parameters, use the following:

    • --glue-catalog-id The AWS account ID to access the data catalog. This is used if the data catalog is owned by a different account than the one provided by the credentials provider and cross-account access has been granted.
    • --glue-max-retries The maximum number of retries the Glue client performs after an error.
    • --glue-max-connections The maximum number of parallel connections the Glue client allocates.
    • --glue-max-socket-timeout The maximum time the Glue client waits for data on an established connection before timing out.
    • --glue-connection-timeout The maximum time the Glue client allows to establish a connection.
    Example
    hive agent add glue --name glueAgent --access-key ACCESS6HCFPAQIVZTKEY --secret-key SECRET1vTMuqKOIuhET0HAI78UIPfSRjcswTKEY --glue-endpoint glue.eu-west-1.amazonaws.com --aws-region eu-west-1 --file-system-id mys3bucket --host myRemoteHost.example.com --port 5052
    info

    To deploy the remote agent automatically, run the hive agent add glue command with the automated deployment parameters and skip steps 2, 3, and 4.

    Set up root SSH access from the host to the remote agent host.

  2. Transfer the remote server installer to your remote host (Amazon EC2 instance):

    Example of secure transfer from local to remote host
    scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
  3. On your remote host, run the installer as root (or sudo) user in silent mode:

    ./hivemigrator-remote-server-installer.sh -- --silent --config <example config string>

    Find the --config string in the output of the hive agent add command in step 1.

  4. On your remote host, start the remote server service:

    service hivemigrator-remote-server start