Connect to source and target metastores
Ready to migrate metadata? Hive Migrator, which comes bundled with Data Migrator, lets you transfer metadata from a source metastore to any number of target metastores. Connect to metastores by creating local or remote metadata agents. If you're creating a remote metadata agent, you have to use the Data Migrator CLI for this.
Supported metadata sources are: Apache Hive and AWS Glue Data Catalog.
Supported metadata targets are: Apache Hive, Azure SQL DB, AWS Glue Data Catalog, Databricks, Google Dataproc, and Snowflake.
To configure Databricks as a target, see Configure Databricks as a target.
To configure Google Dataproc as a target, see Configure Google Dataproc as a target.
To configure Snowflake as a target, see Configure Snowflake as a target.
The currently supported use-cases for migration of transactional tables are: Hive 3 to Hive 3 or Hive 3 to Azure SQL.
Transactional tables may take longer to appear on the target cluster than expected. Hive Migrator uses a cautious approach to ensure data integrity. The following conditions must be met for table data to appear on the target:
- All corresponding data files are migrated.
- The table's transaction
writeId
is updated, confirming that all data files are on the target.
Hive Migrator uses migration gates to ensure data files are in place before meeting the second condition. Coming improvements to migration gates will change the conditions so table migrations may proceed without the need for the data migration to be live, reducing migration times.
Migrations that include Hive constraints are not supported.
Hive Migrator does not currently support migrating tables with HBase Storage Handlers defined. However, you can migrate HBase data held within HDFS using data migrations, making it possible to re-create a target HBase environment that relies on that content. See the following known issue regarding issues with the HBase Storage Handler.
See the current known issue regarding the size of the internal database used by Hive Migrator.
Connect to metastores with the UI
A Remote Metastore Agent is a separate service deployed on a remote host that connects to Data Migrator and the target metastore. To add a remote agent, add the agent with the CLI and deploy the remote server to the remote host. See Deploy remote agent with the CLI to add a remote agent with the CLI.
Note: A remote agent must be deployed on the target cluster if the source and target run different major Hive versions or with Google Dataproc as a target.
Apache Hive
Review the basic Prerequisites for Apache Hive before you begin.
If you’re connecting to a Hive metastore agent and using CDP 7.1.8 with Postgresql, you need to create a symlink to postgresql-jdbc.jar
. See Missing PostgreSQL driver for more information.
From the Dashboard, select an instance under Instances.
infoData Migrator will attempt to auto-discover Apache Hive and create a metadata agent for your Hadoop source filesystem. Check whether an existing agent is listed under the Agents panel.
Auto-discovery will fail if Kerberos is enabled.
Under the Filesystems & Agents menu, select Metastore Agents.
Select Connect to Metastore.
Select the Filesystem in which the data associated with the metadata is held. For Hive agents, this will likely be the Hadoop Distributed File System (HDFS) which contains the data for your tables.
Enter a Display Name.
(Optional, required when using a local agent for a target filesystem) - Enter a value for Configuration Path. The default path will be used if left blank.
noteLeave empty for a local, source Hive metastore agent. Data Migrator will autodetect Hive configuration in /etc/hive/conf when your local agent is located on a Hive client node, the parameter won't be required and shouldn't be configured.
For a local agent for a target metastore or when Hive config is not located in /etc/hive/conf, supply a path containing the hive-site.xml, core-site.xml, and hdfs-site.xml for that specific cluster.(Optional) - Enter Kerberos Configuration. Use the Hive service principal
hive/hostname@REALM
or a principal of similar permission. The keytab must be readable by the user running the Hive Migrator process and contain the appropriate principal.(Optional) - Select Override JDBC Connection Properties to override the JDBC properties used to connect to the Hive metastore database. You'll need to enable this option for migrating transactional, managed tables on Hive 3+ on CDP Hadoop clusters.
Enter the following details for both source and target agents:
Connection URL: The JDBC URL for the database.
Connection Driver Name: The full class of the JDBC driver. For example,
org.postgresql.Driver
.Connection Username: The username for your metastore database.
Connection Password: The password for your metastore database.
infoIf you're using MariaDB or MySQL, you need to manually add the JDBC driver to the classpath on your Hivemigrator host. See Manual JDBC driver configuration for more information.
(Optional) - Enter Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.
Select Save.
After creating your agent, select a preferred operation mode to manage how metadata changes are detected. Select your preferred operation mode before using the agent with any metadata migrations.
Azure SQL DB
The Azure SQL DB agent integrates directly with the external metastore of an HDInsight cluster. A HDI cluster can be spun up before or after the agent is created, and the metadata can be made available to it via the Azure SQL DB as its external metastore.
Add the IP address of the Data Migrator host as a Azure SQL Server firewall rule.
From the Dashboard, select an instance under Instances.
Under the Filesystems & Agents menu, select Metastore Agents.
Select Connect to Metastore.
Select the Filesystem in which the data associated with the metadata is held. For Azure SQL agents, this will likely be an ADLS2 Container.
Select Azure SQL DB as the Metastore Type.
Enter a Display Name.
Enter the Azure SQL Server Name.
Enter the Azure SQL Database Name.
noteHive Migrator doesn’t support Azure SQL database names containing blank spaces (
-
), semicolons (;
), open curly braces ({
) or close curly braces (}
). Additionaly, see Microsoft's documentation for a list special characters which can't be used.Enter the ADLS Gen2 Storage Account Name and Container Name.
Select the Authentication Method.
noteIf you're using the SQL Password authentication method, you’ll need to reenter the SQL database password when updating this agent.
cautionThe use of the quotation mark character is not supported when included in the Database Password. See the Known issue for more information.
(Optional) - Enter a Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.
Select Save.
AWS Glue Data Catalog
Migrations that include Hive constraints are not supported.
AWS Glue doesn't support metadata migrations that include Hive constraints.
AWS Glue Data Catalog allows a maximum of 100 objects per request.
When you're using it as a target, make the following change to avoid metadata migration failures due to hitting this limit:
- Add the property
hivemigrator.migrationBatchSize=100
to/etc/wandisco/hivemigrator/application.properties
. - Restart the Hive Migrator service using the command:
service hivemigrator restart
.
From the Dashboard, select an instance under Instances.
Under the Filesystems & Agents menu, select Metastore Agents.
Select Connect to Metastore.
Select the Filesystem in which the data associated with the metadata is held. For AWS Glue agents, this will likely be an S3 object store.
Select AWS Glue as the Metastore Type.
Enter a Display Name.
Select the AWS Catalog Credentials Provider.
noteIf you're using the Access Key and Secret credentials provider, you’ll need to reenter the access and secret keys when updating this agent.
Enter a Virtual Private Cloud or AWS Glue Service endpoint.
Enter the AWS Region.
(Optional) - Enter a Default Filesystem Override to override the default filesystem URI. Recommended for complex use cases only.
Select Save.
Connect to metastores with the CLI
To deploy remote metadata agents, you must connect to the CLI.
- Apache Hive
- Azure SQL DB
- AWS Glue Data Catalog
It's not possible to adjust some TLS parameters for remote metastore agents after creation. Find more information in the following Knowledge base article.
Apache Hive
Command | Action |
---|---|
hive agent add hive | Add a Hive agent for a local or remote Apache Hive Metastore |
hive agent configure hive | Change the configuration of an existing Hive agent for the Apache Hive Metastore |
hive agent check | Check whether the Hive agent can connect to the Metastore |
hive agent delete | Delete a Hive agent |
hive agent list | List all configured Hive agents |
hive agent show | Show the configuration for a Hive agent |
hive agent types | List supported Hive agent types |
Azure SQL DB
Command | Action |
---|---|
hive agent add azure | Add a Hive agent for an Azure SQL connection |
hive agent configure azure | Change the configuration of an existing Hive agent for the Azure SQL database server |
hive agent check | Check whether the Hive agent can connect to the Metastore |
hive agent delete | Delete a Hive agent |
hive agent list | List all configured Hive agents |
hive agent show | Show the configuration for a Hive agent |
hive agent types | List supported Hive agent types |
AWS Glue Data Catalog
Command | Action |
---|---|
hive agent add glue | Add a Hive agent for an AWS Glue Data Catalog |
hive agent configure glue | Change the configuration of an existing Hive agent for the AWS Glue Data Catalog |
hive agent check | Check whether the Hive agent can connect to the Metastore |
hive agent delete | Delete a Hive agent |
hive agent list | List all configured Hive agents |
hive agent show | Show the configuration for a Hive agent |
hive agent types | List supported Hive agent types |
Connect to metastores and deploy remote agents with the CLI
A Remote Metastore Agent is a separate service deployed on a remote host that connects to Data Migrator and the target metastore.
To add a remote agent, add the agent with the hive agent add hive
command and install the remote server with the /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh
bundled with Data Migrator.
Note: A remote agent must be deployed on the target cluster if the source and target run different major Hive versions or with Google Dataproc as a target.
When deploying a remote agent on an environment where Hive uses MySQL, the JDBC Driver for MySQL must be copied into /opt/wandisco/hivemigrator-remote-server/agent/hive
and made executable on the remote server.
A Remote Metastore Agent is a stateful service designed primarily to ensure data consistency. Due to its inherent statefulness, a remote agent functions optimally without additional load balancing or custom failover mechanisms. Please note that using load balancing or failover mechanisms with Remote Metastore Agents is not supported.
You can set up a keystore to enable TLS/SSL between Hive Migrator and your remote agents.
See Configure a secure TLS/SSL keystore connection to a remote agent for more information.
If you’re connecting to a Hive metastore agent and using CDP 7.1.8 with Postgresql, you need to create a symlink to postgresql-jdbc.jar
. See Missing PostgreSQL driver for more information.
Apache Hive
If you enter Kerberos and configuration path information for remote agents, ensure that the directories and Kerberos principal are correct for your chosen remote host (not your local host).
Follow these steps to deploy a remote Hive agent for Apache Hive:
Add the agent to Data Migrator.
On your local host, run the
hive agent add hive
command with the following parameters to configure your remote Hive agent.--name
The ID for the new Hive agent.--host
The host where the remote Hive agent will be deployed.--port
The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.--no-ssl
(Optional) TLS encryption and certificate authentication is enabled by default between Data Migrator and the remote agent. Use this parameter to disable it.--file-system-id
The name of the filesystem that will be associated with this agent.--default-fs-override
(Optional) Override for the default filesystem URI instead of a filesystem name.
If the configuration files are not located on the default path, use one of the following parameters:
--config-path
The path to the directory containing the Hive configuration files.--config-file
A file containing all the paths as a comma-delimited list.
If Kerberos is enabled, use the following parameters:
--kerberos-principal
Kerberos principal to access the remote Hive service.--kerberos-keytab
Path to the Kerberos keytab on the remote host containing the principal to access the remote Hive service.
You will require JDBC details to migrate transactional, managed tables on Hive 3+ on a CDP Hadoop Cluster. Override JDBC connection properties with the following parameters:
--jdbc-url
The JDBC URL for the database.--jdbc-driver-name
The full class name of the JDBC driver.--jdbc-username
The username for the database.--jdbc-password
the password to connect to the database.
Examplehive agent add hive --name targetmanualAgent --host myRemoteHost.example.com --port 5052 --kerberos-keytab /path/to/the/hive.keytab --kerberos-principal hive/_HOST@REMOTEREALM.COM --config-path /<example directory path> --file-system-id mytargethdfs
infoSkip steps 2, 3, and 4 to automatically deploy the remote agent using the
hive agent add hive
command and parameters with the automated deployment parameters.You must set up root SSH access from the host to the remote agent host.
Copy the installer file.
Transfer the remote server installer to your remote host:
Examplescp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
Run the installer.
On your remote host, run the installer as root (or sudo) user in silent mode. You'll find the
--config
value string in the output of thehive agent add
command in step 1, or from theinstallationCommand
value from the output of thehive agent show --name <agent name>
command.Example./hivemigrator-remote-server-installer.sh -- --silent --config <example config string>
Start the service.
On your remote host, start the remote server service:
Exampleservice hivemigrator-remote-server start
Confirm health.
Check the agent health status with the
hive agent check
command in the CLI.hive agent check --name targetmanualAgent example output{
"name": "targetmanualAgent",
"location": "REMOTE",
"config": {
"agentType": "HIVE",
"remoteAgentConfig": {
...
See the troubleshooting guide for more information.
Azure
Follow these steps to deploy a remote Hive agent for Azure:
On your local host, run the
hive agent add azure
command with the following parameters to configure your remote Hive agent.--name
The ID for the new Hive agent.--db-server-name
The Azure SQL database server name.--database-name
The Azure SQL database name.noteHive Migrator doesn’t support Azure SQL database names containing blank spaces (
-
), semicolons (;
), open curly braces ({
) or close curly braces (}
). Additionaly, see Microsoft's documentation for a list special characters which can't be used.--storage-account
The name of the ADLS Gen2 storage account.--container-name
The name of the container in the ADLS Gen2 storage account.--auth-method
Azure SQL database connection authentication method.--host
The host where the remote Hive agent will be deployed.--port
The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.--no-ssl
(Optional) TLS encryption and certificate authentication is enabled by default between Data Migrator and the remote agent. Use this parameter to disable it.--file-system-id
The name of the filesystem that will be associated with this agent.--default-fs-override
(Optional) Override for the default filesystem URI instead of a filesystem name.--insecure
(Optional) Define an insecure connection (TLS disabled) to the Azure SQL database server.
If you're using the SQL_PASSWORD authentication method, use the following parameters:
--database-user
The username to access the database.--database-password
The user password to access the database.
If you're using the AD_MSI authentication method with user-assigned identity, use the following parameter:
--client-id
The client ID of your Azure-managed identity.
Examplehive agent add azure --name azureRemoteAgent --db-server-name mysqlserver.database.windows.net --database-name mydb1 --auth-method AD_MSI --client-id b67f67ex-ampl-e2eb-bd6d-client9385id --storage-account myadls2 --container-name mycontainer --file-system-id myadls2storage --host myRemoteHost.example.com --port 5052
infoSkip steps 2, 3, and 4 to automatically deploy the remote agent using the
hive agent add azure
command and parameters with the automated deployment parameters.Set up root SSH access from the host to the remote agent host.
Transfer the remote server installer to your remote host (Azure VM, HDI cluster node):
Example of secure transfer from local to remote hostscp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
On your remote host, run the installer as root (or sudo) user in silent mode:
Example./hivemigrator-remote-server-installer.sh -- --silent --config <example config string>
Find the
--config
string in the output of thehive agent add
command in step 1.On your remote host, start the remote server service:
Exampleservice hivemigrator-remote-server start
If you enter Kerberos and configuration path information for remote agents, ensure that the directories and Kerberos principal are correct for your chosen remote host (not your local host).
AWS Glue Data Catalog
Follow these steps to deploy a remote Hive agent for AWS Glue:
On your local host, run the
hive agent add glue
command with the following parameters to configure your remote Hive agent.--name
The ID for the new Hive agent.--glue-endpoint
The AWS Glue service endpoint for connections to the data catalog.--aws-region
The AWS region that your data catalog is located in (default isus-east-1
).--host
The host where the remote Hive agent will be deployed.--port
The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.--no-ssl
(Optional) TLS encryption and certificate authentication is enabled by default between Data Migrator and the remote agent. Use this parameter to disable it.--file-system-id
The name of the filesystem that will be associated with this agent.--default-fs-override
(Optional) Override for the default filesystem URI instead of a filesystem name.
Optionally you can specify the AWS catalog credentials provider factory class. If you don't use this parameter, the default is DefaultAWSCredentialsProviderChain. If you enter the
--access-key
and--secret-key
parameters, the credentials provider defaults automatically to StaticCredentialsProviderFactory.--credentials-provider
The AWS catalog credentials provider factory class.
If you're using the Static Credentials Provider Factory class, use the following parameters:
--access-key
The AWS access key. In the UI, this is called Access Key.--secret-key
The AWS secret key. In the UI, this is called Secret Key.
To specify optional Glue parameters, use the following:
--glue-catalog-id
The AWS account ID to access the data catalog. This is used if the data catalog is owned by a different account than the one provided by the credentials provider and cross-account access has been granted.--glue-max-retries
The maximum number of retries the Glue client performs after an error.--glue-max-connections
The maximum number of parallel connections the Glue client allocates.--glue-max-socket-timeout
The maximum time the Glue client allows for an established connection to time out.--glue-connection-timeout
The maximum time the Glue client allows to establish a connection.
Examplehive agent add glue --name glueAgent --access-key ACCESS6HCFPAQIVZTKEY --secret-key SECRET1vTMuqKOIuhET0HAI78UIPfSRjcswTKEY --glue-endpoint glue.eu-west-1.amazonaws.com --aws-region eu-west-1 --file-system-id mys3bucket --host myRemoteHost.example.com --port 5052
infoSkip steps 2, 3, and 4 to automatically deploy the remote agent using the
hive agent add glue
command and parameters with the automated deployment parameters.Set up root SSH access from the host to the remote agent host.
Transfer the remote server installer to your remote host (Amazon EC2 instance):
Example of secure transfer from local to remote hostscp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
On your remote host, run the installer as root (or sudo) user in silent mode:
./hivemigrator-remote-server-installer.sh -- --silent --config <example config string>
Find the
--config
string in the output of thehive agent add
command in step 1.On your remote host, start the remote server service:
service hivemigrator-remote-server start