Configure Google Dataproc as a target
Configure Google Dataproc as a target metastore using either the UI or the CLI.
A remote agent is a service deployed on a remote host that connects to Data Migrator to handle metadata transfer. A remote agent must be deployed on the Dataproc cluster with a supported OS. Currently, Ubuntu 20.04 and Ubuntu 18.04 are supported.
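Before deploying the agent, it can help to confirm that the cluster host runs one of the supported releases named above. The following is a minimal sketch, not part of Data Migrator: the function names `is_supported_os` and `check_this_host` are hypothetical, and it assumes the host provides the standard /etc/os-release file with `ID` and `VERSION_ID` fields.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for the releases listed in this
# document (Ubuntu 20.04 and Ubuntu 18.04).
# $1 = distribution ID, $2 = version ID (as found in /etc/os-release).
is_supported_os() {
    os_id="$1"
    os_version="$2"
    [ "$os_id" = "ubuntu" ] || return 1
    case "$os_version" in
        20.04|18.04) return 0 ;;
        *) return 1 ;;
    esac
}

# Reads ID and VERSION_ID from /etc/os-release in a subshell so the
# variables do not leak into the calling shell, then reports the result.
check_this_host() {
    (
        . /etc/os-release
        if is_supported_os "$ID" "$VERSION_ID"; then
            echo "supported: $ID $VERSION_ID"
        else
            echo "unsupported: $ID $VERSION_ID"
            exit 1
        fi
    )
}
```

Run `check_this_host` on the Dataproc cluster host where you plan to install the agent; a non-zero exit status means the release is not one of the two listed here.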
Migration of transactional tables to a Google Dataproc metastore target is currently unsupported.
Prerequisites
See the knowledge base article Setting up a Dataproc agent.
Deploy a remote Hive agent for Dataproc with the CLI
On your local host, run the hive agent add dataproc command with the following parameters to configure your remote Hive agent.
- --host: The host where the remote Hive agent will be deployed.
- --port: The port for the remote Hive agent to use on the remote host. This port is used to communicate with the local Data Migrator server.
- --no-ssl: (Optional) Transport Layer Security (TLS) encryption and certificate authentication are enabled by default between Data Migrator and the remote agent. Use this parameter to disable them.
Transfer the remote server installer to your remote host:
Example of secure transfer from local to remote host
scp /opt/wandisco/hivemigrator/hivemigrator-remote-server-installer.sh myRemoteHost:~
On your remote host, run the installer as the root user (or with sudo) in silent mode:
./hivemigrator-remote-server-installer.sh -- --silent
Note: The agent port defaults to 5052. To set a custom agent port, run the installer with the --agent-port parameter. For example:
./hivemigrator-remote-server-installer.sh -- --silent --agent-port <custom port>
On your remote host, start the remote server service:
service hivemigrator-remote-server start
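The manual steps above (transfer, silent install, service start) can be collected into a small script for review before running them. This is a sketch under stated assumptions: the helper name `print_deploy_commands` is hypothetical, it only prints the commands rather than executing them, and running the remote steps through `ssh` with `sudo` is an assumption about your environment (the steps above simply say to run them on the remote host).

```shell
#!/bin/sh
# Installer location and file name are taken from the transfer example above.
INSTALLER_DIR=/opt/wandisco/hivemigrator
INSTALLER=hivemigrator-remote-server-installer.sh

# Hypothetical helper: prints the manual deployment commands for one host.
# $1 = remote host
# $2 = optional custom agent port (the installer defaults to 5052 if omitted)
print_deploy_commands() {
    remote_host="$1"
    agent_port="${2:-}"
    port_args=""
    if [ -n "$agent_port" ]; then
        port_args=" --agent-port $agent_port"
    fi
    # Step 1: copy the installer to the remote user's home directory.
    echo "scp $INSTALLER_DIR/$INSTALLER $remote_host:~"
    # Step 2: run the installer in silent mode (ssh + sudo is an assumption).
    echo "ssh $remote_host sudo ./$INSTALLER -- --silent$port_args"
    # Step 3: start the remote server service.
    echo "ssh $remote_host sudo service hivemigrator-remote-server start"
}
```

For example, `print_deploy_commands myRemoteHost.example.com 5052` prints three lines that you can inspect and then run one at a time. If sudo on the remote host prompts for a password, the ssh commands may need an interactive terminal.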
Example for remote Dataproc deployment - automated
hive agent add dataproc --name targetautoAgent --autodeploy --ssh-user root --ssh-key /root/.ssh/id_rsa --ssh-port 22 --host myRemoteHost.example.com --port 5052 --config-path <example directory path> --file-system-id mytargethdfs
Example for remote Dataproc deployment - manual
hive agent add dataproc --name targetmanualAgent --host myRemoteHost.example.com --port 5052 --config-path <example directory path> --file-system-id mytargethdfs
If you enter Kerberos and configuration path information for remote agents, ensure the directories and Kerberos principal are correct for your chosen remote host (not your local host).
Configure a secure TLS/SSL keystore connection to an agent
You can set up a keystore to enable TLS between Hive Migrator and remote agents.
See Configure a secure TLS/SSL keystore connection to a remote agent for more information.
Configure Google Dataproc with the UI
From the Dashboard, select an instance under Instances.
Under the Filesystems & Agents menu, select Metastore Agents.
Select Connect to Metastore.
Select the Filesystem in which the data associated with the metadata is held.
For Dataproc agents, this is usually a Google Cloud Storage bucket.
Select Google Cloud Dataproc as the Metastore Type.
Download the installer to the Dataproc cluster virtual machine.
Make the installer script executable.
chmod +x hivemigrator-remote-server-installer.sh
Run the installation command.
./hivemigrator-remote-server-installer.sh -- --silent
Start the service.
service hivemigrator-remote-server start
Enter a Display Name.
Enter the hostname or IP address of the cluster edge node.
Enter the port for communication between the Hive Migrator service and the Dataproc server.
Choose whether to use TLS.
[Optional] - Configure a secure TLS/SSL connection to the agent.
- Under Secure Connection to a Metastore Agent, select Use Keystore for Certificates.
- Enter the following details:
- Keystore Type - Select JKS or PKCS12 as the keystore type.
- Keystore Path - Enter the path to the keystore file. For example: /etc/wandisco/hivemigrator/agent/name/keystore.jks
- Keystore Password - Enter the password for the keystore.
- Certificate Alias - Enter the alias of the certificate stored in the keystore.
- Trusted Certificate Chain Alias - Enter the alias of the trusted certificate chain stored in the keystore.
- Select Check connection to test the connection to the metastore with the details you entered.
If Data Migrator can connect to the remote agent successfully, you can continue configuring the agent.
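If the connection check fails with keystore settings in place, one way to rule out a wrong path, type, or alias is to inspect the keystore with the JDK's keytool utility. The sketch below is not part of Data Migrator: the function names are hypothetical, and it assumes keytool is on the PATH of the host that holds the keystore. keytool prompts for the keystore password, so the password is never placed on the command line.

```shell
#!/bin/sh
# Hypothetical helper: accepts only the two keystore types offered in the UI.
valid_keystore_type() {
    case "$1" in
        JKS|PKCS12) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: lists a single alias from a keystore.
# $1 = keystore path, $2 = keystore type (JKS or PKCS12), $3 = alias
# Returns 2 for an unsupported type and 1 for an unreadable file;
# otherwise returns keytool's own exit status.
show_keystore_alias() {
    keystore_path="$1"
    keystore_type="$2"
    cert_alias="$3"
    if ! valid_keystore_type "$keystore_type"; then
        echo "unsupported keystore type: $keystore_type" >&2
        return 2
    fi
    if [ ! -r "$keystore_path" ]; then
        echo "cannot read keystore: $keystore_path" >&2
        return 1
    fi
    keytool -list -keystore "$keystore_path" -storetype "$keystore_type" -alias "$cert_alias"
}
```

For example, `show_keystore_alias /etc/wandisco/hivemigrator/agent/name/keystore.jks JKS <certificate alias>` reports whether the alias you entered in the Certificate Alias field exists in that keystore.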
Optional Settings:
- Configuration path
- Kerberos Configuration
- Use the principal assigned to the Dataproc cluster.
- Enter a default filesystem override to override the default filesystem URI. We recommend this for complex use cases only.
Select Save.
Configure Google Dataproc with the CLI
Command | Action |
---|---|
hive agent add dataproc | Add a Hive agent for a Google Cloud Dataproc Metastore |
hive agent configure dataproc | Change the configuration of an existing Hive agent for the Google Cloud Dataproc Metastore |
hive agent check | Check whether the Hive agent can connect to the Metastore |
hive agent delete | Delete a Hive agent |
hive agent list | List all configured Hive agents |
hive agent show | Show the configuration for a Hive agent |
hive agent types | List supported Hive agent types |