Configure Databricks as a target
Databricks is currently available as a preview feature and is under development. If you use Databricks as a target metastore with Data Migrator and have feedback to share, please contact Support.
This feature currently doesn't support migrating transactional ACID tables or INSERT_ONLY transactional tables, regardless of conversion options.
The feature is automatically enabled.
Currently, Hive Migrator doesn't support migrating source Delta Lake tables to a Databricks agent with Unity Catalog enabled.
If you have Delta Lake tables on the source and need to migrate them to Databricks, disable Unity Catalog on the agent.
Migrations that include Hive constraints are not supported.
Configure a Databricks metadata agent
Use Data Migrator to integrate with Databricks and migrate structured data from Hadoop to Databricks tables, including automatic conversion from source Hive formats to the Delta Lake format used in Databricks.
Configure a Databricks metadata agent in Data Migrator using the UI or CLI, and connect it to your Databricks cluster.
Prerequisites
To ensure a successful migration to Databricks, the source tables must be in one of the following formats:
- CSV
- JSON
- AVRO
- ORC
- PARQUET
- Text
Ensure you have the following before you start:
- A Databricks cluster
- A Databricks File System (DBFS)
- Cloud storage mounted onto the DBFS
- The Databricks JDBC driver installed on the Data Migrator host (see Install Databricks driver)
Example: Script to mount ADLS Gen2 or blob storage with Azure Blob File System
# OAuth configuration for the service principal that Databricks uses to access the storage account.
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Mount the container onto DBFS at /mnt/<mount-name>.
# Optionally, you can add example-directory-name to the source URI of your mount point.
dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
Replace:
- <application-id> with the Application (client) ID for the Azure Active Directory application.
- <scope-name> with the Databricks secret scope name.
- <service-credential-key-name> with the name of the key containing the client secret.
- <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
- <container-name> with the name of a container in the ADLS Gen2 storage account.
- <storage-account-name> with the ADLS Gen2 storage account name.
- <mount-name> with the name of the intended mount point in DBFS.
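To confirm the storage is mounted before you start a migration, you can list the mount from a Databricks notebook. This is an optional check rather than part of the documented workflow; <mount-name> is the same placeholder used above.
Example: Check the mount from a Databricks notebook
# Optional: confirm the storage is mounted and list its contents.
# Replace <mount-name> with the mount name used above.
display(dbutils.fs.mounts())
display(dbutils.fs.ls("/mnt/<mount-name>"))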
Install Databricks driver
To install the JDBC driver:
1. Download the Databricks JDBC driver.
   Note: Data Migrator only supports JDBC driver version 2.6.25 or higher.
2. Unzip the package and upload the DatabricksJDBC42.jar file to the Data Migrator host machine.
3. Move the DatabricksJDBC42.jar file to the Data Migrator directory: /opt/wandisco/hivemigrator/agent/databricks
4. Change ownership of the jar file to the Hive Migrator system user and group:
   chown hive:hadoop /opt/wandisco/hivemigrator/agent/databricks/DatabricksJDBC42.jar
Configure Databricks as a target with the UI
If you migrate to Unity Catalog, Data Migrator only supports migrating source Hive tables into managed Delta tables on Databricks.
To add Databricks from your Dashboard:
1. From the Dashboard, select an instance under Instances.
2. Under Filesystems & Agents, select Metastore Agents.
3. Select Connect to Metastore.
4. Select the filesystem.
5. Select Databricks as the Metastore Type.
6. Enter a Display Name.
7. Enter the JDBC Server Hostname, Port, and HTTP Path.
8. Enter the Databricks Access Token.
   Note: You'll need to reenter the access token when updating this agent.
9. (Optional) Enter the name of your Databricks Unity Catalog.
   Info: You can't update an agent's Unity Catalog while it's in an active migration.
10. Select Convert to Delta Lake if you want to convert your tables.
11. Select Delete after conversion to delete the underlying table data and metadata from the Filesystem Mount Point location after it has been converted to Delta Lake in Databricks.
    Info: Only use this option if you're performing one-time migrations for the underlying table data. The Databricks agent doesn't support continuous (live) updates of table data if you're converting to Delta Lake in Databricks.
12. Enter the Filesystem Mount Point. The filesystem that contains the data you want to migrate must be mounted onto your DBFS. Enter the mounted container's path on the DBFS.
    Example: Mounted container's path: /mounted/container/path
13. Enter a path for Default Filesystem Override.
    If you select Convert to Delta Lake, enter the location on the DBFS to store the tables converted to Delta Lake. To store Delta Lake tables on cloud storage, enter the path to the mount point and the path on the cloud storage.
    Example: Location on the DBFS to store tables converted to Delta Lake: dbfs:<location>
    Example: Cloud storage location: dbfs:/mnt/adls2/storage_account/delta_tables
    If you don't select Convert to Delta Lake, enter the mount point.
    Example: Filesystem mount point: dbfs:<value of Filesystem Mount Point>
14. Select Save.
Next steps
Create a metadata migration using the Databricks agent you just configured.
Monitor the following from the Dashboard:
- The progress of the migration.
- The status of the migration.
- The health of your agent connection.
To view the connection status:
- Select Check status from the ellipsis.
- Select Settings.
- Select View agent.
Databricks caching can result in data not being visible on the target. Refresh the cache by issuing a REFRESH TABLE command on the target. See the Databricks documentation on REFRESH TABLE to learn more.
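For example, you can refresh a migrated table from a Databricks notebook before querying it. The schema and table names below are placeholders, not values created by Data Migrator.
Example: Refresh a migrated table from a Databricks notebook
# Clear Databricks' cached metadata and file listing for a migrated table.
# Replace <schema> and <table> with your own database and table names.
spark.sql("REFRESH TABLE <schema>.<table>")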
Configure Databricks as a target with the CLI
To add Databricks as a metadata agent with the CLI, use the commands below and refer to the example.
Example: Configure a Databricks metadata agent with the CLI
hive agent add databricks --name databricksAgent --jdbc-server-hostname mydbcluster.cloud.databricks.com --jdbc-port 443 --jdbc-http-path sql/protocolv1/o/8445611123456789/0234-125567-testy978 --access-token daexamplefg123456789t6f0b57dfdtoken4 --file-system-id mys3bucket --default-fs-override dbfs:/mnt/mybucketname --fs-mount-point /mnt/mybucket --convert-to-delta
Command | Action |
---|---|
hive agent add databricks | Add an agent for a Databricks metastore. |
hive agent configure databricks | Update the configuration of an existing agent for the Databricks metastore. |
hive agent check | Check whether the agent can connect to the metastore. |
hive agent delete | Delete an agent. |
hive agent list | List all configured agents. |
hive agent show | View the configuration for an agent. |
hive agent types | List supported agent types. |
hive agent add databricks and hive agent configure databricks
Mandatory parameters
Parameter | Description |
---|---|
--name | The ID to give to the new Hive agent. |
--jdbc-server-hostname | The JDBC server hostname for the Databricks cluster (AWS, Azure, or GCP). |
--jdbc-port | The port used for JDBC connections to the Databricks cluster (AWS, Azure, or GCP). |
--jdbc-http-path | The HTTP path for the Databricks cluster (AWS, Azure, or GCP). |
--access-token | The personal access token to be used for the Databricks cluster (AWS, Azure, or GCP). |
--file-system-id | Enter a name for the filesystem you want to associate with this agent (for example, myadls2 or mys3bucket ). This ensures any path mappings are correctly linked between the filesystem and the agent. |
--default-fs-override | Enter an override for the default filesystem URI instead of a filesystem name (for example, dbfs: ). |
Use either the --file-system-id or the --default-fs-override parameter, but not both.
To convert to Delta Lake format, use the --convert-to-delta parameter. Set the value of the --default-fs-override parameter to dbfs: or a path in the Databricks filesystem. For example, dbfs:/mount/externalStorage.
Optional parameters
Parameter | Description |
---|---|
--fs-mount-point | Define the location in the Databricks filesystem to mount the filesystem (like Azure Data Lake Storage Gen2, S3, or Google Cloud Storage) that contains the data you want to migrate. If you use the --convert-to-delta option, the data is converted and placed in the location on the Databricks filesystem defined by --default-fs-override . If you don't use the --convert-to-delta option, the data is migrated directly from the location defined by the --fs-mount-point . |
--convert-to-delta | All underlying table data and metadata is migrated to the filesystem location defined by the --fs-mount-point parameter. Use this option to convert the associated data and metadata to Delta Lake format in Databricks (AWS, Azure or GCP). |
--delete-after-conversion | Use this parameter if you used the parameter --convert-to-delta . --delete-after-conversion deletes the underlying table data and metadata from the filesystem location (defined by --fs-mount-point ) after it has been converted to Delta Lake in Databricks. Only use this option if you're performing one-time migrations for the underlying table data. The Databricks agent doesn't support continuous (live) updates of table data if you're converting to Delta Lake in Databricks. |
--catalog | Enter the name of your Databricks Unity Catalog. |
To ensure you see all of your migrated data in Databricks, set the value of --default-fs-override to dbfs:/path/ and replace /path/ with the value of the --fs-mount-point parameter.
Example: --default-fs-override dbfs:/mnt/mybucketname
Next steps
Create a metadata migration with the CLI using the Databricks target agent you just configured.