Version: 3.0 (latest)

Databricks target prerequisites

When adding a Databricks Metastore Agent, choose either a Unity Catalog agent or a Workspace Hive Metastore (Legacy) agent. Review and complete the prerequisites for any Databricks agent below. If you're adding a Workspace Hive Metastore (Legacy) agent, also review the Workspace Hive Metastore (Legacy) agent prerequisites. If you're adding a Unity Catalog agent, also review the Unity Catalog prerequisites.

caution

If you use Databricks agents and are upgrading from a Data Migrator version earlier than 2.5, see this upgrade information.

Prerequisites for any Databricks agent

Complete the prerequisites in this section for every Databricks agent, then review the Unity Catalog prerequisites or the Workspace Hive Metastore (Legacy) prerequisites, depending on your agent choice.

Data formats

To ensure a successful migration to Databricks, the source tables must be in one of the following formats:

  • CSV
  • JSON
  • AVRO
  • ORC
  • PARQUET
  • Text

Cluster and file system

Ensure you have the following before you start:

Install Databricks driver

To install the JDBC driver, complete the following steps (a combined command-line sketch follows the list):

  1. Download the Databricks JDBC driver.
note

Data Migrator supports JDBC driver version 2.6.25 or higher.

  2. Unzip the package and upload the DatabricksJDBC42.jar file to the Data Migrator host machine.

  3. Move the DatabricksJDBC42.jar file to the Data Migrator directory:

    /opt/wandisco/hivemigrator/agent/databricks
  4. Change ownership of the jar file to the Hive Migrator system user and group:

    chown hive:hadoop /opt/wandisco/hivemigrator/agent/databricks/DatabricksJDBC42.jar
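
The following is a minimal shell sketch of steps 2 to 4, run on the Data Migrator host. The archive path and its layout are assumptions (the driver jar is expected at the top level of the downloaded zip), so adjust them to match your download.

# <path-to-driver-zip> is a placeholder for the JDBC driver archive downloaded from Databricks.
unzip <path-to-driver-zip> -d /tmp/databricks-jdbc

# Move the driver jar into the Hive Migrator Databricks agent directory.
mv /tmp/databricks-jdbc/DatabricksJDBC42.jar /opt/wandisco/hivemigrator/agent/databricks/

# Give ownership of the jar to the Hive Migrator system user and group.
chown hive:hadoop /opt/wandisco/hivemigrator/agent/databricks/DatabricksJDBC42.jar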

Concurrent thread configuration

"Databricks limits concurrent queries per cluster to 10". To optimize migration performance and control concurrency, set the hivemigrator.databricks.threadcount based on your migration requirements to align with your specific Databricks environment.

See Hive Migrator Databricks concurrent thread properties for descriptions and default values.

If you need to increase concurrency beyond the current default values, contact Support.
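
As an illustration, the property can be added to the Hive Migrator configuration from a shell on the Data Migrator host. The configuration file path and service name below are assumptions based on a default installation, and the value 5 is only an example that stays within the 10 concurrent query limit; check the linked property descriptions before changing anything.

# Assumed configuration file location; verify it on your installation.
echo "hivemigrator.databricks.threadcount=5" >> /etc/wandisco/hivemigrator/application.properties

# Assumed service name; restart Hive Migrator so the new value takes effect.
systemctl restart hivemigrator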

Unity Catalog prerequisites

External location created in Databricks

When adding a Databricks Unity Catalog agent, you'll need to create your cloud storage external location in Databricks before adding your agent. See the following links for more information on creating external locations in Databricks.

  • Azure and Databricks
  • AWS and Databricks
  • GCP and Databricks
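
If you prefer working from the command line, a recent Databricks CLI exposes an external-locations command group that can create the external location. The sketch below is only illustrative: the location name, storage URL, and storage credential name are placeholders, the storage credential must already exist, and the exact argument order can vary between CLI versions, so confirm it with databricks external-locations create --help.

# Illustrative only: create an external location pointing at the migration target path.
# <location-name>, <storage-url>, and <credential-name> are placeholders.
databricks external-locations create <location-name> <storage-url> <credential-name>

# Confirm the external location exists.
databricks external-locations list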

Workspace Hive Metastore (Legacy) prerequisites

Cloud filesystem mounted in Databricks

When adding a Databricks Workspace Hive Metastore (Legacy) agent, you'll need to mount your cloud filesystem in Databricks before adding your agent. See the following links for more information on mounting storage on Databricks.

  • Azure blob storage
  • S3 buckets and What is the Databricks File System
  • Google Cloud Storage and What is the Databricks File System

Databricks mount object storage to DBFS

Example: Script to mount ADLS Gen2 or blob storage with Azure Blob File System

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"
}

# Optionally, you can add example-directory-name to the source URI of your mount point.
dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)

Replace:

  • <application-id> with the Application (client) ID for the Azure Active Directory application.
  • <scope-name> with the Databricks secret scope name.
  • <service-credential-key-name> with the name of the key containing the client secret.
  • <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
  • <container-name> with the name of a container in the ADLS Gen2 storage account.
  • <storage-account-name> with the ADLS Gen2 storage account name.
  • <mount-name> with the name of the intended mount point in DBFS.
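
After running the mount script, you can confirm the mount is accessible before adding the agent. This check is a suggestion and assumes the Databricks CLI is installed and configured for the target workspace; you can equally run dbutils.fs.ls("/mnt/<mount-name>") in a notebook.

# List the contents of the new mount point to confirm it is accessible.
databricks fs ls dbfs:/mnt/<mount-name>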

Next steps