Version: 3.2 (latest)

Databricks target prerequisites

When adding a Databricks Metastore Agent, choose either a Unity Catalog agent or a Workspace Hive Metastore (Legacy) agent. Review and complete the following prerequisites for any Databricks agent. If you're adding a Workspace Hive Metastore (Legacy) agent, also review the Workspace Hive Metastore (Legacy) prerequisites. If you're adding a Unity Catalog agent, also review the Unity Catalog prerequisites.

caution

If you use Databricks agents and are upgrading from a Data Migrator version earlier than 2.5, see this upgrade information.

Prerequisites for any Databricks agent

After completing the prerequisites in this section, review the Unity Catalog prerequisites or the Workspace Hive Metastore (Legacy) prerequisites, depending on your agent choice.

Data formats

To ensure a successful migration to Databricks, the source tables must be in one of the following formats:

  • ORC
  • PARQUET
  • Text
  • JSON
  • CSV (Struct and other complex Hive types are not supported in CSV)
  • AVRO (Limited support)
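
As a pre-flight check, the format list above can be encoded in a small lookup. This is a minimal sketch and not part of Data Migrator: the names SUPPORTED_FORMATS, CAVEATS, and check_format are hypothetical, and how you obtain each source table's format is left to your environment.

```python
# Formats listed above as supported for migration to Databricks.
SUPPORTED_FORMATS = {"ORC", "PARQUET", "TEXT", "JSON", "CSV", "AVRO"}

# Caveats that the list above attaches to individual formats.
CAVEATS = {
    "CSV": "Struct and other complex Hive types are not supported in CSV",
    "AVRO": "limited support",
}


def check_format(table_format: str) -> tuple[bool, str]:
    """Return (supported, note) for a source table format name."""
    key = table_format.strip().upper()
    if key not in SUPPORTED_FORMATS:
        return False, f"{table_format!r} is not in the supported list"
    return True, CAVEATS.get(key, "")
```

For example, check_format("parquet") reports the format as supported with no note, while check_format("avro") also returns the limited-support caveat.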

Cluster and file system

Ensure you have the following before you start:

If you are using Legacy Databricks:

If you are using Databricks Unity Catalog:

  • A Databricks SQL warehouse with an appropriate cluster size and autoscaling enabled (based on migration requirements). Databricks clusters and other compute resources in Databricks are not supported.
  • A Unity Catalog configuration to manage access to all data, as recommended by Databricks.

Install Databricks driver

To install the JDBC driver:

  1. Download the Databricks JDBC driver.
note

Data Migrator supports JDBC driver version 2.7.1 or higher.

  2. Unzip the package and upload the DatabricksJDBC42.jar file to the Data Migrator host machine.

  3. Move the DatabricksJDBC42.jar file to the Data Migrator directory:

    /opt/wandisco/hivemigrator/agent/databricks
  4. Change ownership of the jar file to the Hive Migrator system user and group:

    chown hive:hadoop /opt/wandisco/hivemigrator/agent/databricks/DatabricksJDBC42.jar
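
The result of the steps above can be double-checked with a short script. This is a minimal sketch, assuming a Unix-like host; the function name check_jar is hypothetical and not part of Data Migrator. The path, user, and group are taken from the commands above.

```python
import grp
import os
import pwd

JAR_PATH = "/opt/wandisco/hivemigrator/agent/databricks/DatabricksJDBC42.jar"


def check_jar(path=JAR_PATH, owner="hive", group="hadoop"):
    """Return a list of problems found with the installed driver jar."""
    if not os.path.isfile(path):
        return [f"{path} does not exist or is not a regular file"]

    st = os.stat(path)
    # Fall back to the numeric ID when the host has no matching account entry.
    try:
        actual_owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        actual_owner = str(st.st_uid)
    try:
        actual_group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        actual_group = str(st.st_gid)

    problems = []
    if actual_owner != owner:
        problems.append(f"owner is {actual_owner}, expected {owner}")
    if actual_group != group:
        problems.append(f"group is {actual_group}, expected {group}")
    return problems
```

An empty list means the jar is in place with the expected ownership. Any other result lists what to fix before adding the agent.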

Concurrent thread configuration

Databricks cluster sizing, queuing, and autoscaling:

  • Cluster sizing (the selected compute resource) is crucial to migration performance, with query complexity and the number of concurrent queries being key factors.
  • Upscaling of clusters is based on query throughput, the rate of incoming queries received from Hive Migrator (HVM), and the queue size. The recommendation is to start with a larger size and downsize as needed.
  • HVM runs multiple queries at a time, so the recommendation is to add more clusters for autoscaling. Databricks adds clusters based on the time it would take to process all currently running queries, queued queries, and incoming queries.
  • Databricks limits the number of queries on a cluster assigned to a SQL warehouse. To optimize migration performance and control concurrency, set the hivemigrator.databricks.threadcount property based on your migration requirements and your specific Databricks environment.
  • Queue sizing: to optimize migration performance, review the cluster size and set the hivemigrator.databricks.copyintobatchsize property based on your migration requirements.
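
One way to reason about a starting value for hivemigrator.databricks.threadcount is to derive it from the warehouse's maximum cluster count. The following is a rough sketch only, not a documented Data Migrator formula: the function name, the default of 10 concurrent queries per cluster, and the 80 percent headroom are all assumptions that you should check against your own Databricks environment and monitoring data.

```python
def suggest_thread_count(max_clusters, queries_per_cluster=10, headroom_percent=80):
    """Suggest a starting thread count for a SQL warehouse.

    queries_per_cluster and headroom_percent are assumptions, not values
    taken from Data Migrator or guaranteed by Databricks.
    """
    if max_clusters < 1:
        raise ValueError("max_clusters must be at least 1")
    if not 0 < headroom_percent <= 100:
        raise ValueError("headroom_percent must be in the range 1-100")
    # Integer arithmetic keeps the result exact and rounds down.
    capacity = max_clusters * queries_per_cluster
    return max(1, capacity * headroom_percent // 100)
```

Under these assumptions, a warehouse that can scale to 4 clusters yields a suggested starting value of 32. Treat the result as a first guess to refine while watching peak query counts and queueing.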

Use the following Databricks tools to monitor and evaluate cluster performance:

  • Monitoring page: review the peak query count.
  • Query history and profiles: look for errors and for bytes spilled to disk above 1.

See Hive Migrator Databricks concurrent thread properties for descriptions and default values.

For more information on how warehouses are sized and how autoscaling works, see here.

Unity Catalog prerequisites

External location created in Databricks

When adding a Databricks Unity Catalog agent, you'll need to create your cloud storage external location in Databricks before adding your agent. See the following links for more information on creating external locations in Databricks.

Azure and Databricks

AWS and Databricks

GCP and Databricks.
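
If you prefer SQL to the Databricks UI, an external location can also be created with a CREATE EXTERNAL LOCATION statement. The sketch below only builds the statement text. The function name and all example values are hypothetical, and it assumes a storage credential already exists in your metastore. Run the resulting statement in a context with the required privileges.

```python
def external_location_sql(name, url, storage_credential):
    """Build a CREATE EXTERNAL LOCATION statement for Unity Catalog."""
    # Reject characters that would break the simple quoting used below.
    for identifier in (name, storage_credential):
        if "`" in identifier:
            raise ValueError(f"backtick not allowed in identifier: {identifier!r}")
    if "'" in url:
        raise ValueError("single quote not allowed in URL")
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{name}` "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL `{storage_credential}`)"
    )
```

For example, external_location_sql("migration_loc", "s3://example-bucket/data", "migration_cred") produces a statement that points the location at that bucket path using the named credential.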

Workspace Hive Metastore (Legacy) prerequisites

Cloud filesystem mounted in Databricks

When adding a Databricks Workspace Hive Metastore (Legacy) agent, you'll need to mount your cloud filesystem in Databricks before adding your agent. See the following links for more information on mounting storage on Databricks.

Azure blob storage.

S3 buckets and What is the Databricks File System.

Google Cloud Storage and What is the Databricks File System.

Databricks mount object storage to DBFS

Example: Script to mount ADLS Gen2 or blob storage with Azure Blob File System

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}

# Optionally, you can add example-directory-name to the source URI of your mount point.
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)

Replace:

  • <application-id> with the Application (client) ID for the Azure Active Directory application.
  • <scope-name> with the Databricks secret scope name.
  • <service-credential-key-name> with the name of the key containing the client secret.
  • <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
  • <container-name> with the name of a container in the ADLS Gen2 storage account.
  • <storage-account-name> with the ADLS Gen2 storage account name.
  • <mount-name> with the name of the intended mount point in DBFS.
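
Running the mount script a second time against an existing mount point fails, so it can help to check first. The following is a minimal sketch, assuming it runs in a Databricks notebook where dbutils is available; the helper name mount_if_absent is hypothetical. It relies only on dbutils.fs.mounts() and dbutils.fs.mount().

```python
def mount_if_absent(dbutils, source, mount_point, extra_configs):
    """Mount source at mount_point unless that mount point already exists.

    Returns True if a new mount was created, False if it was already there.
    """
    for existing in dbutils.fs.mounts():
        if existing.mountPoint == mount_point:
            return False
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return True
```

In a notebook, pass the notebook's dbutils object together with the source URI, the mount point, and the configs dictionary from the script above.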

Next steps