Databricks target prerequisites
When adding a Databricks Metastore Agent, choose either a Unity Catalog agent or a Workspace Hive Metastore (Legacy) agent. Review and complete the following prerequisites for any Databricks agent. If you're adding a Workspace Hive Metastore (Legacy) agent, also review the Workspace Hive Metastore (Legacy) prerequisites. If you're adding a Unity Catalog agent, also review the Unity Catalog prerequisites.
If you're upgrading from a Data Migrator version earlier than 2.5 and you use Databricks agents, see this upgrade information.
Prerequisites for any Databricks agent
- Data formats
- Databricks cluster and file system
- Install Databricks driver
- Concurrent thread configuration
Then review Unity Catalog prerequisites or Workspace Hive Metastore (Legacy) prerequisites depending on your agent choice.
Data formats
To ensure a successful migration to Databricks, the source tables must be in one of the following formats:
- ORC
- PARQUET
- Text
- JSON
- CSV (Struct and other complex Hive types aren't supported in CSV)
- AVRO (limited support)
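If you're unsure which format a source table uses, one way to check is to inspect its definition in Hive. The following is a minimal sketch assuming Beeline access to HiveServer2; the connection URL, database, and table names are placeholders.
# Minimal sketch: the connection URL, database, and table names are placeholders.
# The SerDe Library and InputFormat values in the output show the table's storage format.
beeline -u "jdbc:hive2://<hiveserver2-host>:10000" -e "DESCRIBE FORMATTED <database>.<table>;"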
Databricks cluster and file system
Ensure you have the following before you start:
If you are using Legacy Databricks:
- A Databricks cluster (minimum Databricks Runtime 15.1) or a Databricks SQL warehouse.
- A Databricks File System (DBFS).
If you are using Databricks Unity Catalog:
- A Databricks SQL warehouse with an appropriate cluster size and autoscaling enabled, based on your migration requirements. Databricks clusters and other Databricks compute resources aren't supported.
- A Unity Catalog configuration that manages access to all data, as recommended by Databricks.
Install Databricks driver
To install the JDBC driver:
- Download the Databricks JDBC driver. Data Migrator supports JDBC driver version 2.7.1 or higher.
- Unzip the package and upload the DatabricksJDBC42.jar file to the Data Migrator host machine.
- Move the DatabricksJDBC42.jar file to the Data Migrator directory: /opt/wandisco/hivemigrator/agent/databricks
- Change ownership of the jar file to the Hive Migrator system user and group:
chown hive:hadoop /opt/wandisco/hivemigrator/agent/databricks/DatabricksJDBC42.jar
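As a minimal sketch of the last two steps, assuming the jar was uploaded to /tmp on the Data Migrator host (the upload location is an assumption; adjust it to where you placed the file):
# Minimal sketch, run on the Data Migrator host; /tmp is an assumed upload location.
mv /tmp/DatabricksJDBC42.jar /opt/wandisco/hivemigrator/agent/databricks/
chown hive:hadoop /opt/wandisco/hivemigrator/agent/databricks/DatabricksJDBC42.jar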
Concurrent thread configuration
Databricks cluster sizing, queuing, and autoscaling:
- Cluster sizing (the selected compute resource) is crucial to migration performance; query complexity and the number of concurrent queries are the key factors.
- Cluster upscaling is based on query throughput, the rate of incoming queries received from Hive Migrator (HVM), and the queue size. Start with a larger size and scale down as needed.
- HVM runs multiple queries at a time, so add more clusters for autoscaling. Databricks adds clusters based on the time it would take to process all currently running, queued, and incoming queries.
- Databricks limits the number of queries that run concurrently on a cluster assigned to a SQL warehouse. To optimize migration performance and control concurrency, set hivemigrator.databricks.threadcount to align with your migration requirements and your specific Databricks environment (see the example after this list).
- Queue sizing: to optimize migration performance, review the cluster size and set hivemigrator.databricks.copyintobatchsize based on your migration requirements.
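For example, a minimal sketch of setting these properties on the Data Migrator host follows. The configuration file path, service name, and values shown are assumptions; choose values that match your warehouse size and migration requirements.
# Hypothetical example: the file path, service name, and values are assumptions.
# Append the Databricks concurrency properties to the Hive Migrator configuration,
# then restart the service so the new values take effect.
echo "hivemigrator.databricks.threadcount=8" >> /etc/wandisco/hivemigrator/application.properties
echo "hivemigrator.databricks.copyintobatchsize=100" >> /etc/wandisco/hivemigrator/application.properties
systemctl restart hivemigrator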
Use the following Databricks tools to monitor and evaluate cluster performance:
- Monitoring page: review the peak query count.
- Query history and query profiles: look for errors and for bytes spilled to disk above 1.
See Hive Migrator Databricks concurrent thread properties for descriptions and default values.
For more information on how SQL warehouses are sized and how autoscaling works, see the Databricks documentation.
Unity Catalog prerequisites
External location created in Databricks
When adding a Databricks Unity Catalog agent, you'll need to create your cloud storage external location in Databricks before adding your agent. See the Databricks documentation for more information on creating external locations.
Workspace Hive Metastore (Legacy) prerequisites
Cloud filesystem mounted in Databricks
When adding a Databricks Workspace Hive Metastore Legacy agent, you'll need to mount your cloud filesystem in Databricks before adding your agent. See the following links for more information on mounting storage on Databricks.
- S3 buckets and What is the Databricks File System
- Google Cloud Storage and What is the Databricks File System
- Databricks mount object storage to DBFS
Example: Script to mount ADLS Gen2 or blob storage with Azure Blob File System
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add example-directory-name to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
Replace:
- <application-id> with the Application (client) ID for the Azure Active Directory application.
- <scope-name> with the Databricks secret scope name.
- <service-credential-key-name> with the name of the key containing the client secret.
- <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
- <container-name> with the name of a container in the ADLS Gen2 storage account.
- <storage-account-name> with the ADLS Gen2 storage account name.
- <mount-name> with the name of the intended mount point in DBFS.
Next steps
- Continue to add a Databricks target metastore agent.