Databricks target prerequisites
When adding a Databricks metastore agent, choose either a Unity Catalog agent or a Workspace Hive Metastore (Legacy) agent. Review and complete the following prerequisites for any Databricks agent. If you're adding a Workspace Hive Metastore (Legacy) agent, also review the Workspace Hive Metastore (Legacy) prerequisites. If you're adding a Unity Catalog agent, also review the Unity Catalog prerequisites.
If you're upgrading from a Data Migrator version earlier than 2.5 and use Databricks agents, see this upgrade information.
Prerequisites for any Databricks agent
- Data formats
- Cluster and file system
- Install Databricks driver
- Concurrent thread configuration
Then review Unity Catalog prerequisites or Workspace Hive Metastore (Legacy) prerequisites depending on your agent choice.
Data formats
To ensure a successful migration to Databricks, the source tables must be in one of the following formats:
- CSV
- JSON
- AVRO
- ORC
- PARQUET
- Text
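As a pre-migration check, the format list above can be turned into a small filter. This is a minimal sketch, not part of Data Migrator: the function name and the input dictionary (table name mapped to its storage format) are hypothetical, and how you collect each table's format from your source metastore is left to you.

```python
# Formats listed above as supported for migration to Databricks.
SUPPORTED_FORMATS = {"CSV", "JSON", "AVRO", "ORC", "PARQUET", "TEXT"}


def unsupported_tables(table_formats):
    """Return the sorted names of tables whose format is not supported.

    table_formats is a hypothetical mapping of table name -> format string,
    gathered by whatever means you use to inspect the source metastore.
    """
    return sorted(
        name
        for name, fmt in table_formats.items()
        if fmt.strip().upper() not in SUPPORTED_FORMATS
    )


if __name__ == "__main__":
    tables = {"sales": "parquet", "events": "SequenceFile", "logs": "Text"}
    print(unsupported_tables(tables))
```

Tables returned by this check would need converting to one of the listed formats before migration.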
Cluster and file system
Ensure you have the following before you start:
- A Databricks cluster running Databricks Runtime 15.1 or later.
- A Databricks File System (DBFS)
Install Databricks driver
To install the JDBC driver:
- Download the Databricks JDBC driver.
  Data Migrator supports JDBC driver version 2.6.25 or higher.
- Unzip the package and upload the DatabricksJDBC42.jar file to the Data Migrator host machine.
- Move the DatabricksJDBC42.jar file to the Data Migrator directory: /opt/wandisco/hivemigrator/agent/databricks
- Change ownership of the jar file to the Hive Migrator system user and group:
  chown hive:hadoop /opt/wandisco/hivemigrator/agent/databricks/DatabricksJDBC42.jar
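After completing the installation steps, you may want to confirm that the jar is in place with the expected ownership. The following is a minimal sketch, assuming a POSIX host and the path, user, and group shown above; the function check_driver is hypothetical and is not a Data Migrator tool.

```python
from pathlib import Path

# Path and file name taken from the installation steps above.
AGENT_DIR = Path("/opt/wandisco/hivemigrator/agent/databricks")
JAR_NAME = "DatabricksJDBC42.jar"


def check_driver(agent_dir=AGENT_DIR, jar_name=JAR_NAME, owner="hive", group="hadoop"):
    """Return a list of problems found with the installed driver jar.

    An empty list means the jar exists and, where checked, has the expected
    owner and group. Pass owner=None or group=None to skip that check.
    """
    jar = Path(agent_dir) / jar_name
    if not jar.is_file():
        return [f"{jar} not found"]

    problems = []
    # Path.owner() and Path.group() are available on POSIX systems only.
    if owner is not None and jar.owner() != owner:
        problems.append(f"owner is {jar.owner()}, expected {owner}")
    if group is not None and jar.group() != group:
        problems.append(f"group is {jar.group()}, expected {group}")
    return problems


if __name__ == "__main__":
    for problem in check_driver() or ["driver jar looks correctly installed"]:
        print(problem)
```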
Concurrent thread configuration
Databricks limits concurrent queries per cluster to 10.
To optimize migration performance and control concurrency, set the hivemigrator.databricks.threadcount property to a value that suits your migration requirements and your Databricks environment.
See Hive Migrator Databricks concurrent thread properties for descriptions and default values.
If you need to increase concurrency beyond the default values, contact Support.
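When choosing a thread count, it can help to sanity-check the value against the per-cluster query limit stated above. The sketch below is illustrative only: clamp_thread_count is a hypothetical helper, and whether Data Migrator applies any such cap itself is not stated here.

```python
# Per-cluster concurrent query limit stated in this section.
DATABRICKS_QUERY_LIMIT = 10


def clamp_thread_count(requested, limit=DATABRICKS_QUERY_LIMIT):
    """Return a thread count no higher than the cluster's query limit.

    Hypothetical helper for planning a hivemigrator.databricks.threadcount
    value; it does not read or write any Data Migrator configuration.
    """
    if requested < 1:
        raise ValueError("thread count must be at least 1")
    return min(requested, limit)


if __name__ == "__main__":
    print(clamp_thread_count(4))
    print(clamp_thread_count(25))
```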
Unity Catalog prerequisites
External location created in Databricks
When adding a Databricks Unity Catalog agent, you'll need to create your cloud storage external location in Databricks before adding your agent. See the following links for more information on creating external locations in Databricks.
Workspace Hive Metastore (Legacy) prerequisites
Cloud filesystem mounted in Databricks
When adding a Databricks Workspace Hive Metastore (Legacy) agent, you'll need to mount your cloud filesystem in Databricks before adding your agent. See the following links for more information on mounting storage on Databricks.
- S3 buckets and What is the Databricks File System
- Google Cloud Storage and What is the Databricks File System
- Databricks mount object storage to DBFS
Example: Script to mount ADLS Gen2 or blob storage with Azure Blob File System
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}

# Optionally, you can add example-directory-name to the source URI of your mount point.
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)
Replace:
- <application-id> with the Application (client) ID for the Azure Active Directory application.
- <scope-name> with the Databricks secret scope name.
- <service-credential-key-name> with the name of the key containing the client secret.
- <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
- <container-name> with the name of a container in the ADLS Gen2 storage account.
- <storage-account-name> with the ADLS Gen2 storage account name.
- <mount-name> with the name of the intended mount point in DBFS.
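Running the mount script a second time against an existing mount point fails, so you may prefer to check first. The following is a minimal sketch, assuming it runs in a Databricks notebook where dbutils is available; mount_if_absent is a hypothetical helper that wraps the dbutils.fs.mounts() and dbutils.fs.mount() calls and takes dbutils as a parameter so it can be exercised outside a notebook.

```python
def mount_if_absent(dbutils, source, mount_point, extra_configs):
    """Mount source at mount_point unless that mount point already exists.

    Returns True if a new mount was created, False if one was already present.
    """
    # dbutils.fs.mounts() lists current mounts; each entry has a mountPoint.
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return True
```

In a notebook, you would call it with the same values as the script above, for example mount_if_absent(dbutils, "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/", "/mnt/<mount-name>", configs).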
Next steps
- Continue to add a Databricks target metastore agent.