Configure Databricks as a target
If you use Databricks as a target metastore with Data Migrator and have feedback to share, contact Support. This feature doesn't currently support migrating transactional (ACID) tables or INSERT_ONLY transactional tables, regardless of agent type or conversion options.
If you're upgrading from a Data Migrator version earlier than 2.5 and use Databricks agents, see the upgrade information.
Migrations that include Hive constraints are not supported.
Configure a Databricks target metastore agent
Use Data Migrator to integrate with Databricks and migrate structured data from Hadoop to Databricks tables, including converting automatically from source Hive formats to Delta Lake format used in Databricks.
Configure a Databricks Unity Catalog agent or a Databricks Workspace Hive Metastore agent in Data Migrator using the UI or CLI, and connect it to your Databricks cluster.
Prerequisites
Review and complete the prerequisites before attempting to add a Databricks metastore agent.
Databricks Unity Catalog agent
When you use the Unity Catalog metastore agent, migrated Delta tables are created as external tables in Databricks. Tables in other source formats are created as managed Delta tables, and their data is converted and copied into the table.
To add a Databricks Unity Catalog agent:

1. From the Dashboard, select an instance under Instances.
2. Under Filesystems & Agents, select Metastore Agents.
3. Select Connect to Metastore.
4. Select the filesystem.
5. Select Databricks (Unity Catalog) as the Metastore Type.
6. Enter a Display Name.
7. Enter the JDBC Server Hostname, Port, and HTTP Path.
8. Enter the Databricks Access Token.
   Note: You'll need to re-enter the access token when updating this agent.
9. Enter the name of your Databricks Unity Catalog under Catalog.
10. Under External Location, enter the full URI of your storage path from the external location configured in Databricks, appending it to the pre-populated URI.
    Example: abfss://file_system@account_name.dfs.core.windows.net/dir/subdir
11. Under Conversion, select Convert to Delta format (Optional) to convert tables to Delta Lake format and configure additional options.
12. Select Delete after conversion (Optional) to delete raw data after it has been converted to Delta format and migrated to Databricks.
    Info: Only use this option if you're performing one-time migrations for the underlying table data. The Databricks agent doesn't support continuous (live) updates of table data if the data is deleted after conversion.
13. Select Table Type to specify how converted tables are migrated. Choose Managed to convert Hive source tables to managed Delta tables, or External to convert them to external Delta tables. See the Databricks documentation on managed and external tables in Unity Catalog for Azure, AWS, and GCS for more information.
    If you select External, enter the full URI of the external location to store the tables converted to Delta Lake in the Converted data location field.
    Example: abfss://file_system@account_name.dfs.core.windows.net/dir/converted_to_delta
    Info: Source Delta tables are migrated as external tables regardless of the Table Type selection.
14. Select Save to add the metastore agent.
Migration of tables created with UNIONs to Unity Catalog with conversion may result in inconsistent data. See the Known Issue for more information.
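The JDBC Server Hostname, Port, and HTTP Path entered in the steps above are the same values a JDBC client would use to reach the Databricks cluster. As a rough sketch of how they fit together (the URL property layout follows the common Databricks JDBC driver format, and all hostnames, paths, and tokens below are placeholders, not values from this document):

```python
# Sketch: assemble the connection settings the Unity Catalog agent form asks
# for into a Databricks JDBC URL. All values are hypothetical placeholders.

def build_databricks_jdbc_url(hostname: str, port: int, http_path: str) -> str:
    """Build a Databricks JDBC URL from the values entered in the agent form.

    The property layout follows the common Databricks JDBC driver format;
    check your driver's documentation for the exact properties it expects.
    Authentication with a personal access token uses UID=token, with the
    token itself supplied separately (e.g. as the PWD property).
    """
    return (
        f"jdbc:databricks://{hostname}:{port};"
        f"transportMode=http;ssl=1;AuthMech=3;"
        f"httpPath={http_path};UID=token"
    )

url = build_databricks_jdbc_url(
    "adb-123.azuredatabricks.net", 443, "sql/protocolv1/o/8489/0234-127-example"
)
print(url)
```

This is only an illustration of how the form fields relate; Data Migrator builds the connection itself from the values you enter.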
Databricks Workspace Hive Metastore agent
To add a Databricks Workspace Hive Metastore - Legacy metastore agent:

1. From the Dashboard, select an instance under Instances.
2. Under Filesystems & Agents, select Metastore Agents.
3. Select Connect to Metastore.
4. Select the filesystem.
5. Select Databricks (Workspace Hive Metastore - Legacy) as the Metastore Type.
6. Enter a Display Name.
7. Enter the JDBC Server Hostname, Port, and HTTP Path.
8. Enter the Databricks Access Token.
   Note: You'll need to re-enter the access token when updating this agent.
9. Enter the Filesystem Mount Point: the mount point path of your cloud storage on DBFS (Databricks File System). The filesystem must already be mounted on DBFS. This mount point value is required for the migration process.
   Example: /mnt/adls2/storage_account/
10. In the Default Filesystem Override field, enter the DBFS table location in the format dbfs:<location> (with no trailing slash). If you intend to Convert to Delta format, enter the location on DBFS to store tables converted to Delta Lake. To store Delta Lake tables on cloud storage, enter the path to the mount point followed by the path on the cloud storage.
    Example (using conversion): dbfs:<converted_tables_path>
    Example (using conversion and cloud storage): dbfs:<value of Filesystem Mount Point>/<converted_tables_path>
    Example (not using conversion): dbfs:<value of Filesystem Mount Point>
11. Select Convert to Delta format to convert tables to Delta Lake format.
12. Select Delete after conversion (Optional) to delete raw data from the Filesystem Mount Point location after it has been converted to Delta Lake format and migrated to Databricks.
    Info: Only use this option if you're performing one-time migrations for the underlying table data. The Databricks agent doesn't support continuous (live) updates of table data if the data is deleted after conversion.
13. Select Save to add the metastore agent.
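The Default Filesystem Override patterns above can be sketched as a small helper that joins the Filesystem Mount Point and an optional converted-tables path in the dbfs:<location> format (no trailing slash). The paths are the illustrative ones from the examples, and the helper itself is hypothetical:

```python
# Sketch: derive the Default Filesystem Override value from the Filesystem
# Mount Point, following the dbfs:<location> format (no trailing slash).
# The function name and paths are illustrative, not part of the product.

def default_fs_override(mount_point: str, converted_tables_path: str = "") -> str:
    """Join the mount point (and optional converted-tables path) as dbfs:<location>."""
    base = "dbfs:" + mount_point.rstrip("/")
    if converted_tables_path:
        return base + "/" + converted_tables_path.strip("/")
    return base

# Not using conversion:
print(default_fs_override("/mnt/adls2/storage_account/"))
# prints: dbfs:/mnt/adls2/storage_account

# Using conversion with cloud storage:
print(default_fs_override("/mnt/adls2/storage_account/", "converted_to_delta"))
# prints: dbfs:/mnt/adls2/storage_account/converted_to_delta
```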
Next steps
If you have already added Metadata Rules, create a Metadata Migration. When you create your migration, you can also override the existing agent configuration if required.
Databricks caching can result in data not being visible on the target. Refresh the cache by issuing a REFRESH TABLE command on the target. See the Databricks documentation to learn more.
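A minimal sketch of issuing that command follows. The catalog, schema, and table names are hypothetical, and the commented connection code assumes the databricks-sql-connector package; substitute whatever SQL client you normally use against the target:

```python
# Sketch: build a REFRESH TABLE statement to clear Databricks' cache for a
# migrated table. The fully qualified name below is a placeholder.

table = "`cat1`.`mydb`.`table1`"  # hypothetical catalog.schema.table
statement = f"REFRESH TABLE {table}"

# With the databricks-sql-connector package (an assumption, not required by
# Data Migrator), this could be executed against the target roughly as:
#
#   from databricks import sql
#   with sql.connect(server_hostname="adb-123.azuredatabricks.net",
#                    http_path="sql/protocolv1/o/8489/0234-127-example",
#                    access_token="<token>") as conn:
#       with conn.cursor() as cursor:
#           cursor.execute(statement)

print(statement)
```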
Under certain conditions with a Databricks target, source truncate operations may take longer than expected. See the knowledge base article for more information.
If you're using Bulk Actions with Databricks migrations, see the following Known Issue.
Configure Databricks as a target with the CLI
Use the hive agent add databricks unity CLI command to add a Databricks Unity Catalog agent, or the hive agent add databricks legacy CLI command to add a Databricks Workspace Hive Metastore - Legacy agent.
Databricks Unity Catalog agent with the CLI
To add a Databricks Unity Catalog target with the CLI, use the hive agent add databricks unity command.
Examples
hive agent add databricks unity --name UnityExample --file-system-id FStarget1 --jdbc-server-hostname 123.azuredatabricks.net --jdbc-port 443 --jdbc-http-path sql/pro/o/2517/0417-19-example --access-token actoken123 --catalog cat1 --external-location abfss://container@account.dfs.core.windows.net --convert-to-delta TRUE --table-type EXTERNAL --converted-data-location abfss://container@account.dfs.core.windows.net/converted
hive agent add databricks unity --name UnityExample --file-system-id FStarget1 --jdbc-server-hostname 123.azuredatabricks.net --jdbc-port 443 --jdbc-http-path sql/pro/o/2517/0417-19-example --access-token actoken123 --catalog cat1 --external-location abfss://container@account.dfs.core.windows.net --convert-to-delta TRUE --table-type MANAGED
Databricks Workspace Hive Metastore target with the CLI
To add a Databricks Workspace Hive Metastore - Legacy agent with the CLI, use the hive agent add databricks legacy command.
hive agent add databricks legacy --name LegacyExample --file-system-id fstarget --jdbc-server-hostname adb123.azuredatabricks.net --jdbc-port 443 --jdbc-http-path sql/protocolv1/o/8489/0234-127-example --access-token 123 --fs-mount-point /mnt/adls2/storage_account/ --convert-to-delta FALSE --default-fs-override dbfs:/mnt/adls2/storage_account
To ensure you see all of your migrated data in Databricks, set the value of --default-fs-override to dbfs:/path/ and replace /path/ with the value from the --fs-mount-point parameter.
Example: --default-fs-override dbfs:/mnt/adls2/storage_account
Adjust Databricks target configuration with the CLI
After adding your Databricks agent, you can adjust its configuration using the hive agent configure databricks unity CLI command for a Unity Catalog agent, or the hive agent configure databricks legacy CLI command for a Workspace Hive Metastore - Legacy agent.
After you've added a Databricks agent, you can still override its configuration when adding a new migration with the hive migration add databricks unity catalog CLI command for Unity Catalog agents, or the hive migration add databricks legacy CLI command for Workspace Hive Metastore - Legacy agents.
Next steps
Add metadata rules with the hive rule add CLI command to define the scope, then create a metadata migration with hive migration add.
You can also create migrations and override existing Databricks agent properties with the hive migration add databricks unity and hive migration add databricks legacy CLI commands.
Unity Catalog without conversion partition considerations
If you're migrating without conversion to Unity Catalog, be aware of the following Databricks Unity Catalog limitations.
Custom partition schemes
Custom partition schemes created using commands like ALTER TABLE ADD PARTITION are not supported for tables in Unity Catalog. See the Databricks Unity Catalog limitations for more information.
When you migrate a table (without conversion to Delta Lake format) with a custom partition scheme, Unity Catalog will be unable to query this table.
Unity Catalog can access tables that use directory-style partitioning, so adjusting the target data to directory-style partitioning allows Unity Catalog to recognize the partitions.
You can optionally use Data Migrator to adjust the target data location during migration with path mapping. Mapping each source partition to a suitable target location provides directory-style partitioning on the target that Unity Catalog can recognize.
Example path mapping:
Source path: /custom/partition/folder/path
Target path: /warehouse/tablespace/external/hive/mydb.db/table1/id2=2
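The example mapping above can be sketched as a simple lookup from a custom-partition source folder to a Hive-style key=value target directory. The mapping table and function are hypothetical; in practice you define path mappings in Data Migrator itself:

```python
# Sketch: the kind of source-to-target path mapping that yields the
# directory-style (key=value) partition layout Unity Catalog recognizes.
# The dictionary and paths mirror the example above and are illustrative.

path_mappings = {
    "/custom/partition/folder/path":
        "/warehouse/tablespace/external/hive/mydb.db/table1/id2=2",
}

def map_target_path(source_path: str) -> str:
    """Return the mapped target path, or the source path unchanged if no mapping applies."""
    return path_mappings.get(source_path, source_path)

print(map_target_path("/custom/partition/folder/path"))
# prints: /warehouse/tablespace/external/hive/mydb.db/table1/id2=2
```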
Drop partition
Dropping partitions with commands like ALTER TABLE DROP PARTITION is not supported in Unity Catalog. Hivemigrator will not migrate these events, resulting in inconsistent query results between source and target.
Databricks relies on the state of the table data to determine which partitions are present. Since there's no change to the table data, Databricks still perceives the partition to be present.
To drop a partition fully, the table data must be dropped or moved on the target. You can do this by dropping or moving the data on the source and relying on Data Migrator to migrate the change, or by dropping or moving the data directly on the target.
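Because Databricks infers partitions from the data layout, fully dropping a partition comes down to removing that partition's data directory. A minimal local sketch (the directory path is hypothetical; in practice you would remove the data on the source and let Data Migrator migrate the change, or remove it on the target storage):

```python
# Sketch: removing a partition's data directory so that directory-style
# partition discovery no longer sees it. Uses a throwaway local directory
# as a stand-in for the real table storage.
import shutil
from pathlib import Path

partition_dir = Path("/tmp/example_table/id2=2")  # hypothetical partition dir
partition_dir.mkdir(parents=True, exist_ok=True)  # stand-in for existing data

shutil.rmtree(partition_dir)  # drop the partition's data
print(partition_dir.exists())
# prints: False
```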
Subsequent Add partition
Adding new partitions with ADD PARTITION is not supported in Unity Catalog.
An initial migration creates the table and any corresponding partitions with the data on the target (at the time of the metadata migration).
If subsequent partitions are migrated, Hivemigrator won't attempt an ADD PARTITION event on Databricks, and querying the target table will produce unchanged results.
However, running REFRESH TABLE triggers Databricks to discover new directory-style partitions in the underlying data, and queries will then produce up-to-date results (at least until new partitions are added).