Version: 2.5.4

Configure Databricks as a target

note

If you use Databricks as a target metastore with Data Migrator and have feedback to share, contact Support. This feature does not currently support migrating transactional ACID tables or INSERT_ONLY transactional tables, regardless of agent type or conversion options.

caution

If you use Databricks agents and are upgrading from a Data Migrator version earlier than 2.5, see this upgrade information.

Constraints

Migrations that include Hive constraints are not supported.

Configure a Databricks target metastore agent

Use Data Migrator to integrate with Databricks and migrate structured data from Hadoop to Databricks tables, including automatic conversion from source Hive formats to the Delta Lake format used in Databricks.

Configure a Databricks Unity Catalog agent or a Databricks Workspace Hive Metastore agent in Data Migrator using the UI or CLI, and connect it to your Databricks cluster.

Prerequisites

caution

Review and complete the prerequisites section linked here before attempting to add a Databricks Metastore Agent.

Databricks Unity Catalog agent

note

When you use the Unity Catalog metastore agent, migrated Delta tables are created as external tables in Databricks. Tables in other source formats are created as managed Delta tables, and their data is converted and copied into the table.

To add a Databricks Unity Catalog agent:

  1. From the Dashboard, select an instance under Instances.

  2. Under Filesystems & Agents, select Metastore Agents.

  3. Select Connect to Metastore.

  4. Select the filesystem.

  5. Select Databricks (Unity Catalog) as the Metastore Type.

  6. Enter a Display Name.

  7. Enter the JDBC Server Hostname, Port, and HTTP Path.

  8. Enter the Databricks Access Token.

    note

    You’ll need to re-enter the access token when updating this agent.

  9. Enter the name of your Databricks Unity Catalog under Catalog.

  10. Under External Location, append to the pre-populated URI so that it contains the full URI of your storage path from the external location configured in Databricks.

    info

    Ensure the external location you specify has already been created in Databricks. Learn more from Azure, AWS and GCP.

    Example: External location.
    abfss://file_system@account_name.dfs.core.windows.net/dir/subdir
  11. Under Conversion, select Convert to Delta format (Optional) to convert tables to Delta Lake format and configure additional options.

    1. Select Delete after conversion (Optional) to delete raw data after it has been converted to Delta format and migrated to Databricks.

      info

      Only use this option if you're performing one-time migrations for the underlying table data. The Databricks agent doesn't support continuous (live) updates of table data if the data is deleted after conversion.

    2. Select Table Type to specify how converted tables are migrated. Choose Managed to convert Hive source tables to managed Delta tables, or External to convert Hive source tables to external Delta tables. See the following links for more information on the concept of managed and external tables in Unity Catalog for Azure, AWS and GCP.

      1. If you select External, enter the full URI of the external location to store the tables converted to Delta Lake in the Converted data location field.

        Example: Converted data location
        abfss://file_system@account_name.dfs.core.windows.net/dir/converted_to_delta
      info

      Source Delta tables are migrated as external tables regardless of the Table Type selection.

  12. Select Save to add the Metastore Agent.

note

Migrating tables created with UNIONs to Unity Catalog with conversion may result in inconsistent data. See the Known Issue for more information.

Databricks Workspace Hive Metastore agent

To add a Databricks Workspace Hive Metastore - Legacy metastore agent:

  1. From the Dashboard, select an instance under Instances.

  2. Under Filesystems & Agents, select Metastore Agents.

  3. Select Connect to Metastore.

  4. Select the filesystem.

  5. Select Databricks (Workspace Hive Metastore - Legacy) as the Metastore Type.

  6. Enter a Display Name.

  7. Enter the JDBC Server Hostname, Port, and HTTP Path.

  8. Enter the Databricks Access Token.

    note

    You’ll need to re-enter the access token when updating this agent.

  9. In the Filesystem Mount Point field, enter the mount point path of your cloud storage on DBFS (Databricks File System). The filesystem must already be mounted on DBFS. This value is required for the migration process.

    Example: Mounted container's path
    /mnt/adls2/storage_account/
    info

    Learn more on mounting storage on Databricks for ADLS/S3/GCP filesystems.

  10. In the Default Filesystem Override field, enter the DBFS table location value in the format dbfs:<location> (with no trailing slash). If you intend to Convert to Delta format, enter the location on DBFS to store tables converted to Delta Lake. To store Delta Lake tables on cloud storage, enter the path to the mount point and the path on the cloud storage.

    Example: Using conversion
    dbfs:<converted_tables_path>
    Example: Using conversion and cloud storage
    dbfs:<value of Filesystem Mount Point>/<converted_tables_path>
    Example: Not using conversion
    dbfs:<value of Filesystem Mount Point>
  11. Select Convert to Delta format to convert tables to Delta Lake format.

    1. Select Delete after conversion (Optional) to delete raw data from the Filesystem Mount Point location after it has been converted to Delta Lake format and migrated to Databricks.
      info

      Only use this option if you're performing one-time migrations for the underlying table data. The Databricks agent doesn't support continuous (live) updates of table data if the data is deleted after conversion.

  12. Select Save to add the Metastore Agent.
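
The three Default Filesystem Override patterns in step 10 can be sketched as a small Python helper. This is only an illustration of the documented `dbfs:<location>` format (with no trailing slash), not part of Data Migrator. The function name `default_fs_override` and its parameters are hypothetical.

```python
from typing import Optional


def default_fs_override(mount_point: Optional[str] = None,
                        converted_tables_path: Optional[str] = None) -> str:
    """Build a dbfs:<location> value with no trailing slash (hypothetical helper).

    mount_point: value of the Filesystem Mount Point field, if tables live on
        mounted cloud storage.
    converted_tables_path: location for tables converted to Delta Lake, if any.
    """
    parts = []
    for part in (mount_point, converted_tables_path):
        if part:
            stripped = part.strip("/")  # drop leading and trailing slashes
            if stripped:
                parts.append(stripped)
    if not parts:
        raise ValueError("provide a mount point, a converted tables path, or both")
    return "dbfs:/" + "/".join(parts)


# Not using conversion: the override is the mount point itself.
print(default_fs_override("/mnt/adls2/storage_account/"))
# Using conversion and cloud storage: mount point plus converted tables path.
print(default_fs_override("/mnt/adls2/storage_account/", "converted"))
```

Stripping the slashes before joining keeps the value free of a trailing slash regardless of how the mount point was typed.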

Next steps

If you have already added Metadata Rules, create a Metadata Migration. When you create your migration, you can also override the existing agent configuration if required.

tip

Databricks caching can result in data not being visible on the target. Refresh the cache by issuing a REFRESH TABLE command on the target. See the Databricks guide here to learn more.

info

Under certain conditions with a Databricks target, source truncate operations may take longer than expected. See the following Knowledge base article for more information.

Configure Databricks as a target with the CLI

Use the hive agent add databricks unity CLI command to add a Databricks Unity Catalog agent or the hive agent add databricks legacy CLI command to add a Databricks Workspace Hive Metastore - Legacy agent.

Databricks Unity Catalog agent with the CLI

To add a Databricks Unity Catalog target with the CLI, use the hive agent add databricks unity command.

Examples

Example for Unity Catalog Databricks agent with external table type
hive agent add databricks unity --name UnityExample --file-system-id FStarget1 --jdbc-server-hostname 123.azuredatabricks.net --jdbc-port 443 --jdbc-http-path sql/pro/o/2517/0417-19-example --access-token actoken123 --catalog cat1 --external-location abfss://container@account.dfs.core.windows.net --convert-to-delta TRUE --table-type EXTERNAL --converted-data-location abfss://container@account.dfs.core.windows.net/converted
Example for Unity Catalog Databricks agent with managed table type
hive agent add databricks unity --name UnityExample --file-system-id FStarget1 --jdbc-server-hostname 123.azuredatabricks.net --jdbc-port 443 --jdbc-http-path sql/pro/o/2517/0417-19-example --access-token actoken123 --catalog cat1 --external-location abfss://container@account.dfs.core.windows.net --convert-to-delta TRUE --table-type MANAGED
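
If you generate these commands from a script, the difference between the two examples can be captured in a short Python sketch. It uses only the flags shown in the examples above. The function `unity_agent_command` and its parameter names are hypothetical. It assumes conversion is enabled, and it assumes `--converted-data-location` is passed only for the EXTERNAL table type, as in the examples.

```python
from typing import Optional

TABLE_TYPES = ("MANAGED", "EXTERNAL")


def unity_agent_command(name: str, file_system_id: str, hostname: str, port: int,
                        http_path: str, access_token: str, catalog: str,
                        external_location: str, table_type: str,
                        converted_data_location: Optional[str] = None) -> str:
    """Assemble a `hive agent add databricks unity` command with conversion enabled."""
    if table_type not in TABLE_TYPES:
        raise ValueError(f"table_type must be one of {TABLE_TYPES}")
    if table_type == "EXTERNAL" and not converted_data_location:
        raise ValueError("EXTERNAL tables need a converted data location")

    options = [
        ("--name", name),
        ("--file-system-id", file_system_id),
        ("--jdbc-server-hostname", hostname),
        ("--jdbc-port", port),
        ("--jdbc-http-path", http_path),
        ("--access-token", access_token),
        ("--catalog", catalog),
        ("--external-location", external_location),
        ("--convert-to-delta", "TRUE"),
        ("--table-type", table_type),
    ]
    # Assumption: the converted data location applies to EXTERNAL tables only,
    # so it is ignored for MANAGED.
    if table_type == "EXTERNAL":
        options.append(("--converted-data-location", converted_data_location))

    words = ["hive", "agent", "add", "databricks", "unity"]
    for flag, value in options:
        text = str(value)
        # No quoting is attempted, so reject values that would split into words.
        if not text or any(ch.isspace() for ch in text):
            raise ValueError(f"{flag} value must be non-empty and contain no whitespace")
        words += [flag, text]
    return " ".join(words)
```

Calling the helper with the values from the external table type example reproduces that command line. Avoid hard-coding a real access token in scripts.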

Databricks Workspace Hive Metastore target with the CLI

To add a Databricks Workspace Hive Metastore - Legacy agent with the CLI, use the hive agent add databricks legacy command.

Example for Workspace Hive Metastore (Legacy) Databricks agent
hive agent add databricks legacy --name LegacyExample --file-system-id fstarget --jdbc-server-hostname adb123.azuredatabricks.net --jdbc-port 443 --jdbc-http-path sql/protocolv1/o/8489/0234-127-example --access-token 123 --fs-mount-point /mnt/adls2/storage_account/ --convert-to-delta FALSE --default-fs-override dbfs:/mnt/adls2/storage_account 
note

To ensure you see all of your migrated data in Databricks, set the value of --default-fs-override to dbfs:/path/ and replace /path/ with the value from the --fs-mount-point parameter.

--default-fs-override dbfs:/mnt/adls2/storage_account

Adjust Databricks target configuration with the CLI

After adding your Databricks agent, you can adjust its configuration using the hive agent configure databricks unity CLI command for a Unity Catalog agent or the hive agent configure databricks legacy CLI command for a Workspace Hive Metastore - Legacy agent.

tip

After you've added a Databricks agent, you can still override its configuration when adding a new migration with the hive migration add databricks unity CLI command for Unity Catalog agents or with the hive migration add databricks legacy CLI command for Workspace Hive Metastore - Legacy agents.

Next steps

Add metadata rules with the hive rule add CLI command to define the scope, then create a metadata migration with hive migration add.
You can also create migrations and override existing Databricks agent properties with the hive migration add databricks unity and hive migration add databricks legacy CLI commands.

Unity Catalog without conversion partition considerations

If you're migrating without conversion to Unity Catalog, be aware of the following Databricks Unity Catalog limitations.

Custom partition schemes

Custom partition schemes created using commands like ALTER TABLE ADD PARTITION are not supported for tables in Unity Catalog. See Databricks Unity Catalog limitations for more information.

When you migrate a table (without conversion to Delta Lake format) with a custom partition scheme, Unity Catalog will be unable to query this table.

Unity Catalog can access tables that use directory-style partitioning. Adjusting the target data to directory-style partitioning allows partitions to be recognized by Unity Catalog.

You can optionally use Data Migrator to adjust the target data location during migration with path mapping. Mapping each source partition to a suitable target location will provide directory-style partitioning on the target recognizable by Unity Catalog.

Example path mapping:

Source path: /custom/partition/folder/path

Target path: /warehouse/tablespace/external/hive/mydb.db/table1/id2=2
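
As a rough illustration of how a directory-style target path is formed, the following Python sketch joins a table's root location with `key=value` segments for each partition column. The helper `partition_target_path` is hypothetical and not part of Data Migrator. It assumes partition values need no escaping, which may not hold for values containing special characters.

```python
from typing import Mapping


def partition_target_path(table_root: str, partition: Mapping[str, str]) -> str:
    """Return <table_root>/<key>=<value>/... for the given partition columns.

    The mapping's insertion order is used as the partition column order.
    """
    if not partition:
        raise ValueError("at least one partition column is required")
    segments = [f"{key}={value}" for key, value in partition.items()]
    return "/".join([table_root.rstrip("/"), *segments])


# Target path for the example mapping above.
print(partition_target_path(
    "/warehouse/tablespace/external/hive/mydb.db/table1", {"id2": "2"}))
```

Each source partition directory would then be mapped to the path this returns for its partition values.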

Drop partition

Dropping partitions with commands like ALTER TABLE DROP PARTITION is not supported in Unity Catalog. Hivemigrator will not migrate these events, which results in inconsistent queries between source and target. Databricks relies on the state of the table data to determine which partitions are present. Because the table data has not changed, Databricks still perceives the partition to be present.

In order to drop a partition fully, the table data must either be dropped or moved on the target. This can be done by dropping/moving the data on the source and relying on Data Migrator to migrate the change or by dropping/moving data at the target.

Subsequent Add partition

Adding new partitions with ADD PARTITION is not supported in Unity Catalog. An initial migration creates the table and any corresponding partitions with the data on the target (at the time of the metadata migration). If subsequent partitions are migrated, there will be no ADD PARTITION event on Databricks attempted by Hivemigrator. Querying the target table will produce unchanged results. However, running REFRESH TABLE will trigger Databricks to discover new directory-style partitions on the underlying data, and queries will produce up-to-date results (at least until there are new partitions).
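
If you want to script the refresh after new partitions arrive, a minimal Python sketch could look like the following. It assumes you already have a DB-API style cursor connected to your Databricks SQL endpoint (for example, from a Databricks SQL client library). The function `refresh_table` and the three-part catalog, schema, and table naming are illustrative assumptions, not part of Data Migrator.

```python
def quote_identifier(identifier: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backticks."""
    if not identifier:
        raise ValueError("identifier must not be empty")
    return "`" + identifier.replace("`", "``") + "`"


def refresh_table(cursor, catalog: str, schema: str, table: str) -> str:
    """Issue REFRESH TABLE for catalog.schema.table and return the statement.

    `cursor` is any object with an execute(sql) method (assumed DB-API style).
    """
    name = ".".join(quote_identifier(part) for part in (catalog, schema, table))
    statement = f"REFRESH TABLE {name}"
    cursor.execute(statement)
    return statement
```

Because new partitions only become visible after a refresh, you would typically run this after each metadata migration that adds partitions.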