Tutorial: Iceberg Table Replication

Apache Iceberg has become a standard table format for large-scale analytics, but organisations rarely keep all their data in a single catalog. Production data may live in an AWS Glue catalog while analytics teams query a separate Hive Metastore. A primary data centre may ingest records that need to be available in a secondary region for disaster recovery. A cloud-native data platform may need to feed a subset of tables to an on-premises environment for low-latency processing.

Symphony solves these problems through the Ice Flow extension, which replicates Iceberg tables between any combination of supported catalogs—Hive, Hadoop, JDBC, AWS Glue, and REST—copying both metadata and data files so that the target catalog contains a fully independent, queryable copy of the source tables.

When to Use Iceberg Replication

Common scenarios where Iceberg replication is valuable:

  • Disaster recovery—maintain a hot standby of critical tables in a separate catalog and storage system, so that if the primary environment becomes unavailable, analytics workloads can fail over to the replica with minimal data loss.
  • Multi-cloud and hybrid access—replicate tables from a cloud-native catalog (such as AWS Glue) to an on-premises Hive Metastore, giving local compute engines direct access to the data without cross-network queries.
  • Augmenting compute capacity—copy tables to a second environment so that additional compute resources can process the data without competing for capacity in the primary cluster.
  • Development and testing—replicate a subset of production tables to a development catalog, giving teams realistic data to work with without risking the production environment.
  • Data sharing—publish tables to a partner or downstream team's catalog without granting them access to the source environment.

What You Will Learn

By the end of this tutorial you will be able to:

  • Register Iceberg catalogs as sources and targets in Symphony
  • Define which tables to replicate using scopes
  • Run a one-time replication between two catalogs
  • Monitor replication progress and inspect individual operations
  • Verify that source and target are consistent
  • Set up continuous replication for ongoing synchronisation

Prerequisites

Before starting, ensure you have:

  • A running Symphony instance with the Ice Flow extension installed and running
  • Access to at least two Iceberg catalogs (source and target)—these can be any combination of Hive, Hadoop, JDBC, AWS Glue, or REST catalogs
  • Connection details for each catalog (URIs, credentials, region, etc.)
  • The storage locations (S3 buckets, HDFS paths, etc.) where each catalog's data files reside
  • Network connectivity from Symphony to both catalog endpoints and their underlying storage

If you do not have two catalogs available, you can set up a local environment with two separate Hive Metastore instances or two REST catalog endpoints for experimentation.

Step 1: Register Warehouses

A warehouse in Ice Flow represents a storage location where Iceberg data files reside. You need to register at least the storage locations used by your source and target catalogs.

  1. Navigate to Iceberg > Warehouses in the sidebar.
  2. Click Add New Warehouse.
  3. Enter a descriptive Name (e.g., Production S3) and the storage Location (e.g., s3://production-data/warehouse).
  4. Click Save.

Repeat this for each distinct storage location. If your source catalog reads from s3://prod-bucket/iceberg and your target writes to s3://dr-bucket/iceberg, you need two warehouse entries.

Example warehouses:

Name | Location
Production S3 | s3://prod-bucket/iceberg
DR S3 | s3://dr-bucket/iceberg

Step 2: Configure Catalogs

Next, register your source and target Iceberg catalogs. Each catalog entry tells Ice Flow how to connect to the metadata store.

  1. Navigate to Iceberg > Catalogs.
  2. Click Add New Catalog.
  3. Select the catalog Type (Hive, Hadoop, JDBC, AWS Glue, or REST). The form will pre-populate the connection properties appropriate for the selected type.
  4. Enter a Name for the catalog (e.g., Production Hive).
  5. Select the Warehouse you registered in Step 1.
  6. Fill in the Catalog Properties. The required properties depend on the catalog type:

Catalog Properties by Type

Hive

  • uri—Thrift Metastore URI (e.g., thrift://metastore.example.com:9083)
  • catalog—Catalog name (if the metastore hosts multiple catalogs)

Hadoop

  • uri—HDFS nameservice URI (e.g., hdfs://nameservice1:8020)

JDBC

  • uri—JDBC connection string (e.g., jdbc:postgresql://db.example.com:5432/iceberg)

AWS Glue

  • uri—Glue endpoint
  • s3.access-key-id—AWS access key ID
  • s3.secret-access-key—AWS secret access key
  • aws.region—AWS region (e.g., us-east-1)

REST

  • uri—REST catalog endpoint (e.g., http://rest-catalog.example.com:8234/catalog)
  • s3.endpoint—S3-compatible storage endpoint (if applicable)

You can add additional properties using the Add property button. Consult your catalog provider's documentation for any additional properties required.

  7. Click Save.

Repeat this process for your second catalog (the target).
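Before entering properties in the form, it can help to check that nothing is missing. The following is a minimal sketch, not part of Symphony or Ice Flow: the type keys (`hive`, `glue`, and so on), the `missing_properties` helper, and the example endpoint values are hypothetical, and the property names are the ones listed above. It assumes only the unconditional properties are mandatory, since `catalog` (Hive) and `s3.endpoint` (REST) are described as conditional.

```python
# Required connection properties per catalog type, taken from the lists above.
# Assumption: "catalog" (Hive) and "s3.endpoint" (REST) are optional because
# the tutorial describes them as conditional.
REQUIRED_PROPERTIES = {
    "hive": {"uri"},
    "hadoop": {"uri"},
    "jdbc": {"uri"},
    "glue": {"uri", "s3.access-key-id", "s3.secret-access-key", "aws.region"},
    "rest": {"uri"},
}


def missing_properties(catalog_type: str, properties: dict[str, str]) -> list[str]:
    """Return the required property names that are absent or empty."""
    required = REQUIRED_PROPERTIES[catalog_type.lower()]
    return sorted(name for name in required if not properties.get(name))


# Placeholder values for illustration only.
production_hive = {"uri": "thrift://metastore.example.com:9083"}
dr_glue = {"uri": "https://glue.us-east-1.amazonaws.com", "aws.region": "us-east-1"}

print(missing_properties("hive", production_hive))  # []
print(missing_properties("glue", dr_glue))
# ['s3.access-key-id', 's3.secret-access-key']
```

An empty list means every property this sketch treats as required has a value; it does not confirm that the values are correct, which is what the catalog explorer check is for.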

After saving both catalogs, you can verify that the connections are working by navigating to a catalog's detail page and selecting the Content tab. This opens the catalog explorer, where you can browse namespaces, list tables, view schemas, and preview sample data (up to 100 rows). If the content loads successfully, the catalog is properly configured.

Step 3: Define Scopes

Scopes define which tables to include in (or exclude from) replication. Each scope targets a single namespace and matches tables either by exact name or by regular expression pattern.

  1. Navigate to Iceberg > Scopes.
  2. Click Add New Scope.
  3. Enter a Name (e.g., All analytics tables).
  4. Enter the Namespace that contains the tables (e.g., analytics).
  5. Choose the Table Matching method:
    • By Name—matches a single table by its exact name (e.g., orders)
    • By Pattern—matches multiple tables using a regular expression (e.g., .* to match all tables in the namespace, or order.* to match all tables starting with "order")
  6. Click Save.

Example scopes:

Name | Namespace | Match Type | Value
All analytics tables | analytics | Pattern | .*
Orders table | sales | Name | orders
Customer tables | sales | Pattern | customer.*

Create at least one inclusion scope. You can also create exclusion scopes to skip specific tables that match your inclusion criteria—for example, including all tables in a namespace but excluding temporary or staging tables.

Scopes are reusable across multiple replications and monitors, so it is worth naming them descriptively.
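To preview which tables a set of scopes would select, the matching rules can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than Ice Flow's implementation: the `Scope` class and `select_tables` function are hypothetical, and the sketch assumes that a pattern must match the whole table name (consistent with the `.*` and `order.*` examples above) and that exclusion scopes take precedence over inclusion scopes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    namespace: str
    match_type: str  # "name" or "pattern"
    value: str

    def matches(self, namespace: str, table: str) -> bool:
        if namespace != self.namespace:
            return False
        if self.match_type == "name":
            return table == self.value
        # Assumption: the pattern must match the entire table name.
        return re.fullmatch(self.value, table) is not None


def select_tables(tables, include, exclude=()):
    """Keep (namespace, table) pairs matched by any inclusion scope and by no exclusion scope."""
    return [
        (ns, t)
        for ns, t in tables
        if any(s.matches(ns, t) for s in include)
        and not any(s.matches(ns, t) for s in exclude)
    ]


tables = [
    ("sales", "orders"),
    ("sales", "customer_eu"),
    ("sales", "customer_tmp"),
    ("analytics", "daily"),
]
include = [Scope("sales", "name", "orders"), Scope("sales", "pattern", "customer.*")]
exclude = [Scope("sales", "pattern", ".*_tmp")]

print(select_tables(tables, include, exclude))
# [('sales', 'orders'), ('sales', 'customer_eu')]
```

Testing a pattern this way against a list of real table names is a cheap way to catch an over-broad expression before it is attached to a replication.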

Step 4: Create a Replication

With warehouses, catalogs, and scopes configured, you can now set up the replication.

  1. Navigate to Iceberg > Replications.
  2. Click Create New Replication.
  3. Enter a Name (e.g., Production to DR).
  4. Select the Source Catalog and Target Catalog from the dropdowns. The target dropdown automatically excludes the selected source.
  5. Choose the replication Mode:
    • One-time—performs a single synchronisation and then stops. Available for all catalog types. Use this for initial loads, development snapshots, or ad-hoc copies.
    • Continuous—monitors the source for changes and replicates them as they occur. Only available when the source catalog is a Hive catalog (requires the Hive Metastore event stream). Use this for disaster recovery and ongoing synchronisation. If the source is not a Hive catalog, the mode is automatically set to one-time.
  6. Click Add Scope in the Inclusion section and select the scopes you created in Step 3. At least one inclusion scope is required. You can also add exclusion scopes to skip specific tables.
  7. Choose the Copy Type:
    • Latest snapshot—replicates only the current state of each table. This is faster and uses less storage, but the target will not retain the source's snapshot history.
    • All snapshots—replicates the complete snapshot history, preserving time-travel capabilities on the target. Use this when the target needs to support historical queries or rollback.
  8. Click Create.

The replication starts automatically, and its status changes from inactive to replicating. A one-time replication moves to complete when it finishes; a continuous replication stays in the replicating state until you stop it.
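The constraints described in this step can be written down as a small pre-flight check. The sketch below is illustrative only: `ReplicationSpec`, `validate`, and the string values for mode and copy type are hypothetical names, not Ice Flow identifiers. Where the form silently falls back to one-time mode for a non-Hive source, this sketch reports the mismatch instead so that it is visible.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicationSpec:
    name: str
    source_catalog: str
    source_type: str                    # e.g. "hive", "glue", "rest"
    target_catalog: str
    mode: str = "one-time"              # "one-time" or "continuous"
    copy_type: str = "latest-snapshot"  # or "all-snapshots"
    include_scopes: list[str] = field(default_factory=list)
    exclude_scopes: list[str] = field(default_factory=list)


def validate(spec: ReplicationSpec) -> list[str]:
    """Return a list of problems; an empty list means the spec follows the rules in this step."""
    problems = []
    if spec.source_catalog == spec.target_catalog:
        problems.append("source and target must be different catalogs")
    if not spec.include_scopes:
        problems.append("at least one inclusion scope is required")
    if spec.mode not in {"one-time", "continuous"}:
        problems.append(f"unknown mode: {spec.mode}")
    if spec.mode == "continuous" and spec.source_type.lower() != "hive":
        problems.append("continuous mode requires a Hive source catalog")
    if spec.copy_type not in {"latest-snapshot", "all-snapshots"}:
        problems.append(f"unknown copy type: {spec.copy_type}")
    return problems


spec = ReplicationSpec(
    name="Production to DR",
    source_catalog="Production Glue",
    source_type="glue",
    target_catalog="DR Hive",
    mode="continuous",
    include_scopes=["All analytics tables"],
)
print(validate(spec))
# ['continuous mode requires a Hive source catalog']
```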

Step 5: Monitor Progress

Once a replication is running, you can monitor its progress in detail.

Replication Overview

Navigate to Iceberg > Replications and click on your replication. The detail page shows:

  • Status—the current state (inactive, replicating, complete)
  • Source and target—the catalog names
  • Mode—one-time or continuous
  • Scopes—the inclusion and exclusion scopes applied

Operations

Select the Operations tab to see individual table-level replication events. Each operation shows:

Field | Description
Namespace | The namespace containing the table
Table | The table name
Type | The operation type
Start / End | When the operation started and completed
Data files | Number of data files committed
Total data | Total bytes transferred
Duration | Time taken for the operation

Click on an individual operation to see its file transfers—the individual files copied from source to target storage, with their sizes and transfer times.

Scopes

Select the Scopes tab to review the inclusion and exclusion scopes applied to this replication.

Step 6: Verify Consistency

After a one-time replication completes, or at any point during continuous replication, you can verify that the source and target catalogs are consistent. Consistency checking compares the metadata in both catalogs to confirm that the target contains the same tables, schemas, and data as the source (within the defined scopes).

You can also verify consistency manually by using the catalog explorer (Iceberg > Catalogs > [your target catalog] > Content) to browse the replicated tables, inspect their schemas, and preview data to confirm it matches the source.
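If you collect a summary of each table from both catalogs (for example by noting schemas and row counts from the catalog explorer or from your own query engine), the manual comparison can be automated. The following is a sketch of that comparison only, not Ice Flow's consistency check: `compare_catalogs` and the shape of the summaries (a tuple of column definitions plus a row count) are assumptions made for illustration.

```python
def compare_catalogs(source: dict, target: dict) -> dict:
    """Compare per-table summaries keyed by "namespace.table".

    Each value is any comparable summary; this example uses (schema, row_count).
    """
    shared = source.keys() & target.keys()
    return {
        "missing_on_target": sorted(source.keys() - target.keys()),
        "unexpected_on_target": sorted(target.keys() - source.keys()),
        "differing": sorted(name for name in shared if source[name] != target[name]),
    }


# Hypothetical summaries gathered by hand from each catalog.
source = {
    "sales.orders": (("id:long", "total:double"), 1200),
    "sales.customers": (("id:long", "name:string"), 300),
}
target = {
    "sales.orders": (("id:long", "total:double"), 1200),
    "sales.customers": (("id:long",), 300),
}

print(compare_catalogs(source, target))
# {'missing_on_target': [], 'unexpected_on_target': [], 'differing': ['sales.customers']}
```

Restrict both summaries to the tables covered by the replication's scopes before comparing; otherwise tables that were never meant to be replicated show up as missing.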

Step 7: Set Up Continuous Replication (Optional)

If your source catalog is a Hive catalog and you want ongoing synchronisation, create a new replication in Continuous mode. Continuous replication monitors the Hive Metastore event stream for changes (new records, schema changes, new tables) and replicates them to the target as they occur.

This is the recommended approach for disaster recovery scenarios, where you want the target to stay as close to the source as possible. The replication runs indefinitely until you stop it.

You can also set up a Monitor to observe changes to a catalog without replicating them. Monitors are useful for understanding the rate and nature of changes before committing to a replication configuration.

Creating a Monitor

  1. Navigate to Iceberg > Monitors.
  2. Click Add New Monitor.
  3. Select the Source Catalog (must be a Hive catalog).
  4. Add inclusion scopes (and optionally exclusion scopes).
  5. Set the Poll Period in milliseconds (the default of 1000 ms checks for changes every second).
  6. Click Save.

The monitor's Events tab shows a timeline of detected changes, including record counts, data file counts and sizes, and whether the changes were additions or deletions. This information helps you understand the volume of changes flowing through your source catalog and plan your replication strategy accordingly.
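To turn a set of monitor events into a rough throughput estimate, you can total the additions over the observation window. The sketch below assumes you have copied the event details out of the Events tab by hand; the `ChangeEvent` fields mirror the quantities listed above, but the class, the field names, and the `summarise` function are hypothetical and not an Ice Flow export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    table: str
    kind: str        # "addition" or "deletion"
    records: int
    data_files: int
    size_bytes: int


def summarise(events, window_seconds: float) -> dict:
    """Total the additions seen in the window and derive an average byte rate."""
    added = [e for e in events if e.kind == "addition"]
    return {
        "added_records": sum(e.records for e in added),
        "added_files": sum(e.data_files for e in added),
        "added_bytes_per_second": sum(e.size_bytes for e in added) / window_seconds,
        "deletion_events": sum(1 for e in events if e.kind == "deletion"),
    }


# Made-up events observed over a ten-minute window.
events = [
    ChangeEvent("core.orders", "addition", 12_000, 3, 180_000_000),
    ChangeEvent("core.orders", "deletion", 500, 1, 20_000_000),
    ChangeEvent("core.customers", "addition", 4_000, 1, 120_000_000),
]

print(summarise(events, window_seconds=600))
# {'added_records': 16000, 'added_files': 4, 'added_bytes_per_second': 500000.0, 'deletion_events': 1}
```

Comparing the average byte rate with the bandwidth available between source and target storage gives a first indication of whether continuous replication can keep up.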

Putting It All Together

Here is a complete example for a disaster recovery scenario:

  1. Register warehouses for production storage (s3://prod/warehouse) and DR storage (s3://dr/warehouse).
  2. Configure catalogs for the production Hive Metastore and the DR Hive Metastore.
  3. Create a scope that matches all tables in the core namespace using the pattern .*.
  4. Create a continuous replication from the production catalog to the DR catalog, with the scope you created, using the All snapshots copy type to preserve time-travel capability.
  5. Monitor the replication by checking the Operations tab to confirm that tables are being replicated and that file transfers are completing.

Once the initial synchronisation completes, the continuous replication will keep the DR environment up to date with ongoing changes. If the production environment becomes unavailable, analytics workloads can be redirected to the DR catalog, which contains a complete, queryable copy of the replicated tables.

Troubleshooting

  • Catalog connection fails—verify the URI, credentials, and network connectivity. Use the catalog explorer (Content tab) to test the connection. For AWS Glue catalogs, ensure the access key, secret key, and region are correct.
  • No tables found in a namespace—confirm the namespace name matches exactly (names are case-sensitive). Use the catalog explorer to browse available namespaces.
  • Replication stays in "replicating" state—for one-time replications, check the Operations tab for errors. For continuous replications, this is the expected state—the replication runs indefinitely.
  • Continuous mode is unavailable—continuous replication requires a Hive catalog as the source, because it relies on the Hive Metastore event stream to detect changes. For non-Hive sources, use one-time replication.
  • Missing data files on the target—verify that the target warehouse storage location is accessible and writable. Check that any required cloud credentials are configured in the target catalog properties.

Next Steps

  • Explore the Ice Flow extension reference at Extensions > Ice Flow for detailed documentation of all configuration options, catalog types, and replication modes.
  • Review the Configuration Reference to understand how Symphony itself is configured and operated.