Core Concepts

Ice Flow manages Apache Iceberg catalogs as a service within Symphony. This page explains the key concepts and how they relate to each other.


Catalogs

A catalog is a connection to an Iceberg metadata store. The metadata store tracks which tables exist, their schemas, and where their data files are located. Ice Flow supports six catalog types:

Type     | Metadata store
Hadoop   | Filesystem-based metadata (HDFS)
Hive     | Apache Hive Metastore (Thrift)
JDBC     | Relational database (PostgreSQL, MySQL, etc.)
AWS Glue | AWS Glue Data Catalog
REST     | Apache Iceberg REST Catalog API
Nessie   | Project Nessie versioned catalog

Each catalog is associated with a warehouse and has its own set of connection properties. You can connect multiple catalogs simultaneously, even of different types.

Catalogs that have active monitors or replications cannot be deleted until those associations are removed.


Warehouses

A warehouse is a storage location where Iceberg table data files reside. Every catalog is associated with a warehouse. Common warehouse locations include S3 buckets, HDFS paths, Azure Blob containers, and GCS buckets.

Warehouses serve two purposes:

  1. Catalog association — tells Ice Flow where a catalog's data lives
  2. Location mappings — enables path translation during replication between catalogs with different storage backends

Scopes

A scope is a reusable table selector. Scopes target a single namespace and match tables by exact name or regular expression pattern:

  • By Name — matches one table exactly (e.g. orders)
  • By Pattern — matches tables by regex (e.g. .* for all tables, order.* for tables starting with "order")

Scopes are defined independently and then attached to monitors or replications as inclusion scopes (tables to process) or exclusion scopes (tables to skip). This separation means you define a pattern once and reuse it across multiple monitors and replications.

Namespace names are case-sensitive and must match exactly as they appear in the catalog.
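
The selection rules above can be sketched in a few lines of Python. This is an illustration only, not Ice Flow's implementation: the `Scope` class and `select_tables` function are hypothetical names, and two behaviours are assumptions rather than documented facts, namely that a pattern must match the whole table name and that an exclusion scope overrides an inclusion scope.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Hypothetical model of a scope: one namespace plus a name or a regex."""
    namespace: str
    selector: str
    by_pattern: bool = False

    def matches(self, namespace: str, table: str) -> bool:
        # Namespace comparison is exact and case-sensitive.
        if namespace != self.namespace:
            return False
        if self.by_pattern:
            # Assumption: the pattern must match the entire table name.
            return re.fullmatch(self.selector, table) is not None
        return table == self.selector


def select_tables(tables, include, exclude=()):
    """Keep tables matched by any inclusion scope and by no exclusion scope."""
    return [
        (ns, name)
        for ns, name in tables
        if any(s.matches(ns, name) for s in include)
        and not any(s.matches(ns, name) for s in exclude)
    ]


tables = [
    ("sales", "orders"),
    ("sales", "order_items"),
    ("sales", "customers"),
    ("Sales", "orders"),  # different namespace: the case differs
]
all_orders = Scope("sales", "order.*", by_pattern=True)
only_items = Scope("sales", "order_items")

selected = select_tables(tables, include=[all_orders], exclude=[only_items])
print(selected)
```

Because scopes are plain values here, the same `all_orders` object could be passed to any number of selections, which mirrors the define-once, reuse-everywhere idea described above.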


Monitors

A monitor continuously observes changes to tables in a Hive catalog. It polls the Hive Metastore event stream at a configurable interval and records detected changes (creates, modifications, renames, deletes).

Monitors are passive — they record events but do not copy data. They are useful for:

  • Understanding change patterns before setting up replication
  • Auditing what has changed and when
  • Alerting on unexpected changes

Monitors require a Hive catalog as the source because they rely on the Hive Metastore event stream.
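
To make the "passive polling" idea concrete, here is a minimal sketch of a monitor that records events without copying anything. All names (`ChangeEvent`, `Monitor`, `fetch_events`) are hypothetical, and the event stream is simulated with an in-memory list; the real Hive Metastore client calls are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class ChangeEvent:
    event_id: int
    table: str
    kind: str  # "create", "modify", "rename" or "delete"


@dataclass
class Monitor:
    # fetch_events(last_id) is a hypothetical stand-in for reading the
    # metastore event stream; it returns events with an id above last_id.
    fetch_events: Callable[[int], List[ChangeEvent]]
    last_event_id: int = 0
    recorded: List[ChangeEvent] = field(default_factory=list)

    def poll_once(self) -> int:
        """Record events newer than the last one seen. No data is copied."""
        new_events = self.fetch_events(self.last_event_id)
        for event in new_events:
            self.recorded.append(event)
            self.last_event_id = max(self.last_event_id, event.event_id)
        return len(new_events)


# Simulated event stream for the example.
stream = [
    ChangeEvent(1, "orders", "create"),
    ChangeEvent(2, "orders", "modify"),
    ChangeEvent(3, "orders", "rename"),
]


def fake_fetch(last_id: int) -> List[ChangeEvent]:
    return [e for e in stream if e.event_id > last_id]


monitor = Monitor(fetch_events=fake_fetch)
first = monitor.poll_once()   # picks up all three events
second = monitor.poll_once()  # nothing new since the last poll
print(first, second, [e.kind for e in monitor.recorded])
```

A long-running monitor would call `poll_once` in a loop and sleep for the configured interval between calls. Tracking the last seen event ID is what keeps repeated polls from recording the same change twice.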


Replications

A replication copies Iceberg tables from a source catalog to a target catalog, including both metadata and data files. The target receives a fully independent, queryable copy.

Modes

Mode       | Behaviour
One-time   | Performs a single sync pass and stops. Works with all catalog types.
Continuous | Watches for changes and replicates them as they occur. Requires a Hive source.

Copy Strategies

Strategy        | Behaviour
Latest snapshot | Copies only the current table state. Faster and uses less storage.
All snapshots   | Copies the complete snapshot history, preserving time-travel capabilities.
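
The difference between the two strategies can be illustrated with a small sketch. The `Snapshot` class, the strategy strings `"latest"` and `"all"`, and the file names are assumptions made for this example; they are not Ice Flow's actual data model or configuration values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: Tuple[str, ...]


def snapshots_to_copy(history: List[Snapshot], strategy: str) -> List[Snapshot]:
    """history is ordered oldest to newest; the last entry is the current state."""
    if not history:
        return []
    if strategy == "latest":
        return [history[-1]]
    if strategy == "all":
        return list(history)
    raise ValueError(f"unknown strategy: {strategy}")


def files_to_copy(history: List[Snapshot], strategy: str) -> List[str]:
    # Snapshots can reference the same data file, so de-duplicate
    # while preserving first-seen order.
    files: List[str] = []
    for snapshot in snapshots_to_copy(history, strategy):
        for path in snapshot.data_files:
            if path not in files:
                files.append(path)
    return files


history = [
    Snapshot(1, ("a.parquet", "b.parquet")),
    Snapshot(2, ("b.parquet", "c.parquet")),  # a.parquet is no longer current
]
latest_files = files_to_copy(history, "latest")
all_files = files_to_copy(history, "all")
print(latest_files)
print(all_files)
```

In this example the latest-snapshot strategy skips `a.parquet`, which only an older snapshot references. That is the storage saving, and also why time travel to snapshot 1 would not work on a target populated this way.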

Operations and File Transfers

Each table synchronised during a replication cycle is recorded as an operation. Each operation may involve copying one or more data files, recorded as file transfers. This two-level tracking lets you see both the high-level progress (which tables were synced) and the low-level detail (which files were copied, their sizes, and transfer durations).
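
One way to picture the two-level tracking is as a parent record per table with child records per file. The class and field names below (`Operation`, `FileTransfer`, `size_bytes`, `duration_seconds`) are hypothetical and chosen only to mirror the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileTransfer:
    path: str
    size_bytes: int
    duration_seconds: float


@dataclass
class Operation:
    table: str
    transfers: List[FileTransfer] = field(default_factory=list)

    @property
    def total_bytes(self) -> int:
        return sum(t.size_bytes for t in self.transfers)

    @property
    def total_seconds(self) -> float:
        return sum(t.duration_seconds for t in self.transfers)


# One replication cycle: two tables synced, three files copied in total.
cycle = [
    Operation("sales.orders", [
        FileTransfer("data/00000.parquet", 1024, 0.5),
        FileTransfer("data/00001.parquet", 2048, 1.0),
    ]),
    Operation("sales.customers", [
        FileTransfer("data/00000.parquet", 512, 0.25),
    ]),
]

# High-level view: which tables were synced, with how many files and bytes.
summary = {op.table: (len(op.transfers), op.total_bytes) for op in cycle}
print(summary)
```

The high-level view comes from aggregating over operations, while the per-file detail remains available on each `Operation.transfers` list.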


Location Mappings

A location mapping defines how file paths are translated when replicating between catalogs with different storage locations. Each mapping specifies:

  • A source warehouse and source path
  • A target warehouse and target path

During replication, Ice Flow rewrites file paths so that data files land in the correct location on the target storage system. For example, a mapping might translate s3://prod-bucket/warehouse to s3://dr-bucket/warehouse.

Location mappings are global — they are not tied to a specific replication. You create mappings once and associate them with replications as needed.

Mappings are path-scoped: a mapping applies only to tables whose source location falls under its source path. A table that matches no mapping is replicated using the catalog's default warehouse, so an unrelated mapping never disrupts other replications.
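
Path translation of this kind amounts to a prefix rewrite. The sketch below is a simplified illustration: it models only the source and target paths (not the warehouses), the names `LocationMapping` and `translate_path` are hypothetical, and the rule that the longest matching source path wins is an assumption, not documented behaviour.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LocationMapping:
    source_path: str
    target_path: str

    def applies_to(self, location: str) -> bool:
        # Match on a path boundary so ".../warehouse" does not
        # accidentally match ".../warehouse-old".
        prefix = self.source_path.rstrip("/")
        return location == prefix or location.startswith(prefix + "/")

    def translate(self, location: str) -> str:
        prefix = self.source_path.rstrip("/")
        return self.target_path.rstrip("/") + location[len(prefix):]


def translate_path(location: str, mappings: Iterable[LocationMapping]) -> Optional[str]:
    """Return the rewritten path, or None when no mapping applies."""
    candidates = [m for m in mappings if m.applies_to(location)]
    if not candidates:
        return None  # the caller would fall back to the default warehouse
    # Assumption: the most specific (longest) source path takes precedence.
    best = max(candidates, key=lambda m: len(m.source_path.rstrip("/")))
    return best.translate(location)


mappings = [LocationMapping("s3://prod-bucket/warehouse", "s3://dr-bucket/warehouse")]

mapped = translate_path("s3://prod-bucket/warehouse/sales/orders/data/f.parquet", mappings)
unmapped = translate_path("s3://prod-bucket/warehouse-old/t/f.parquet", mappings)
print(mapped)
print(unmapped)
```

Returning `None` for an unmatched location keeps the "no mapping applies" case explicit, so an unrelated mapping cannot rewrite a path it was never meant to cover.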


How They Fit Together

Warehouse ──── Catalog ──── Scope ──── Monitor
    │             │           │
    │             │           └──── Replication
    │             │                      │
    └── Location Mapping ────────────────┘

  1. You define warehouses to describe where data lives
  2. You create catalogs that connect to metadata stores and reference a warehouse
  3. You define scopes that select tables by namespace and name pattern
  4. You attach scopes to monitors (for observation) or replications (for copying)
  5. If source and target use different storage, location mappings translate paths during replication

Consistency Checking

After replication, you can run a consistency check to verify that the target catalog matches the source. The check compares table states and reports one of three outcomes: consistent, inconsistent, or check failed. This is especially useful after an initial one-time replication or when investigating potential data drift.
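
The three outcomes can be sketched as follows. This is an illustration under stated assumptions: what Ice Flow actually includes in a "table state" is not specified here, so the example uses a hypothetical dictionary with a snapshot ID and a file count, and the names `CheckResult` and `check_table` are invented for the sketch.

```python
from enum import Enum
from typing import Callable, Dict


class CheckResult(Enum):
    CONSISTENT = "consistent"
    INCONSISTENT = "inconsistent"
    FAILED = "failed"


def check_table(load_source: Callable[[], Dict], load_target: Callable[[], Dict]) -> CheckResult:
    """Compare two table states; a load error yields FAILED, not INCONSISTENT."""
    try:
        source_state = load_source()
        target_state = load_target()
    except Exception:
        # The check itself could not complete, so nothing can be
        # said about whether the tables match.
        return CheckResult.FAILED
    if source_state == target_state:
        return CheckResult.CONSISTENT
    return CheckResult.INCONSISTENT


def unreachable() -> Dict:
    raise ConnectionError("catalog unavailable")


# Hypothetical table states used only for this example.
source = {"snapshot_id": 42, "file_count": 3}
drifted = {"snapshot_id": 41, "file_count": 2}

same = check_table(lambda: source, lambda: dict(source))
drift = check_table(lambda: source, lambda: drifted)
failed = check_table(lambda: source, unreachable)
print(same.value, drift.value, failed.value)
```

Keeping "failed" separate from "inconsistent" matters when investigating drift: a failed check says the comparison never happened, not that the data differs.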


License Enforcement

Ice Flow integrates with Symphony's license enforcement system. When enforcement is active — either globally or specifically for the Ice Flow extension — all replication and monitoring activities are paused. The UI displays a banner indicating the enforcement state. Activities resume automatically when enforcement is lifted.