Apache Iceberg is an open-source table format designed for massive analytic tables. It brings the reliability and simplicity of SQL tables to big data, while allowing engines like Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig to safely work with the same tables at the same time.
Here are some key features of Apache Iceberg:
- High-Performance Format:
  - Iceberg is optimized for large-scale analytic datasets.
  - It enables efficient operations on massive tables, making it suitable for big data scenarios.
- SQL-Like Commands:
  - Iceberg supports expressive SQL commands for tasks like merging new data, updating existing rows, and performing targeted deletes.
  - You can eagerly rewrite data files for read performance or use delete deltas for faster updates.
- Schema Evolution:
  - Adding or modifying columns won’t result in “zombie” data.
  - Columns can be renamed and reordered without requiring a full table rewrite.
- Hidden Partitioning:
  - Iceberg automates the task of producing partition values for rows in a table.
  - It skips unnecessary partitions and files automatically, improving query performance.
- Time Travel and Rollback:
  - Time-travel functionality allows reproducible queries using specific table snapshots.
  - Version rollback enables quick corrections by resetting tables to a known good state.
- Data Compaction:
  - Iceberg supports out-of-the-box data compaction.
  - You can choose from different rewrite strategies (such as bin-packing or sorting) to optimize file layout and size.
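To make the schema evolution point more concrete, here is a minimal sketch in plain Python of why tracking columns by a stable numeric ID avoids both rewrites and "zombie" data. It is a toy model, not the Iceberg API: `ToySchema` and all of its methods are hypothetical names invented for this illustration.

```python
class ToySchema:
    """Hypothetical schema that tracks columns by numeric ID; not the Iceberg API."""

    def __init__(self):
        self.names_by_id = {}  # column ID -> current column name
        self._next_id = 1

    def add_column(self, name: str) -> int:
        column_id = self._next_id
        self._next_id += 1  # IDs are never reused, even after a drop
        self.names_by_id[column_id] = name
        return column_id

    def rename_column(self, old: str, new: str) -> None:
        # Only metadata changes; stored rows are untouched.
        for column_id, name in self.names_by_id.items():
            if name == old:
                self.names_by_id[column_id] = new
                return
        raise KeyError(old)

    def drop_column(self, name: str) -> None:
        for column_id, current in list(self.names_by_id.items()):
            if current == name:
                del self.names_by_id[column_id]
                return
        raise KeyError(name)

    def project(self, stored_row: dict) -> dict:
        """Resolve a stored row (keyed by column ID) against the current schema."""
        return {name: stored_row.get(cid) for cid, name in self.names_by_id.items()}


schema = ToySchema()
schema.add_column("id")
schema.add_column("email")
stored_row = {1: 7, 2: "a@example.com"}  # written once, never rewritten

schema.rename_column("email", "contact")
after_rename = schema.project(stored_row)

# Dropping and re-adding a column with the same name yields a new ID,
# so the old values do not come back as "zombie" data.
schema.drop_column("contact")
schema.add_column("contact")
after_readd = schema.project(stored_row)
```

In this sketch the rename is visible immediately without touching `stored_row`, and the re-added `contact` column reads as empty because it has a fresh ID.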
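Hidden partitioning can be pictured with a small plain-Python sketch. It assumes a day-based transform on a `ts` column; `ToyTable`, `day_transform`, and the row layout are hypothetical and only mimic the idea, not Iceberg's implementation. The writer never supplies a partition value, and the reader filters on `ts` alone.

```python
from collections import defaultdict
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Toy stand-in for a day-granularity partition transform."""
    return ts.date()


class ToyTable:
    """Hypothetical in-memory table; not the Iceberg API."""

    def __init__(self):
        # partition value -> rows stored in that partition
        self.partitions = defaultdict(list)

    def append(self, row: dict) -> None:
        # The writer supplies only `ts`; the partition value is derived here.
        self.partitions[day_transform(row["ts"])].append(row)

    def scan(self, start: datetime, end: datetime):
        """Return (matching rows, partitions read) for start <= ts < end."""
        rows, scanned = [], 0
        for day, part_rows in self.partitions.items():
            # Skip partitions whose day lies entirely outside the range.
            if day < start.date() or day > end.date():
                continue
            scanned += 1
            rows.extend(r for r in part_rows if start <= r["ts"] < end)
        return rows, scanned


table = ToyTable()
table.append({"id": 1, "ts": datetime(2024, 1, 1, 9, 30)})
table.append({"id": 2, "ts": datetime(2024, 1, 2, 14, 0)})
table.append({"id": 3, "ts": datetime(2024, 1, 3, 8, 15)})

# The query filters on `ts` only; pruning happens behind the scenes.
rows, scanned = table.scan(datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 18, 0))
```

Here only one of the three day partitions is read, even though the query never mentions a partition column.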
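Time travel and rollback both follow from keeping every committed table state as an immutable snapshot. The following is a rough sketch under that assumption; `ToySnapshotTable` and its methods are hypothetical names for illustration and do not correspond to Iceberg's real classes or procedures.

```python
class ToySnapshotTable:
    """Hypothetical in-memory model of snapshots; not the Iceberg API."""

    def __init__(self):
        self.snapshots = {}     # snapshot ID -> immutable tuple of rows
        self.current_id = None  # snapshot that readers see by default
        self._next_id = 1

    def commit(self, rows) -> int:
        """Append rows as a new snapshot built on top of the current one."""
        base = self.snapshots.get(self.current_id, ())
        snapshot_id = self._next_id
        self._next_id += 1
        self.snapshots[snapshot_id] = base + tuple(rows)
        self.current_id = snapshot_id
        return snapshot_id

    def read(self, snapshot_id=None):
        """Read the current snapshot, or an older one for time travel."""
        if snapshot_id is None:
            snapshot_id = self.current_id
        if snapshot_id is None:
            return []  # nothing has been committed yet
        return list(self.snapshots[snapshot_id])

    def rollback_to(self, snapshot_id: int) -> None:
        """Point the table back at a known good snapshot."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


events = ToySnapshotTable()
good = events.commit(["a", "b"])
bad = events.commit(["corrupt"])

latest = events.read()          # includes the bad commit
as_of_good = events.read(good)  # time travel: same result every time

events.rollback_to(good)        # rollback: only the current pointer moves
after_rollback = events.read()
```

Because a rollback in this model only moves a pointer, it is cheap, and the bad snapshot remains available for inspection until it is explicitly expired.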
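As an illustration of the bin-packing idea behind compaction, the sketch below groups small files into rewrite batches that stay under a target size. It uses a generic first-fit-decreasing heuristic chosen for this example, which is not necessarily how Iceberg plans its rewrites; `plan_bin_pack`, the file sizes, and the 256 MB target are all assumptions.

```python
def plan_bin_pack(file_sizes_mb, target_mb):
    """Group files into bins of at most target_mb using first-fit decreasing.

    Each returned bin is a list of file sizes that would be rewritten
    together into a single larger file.
    """
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for current_bin in bins:
            if sum(current_bin) + size <= target_mb:
                current_bin.append(size)
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([size])
    return bins


small_files = [60, 10, 200, 120, 30, 90]  # hypothetical file sizes in MB
plan = plan_bin_pack(small_files, target_mb=256)
```

With these inputs the six files collapse into three output files, each at or below the target size. A sort-based strategy would instead reorder rows by chosen columns while rewriting, trading more work for better data clustering.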
Configuration
Troubleshooting