DuckDB

Query Faceberg catalogs with DuckDB

DuckDB’s native Iceberg support makes it easy to query Faceberg catalogs using SQL.

Setup

Install the DuckDB Python package:

pip install duckdb
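
The Iceberg extension itself is installed and loaded from inside DuckDB rather than via pip; each example below repeats this step for completeness:

import duckdb

conn = duckdb.connect()
# Download the Iceberg extension once, then load it into this session
conn.execute("INSTALL iceberg")
conn.execute("LOAD iceberg")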

Method 1: Interactive Shell with quack

The easiest way to query a catalog is with the quack command:

# Start REST server (in one terminal)
faceberg mycatalog serve

# Open DuckDB shell with catalog attached (in another terminal)
faceberg mycatalog quack

Then run SQL directly:

-- List all tables
SHOW ALL TABLES;

-- Query data
SELECT label, substr(text, 1, 100) as preview
FROM iceberg_catalog.default.imdb
LIMIT 10;

-- Aggregate queries
SELECT label, COUNT(*) as count
FROM iceberg_catalog.default.imdb
GROUP BY label;

-- Exit
.quit

Method 2: Direct Metadata Query

Query Iceberg tables directly using iceberg_scan():

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")

# Query using metadata file path
result = conn.execute("""
    SELECT label, substr(text, 1, 80) as preview
    FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
    LIMIT 5
""").fetchdf()

print(result)
   label                                            preview
0      0  I rented I AM CURIOUS-YELLOW from my video sto...
1      0  "I Am Curious: Yellow" is a risible and preten...
2      0  If only to avoid making this type of film in t...
3      0  This film was probably inspired by Godard's Ma...
4      0  Oh, brother...after hearing about this ridicul...
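
Results are not limited to pandas DataFrames. A minimal sketch of fetching the same kind of query as a PyArrow table (assumes pyarrow is installed; the path is the same example metadata file used above):

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")

# Fetch the result as a PyArrow table instead of a pandas DataFrame
table = conn.execute("""
    SELECT label, text
    FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
    LIMIT 1000
""").fetch_arrow_table()

print(table.num_rows, table.column_names)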

Method 3: REST Catalog Attachment

Attach a REST catalog for a more integrated experience:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")

# Attach REST catalog (server must be running)
conn.execute("""
    ATTACH 'http://localhost:8181' AS cat (
        TYPE ICEBERG,
        AUTHORIZATION_TYPE 'none'
    )
""")

# Query with standard SQL
result = conn.execute("""
    SELECT * FROM cat.default.imdb LIMIT 10
""").fetchdf()

For remote catalogs on HuggingFace:

conn.execute("""
    ATTACH 'https://user-mycatalog.hf.space' AS cat (TYPE ICEBERG)
""")

Filtering by Split

Tables are partitioned by split (train/test/validation), so a WHERE clause on split prunes whole partitions. First, check which splits a table contains; a filtered query follows the example below:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")

# Count rows per split partition
result = conn.execute("""
    SELECT split, COUNT(*) as count
    FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
    GROUP BY split
""").fetchdf()

print(result)
          split  count
0  unsupervised  50000
1          test  25000
2         train  25000
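
To actually benefit from partition pruning, filter on split in the WHERE clause. A sketch that should read only the test partition's data files (same example path and conn as above):

# Reads only files belonging to the 'test' partition
result = conn.execute("""
    SELECT label, COUNT(*) as count
    FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
    WHERE split = 'test'
    GROUP BY label
""").fetchdf()

print(result)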

Analytics Queries

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")

# Label distribution
result = conn.execute("""
    SELECT
        label,
        COUNT(*) as count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
    FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
    GROUP BY label
    ORDER BY count DESC
""").fetchdf()

print(result)
   label  count  percentage
0     -1  50000        50.0
1      0  25000        25.0
2      1  25000        25.0
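
To eyeball a few reviews per label, DuckDB's QUALIFY clause pairs well with a window function. A sketch that samples three random rows from each label (same example path as above):

# Three random examples per label
result = conn.execute("""
    SELECT label, substr(text, 1, 80) as preview
    FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
    QUALIFY row_number() OVER (PARTITION BY label ORDER BY random()) <= 3
""").fetchdf()

print(result)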

Working with Text Data

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")

# Text length analysis
result = conn.execute("""
    SELECT
        label,
        AVG(LENGTH(text)) as avg_length,
        MIN(LENGTH(text)) as min_length,
        MAX(LENGTH(text)) as max_length
    FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
    GROUP BY label
""").fetchdf()

print(result)
   label  avg_length  min_length  max_length
0      1  1324.79768          65       13704
1      0  1294.06436          32        8969
2     -1  1329.90250          43       14282
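
Text columns can also be searched directly in SQL. A sketch using a case-insensitive substring match with ILIKE (the search term is only an illustration):

# Count reviews mentioning a keyword, per label
result = conn.execute("""
    SELECT label, COUNT(*) as mentions
    FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
    WHERE text ILIKE '%masterpiece%'
    GROUP BY label
""").fetchdf()

print(result)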

Export Results

Export query results to various formats:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")

# Export to Parquet
conn.execute("""
    COPY (
        SELECT * FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
        WHERE split = 'test'
    ) TO 'test_data.parquet' (FORMAT PARQUET)
""")

# Export to CSV
conn.execute("""
    COPY (
        SELECT label, substr(text, 1, 200) as text
        FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
        LIMIT 1000
    ) TO 'sample.csv' (FORMAT CSV, HEADER)
""")

Performance Tips

  1. Filter by split — Partition pruning skips unnecessary files
  2. Project columns — Select only needed columns
  3. Use LIMIT — For exploration, limit results
  4. Aggregate pushdown — Simple aggregates such as COUNT(*) can often be answered from Parquet metadata without scanning row data

For example:

-- Efficient: only reads needed partitions and columns
SELECT label, COUNT(*)
FROM iceberg_scan('...')
WHERE split = 'train'
GROUP BY label;

Next Steps