DuckDB’s native Iceberg support makes it easy to query Faceberg catalogs using SQL.
Setup
Install DuckDB with Iceberg support:
pip install duckdb
Method 1: Interactive Shell with quack
The easiest way to query is using the quack command:
# Start REST server (in one terminal)faceberg mycatalog serve# Open DuckDB shell with catalog attached (in another terminal)faceberg mycatalog quack
Then run SQL directly:
-- List all tablesSHOW ALLTABLES;-- Query dataSELECTlabel, substr(text, 1, 100) as previewFROM iceberg_catalog.default.imdbLIMIT10;-- Aggregate queriesSELECTlabel, COUNT(*) ascountFROM iceberg_catalog.default.imdbGROUPBYlabel;-- Exit.quit
Method 2: Direct Metadata Query
Query Iceberg tables directly using iceberg_scan():
import duckdbconn = duckdb.connect()conn.execute("INSTALL iceberg; LOAD iceberg")# Query using metadata file pathresult = conn.execute(""" SELECT label, substr(text, 1, 80) as preview FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json') LIMIT 5""").fetchdf()print(result)
label preview
0 0 I rented I AM CURIOUS-YELLOW from my video sto...
1 0 "I Am Curious: Yellow" is a risible and preten...
2 0 If only to avoid making this type of film in t...
3 0 This film was probably inspired by Godard's Ma...
4 0 Oh, brother...after hearing about this ridicul...
Method 3: REST Catalog Attachment
Attach a REST catalog for a more integrated experience:
import duckdbconn = duckdb.connect()conn.execute("INSTALL iceberg; LOAD iceberg")# Attach REST catalog (server must be running)conn.execute(""" ATTACH 'http://localhost:8181' AS cat ( TYPE ICEBERG, AUTHORIZATION_TYPE 'none' )""")# Query with standard SQLresult = conn.execute(""" SELECT * FROM cat.default.imdb LIMIT 10""").fetchdf()
For remote catalogs on HuggingFace:
conn.execute(""" ATTACH 'https://user-mycatalog.hf.space' AS cat (TYPE ICEBERG)""")
Filtering by Split
Tables are partitioned by split (train/test/validation). Filter efficiently:
import duckdbconn = duckdb.connect()conn.execute("INSTALL iceberg; LOAD iceberg")# Only reads the test partitionresult = conn.execute(""" SELECT split, COUNT(*) as count FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json') GROUP BY split""").fetchdf()print(result)
import duckdbconn = duckdb.connect()conn.execute("INSTALL iceberg; LOAD iceberg")# Label distributionresult = conn.execute(""" SELECT label, COUNT(*) as count, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json') GROUP BY label ORDER BY count DESC""").fetchdf()print(result)
import duckdbconn = duckdb.connect()conn.execute("INSTALL iceberg; LOAD iceberg")# Text length analysisresult = conn.execute(""" SELECT label, AVG(LENGTH(text)) as avg_length, MIN(LENGTH(text)) as min_length, MAX(LENGTH(text)) as max_length FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json') GROUP BY label""").fetchdf()print(result)