# Faceberg

Bridge HuggingFace datasets with Apache Iceberg tables.
Faceberg maps HuggingFace datasets to Apache Iceberg tables without copying data. Your catalog metadata is stored on HuggingFace Spaces, and any Iceberg-compatible query engine can access the data.
## Installation

```bash
pip install faceberg
```

## Prerequisites

Set up your HuggingFace token:

```bash
export HF_TOKEN=your_huggingface_token
```

Get your token from HuggingFace Settings.
## Create a Catalog

Create a new catalog on HuggingFace Hub:

```bash
faceberg user/mycatalog init
```

This creates a HuggingFace Space that:

- Stores your Iceberg catalog metadata
- Auto-deploys a REST server at https://user-mycatalog.hf.space
- Follows the Apache Iceberg REST catalog specification
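The REST endpoint follows the standard Hugging Face Spaces URL pattern (`owner-name.hf.space`). As a rough sketch of how that URL relates to the catalog repo id (the `space_url` helper below is illustrative, not part of faceberg, and ignores edge cases in Hugging Face's name normalization):

```python
def space_url(repo_id: str) -> str:
    """Derive the hf.space endpoint for a Space repo id like 'user/mycatalog'.

    Hugging Face serves Spaces at https://<owner>-<name>.hf.space; this
    simplified sketch just lowercases both parts and joins them with a hyphen.
    """
    owner, name = repo_id.split("/", 1)
    return f"https://{owner.lower()}-{name.lower()}.hf.space"

print(space_url("user/mycatalog"))  # https://user-mycatalog.hf.space
```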
## Add Datasets

Add HuggingFace datasets to your catalog. The table name is inferred from the dataset:

```bash
# Add datasets (table name inferred: org.repo)
faceberg user/mycatalog add stanfordnlp/imdb --config plain_text
faceberg user/mycatalog add Salesforce/wikitext --config wikitext-2-v1
faceberg user/mycatalog add openai/gsm8k --config main
```

You can also specify an explicit table name:

```bash
faceberg user/mycatalog add stanfordnlp/imdb --table default.movies --config plain_text
```

## List Tables

```bash
faceberg user/mycatalog list
```

## Query Data
Scan a table to see sample data:

```bash
faceberg user/mycatalog scan stanfordnlp.imdb --limit 5
```

## Interactive Queries with DuckDB

Use the `quack` command to open an interactive DuckDB shell:

```bash
faceberg user/mycatalog quack
```

Run SQL queries directly:
```sql
-- Show all tables
SHOW ALL TABLES;

-- Query the IMDB dataset
SELECT label, substr(text, 1, 100) AS preview
FROM iceberg_catalog.stanfordnlp.imdb
LIMIT 10;

-- Exit
.quit
```

## Query using PyIceberg
Connect to your remote catalog using the Python API:

```python
import os

from faceberg import catalog
from pyiceberg.expressions import EqualTo

# Connect to your remote catalog
cat = catalog("user/mycatalog", hf_token=os.environ.get("HF_TOKEN"))

# Load a table
table = cat.load_table("stanfordnlp.imdb")

# Query with PyIceberg
df = table.scan(limit=100).to_pandas()
print(df.head())

# Filter by partition (efficient: only reads matching files)
df = table.scan(row_filter=EqualTo("split", "train")).to_pandas()
```

## Catalog API
The catalog object supports standard Iceberg operations:

| Method | Description |
|---|---|
| `init()` | Initialize the catalog storage |
| `add_dataset(identifier, repo, config)` | Add a HuggingFace dataset as an Iceberg table |
| `sync_dataset(identifier)` | Sync a single dataset (update if source changed) |
| `sync_datasets()` | Sync all datasets in the catalog |
| `load_table(identifier)` | Load a table for querying |
| `list_tables(namespace)` | List tables in a namespace |
| `list_namespaces()` | List all namespaces |
| `drop_table(identifier)` | Remove a table from the catalog |
| `table_exists(identifier)` | Check if a table exists |
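To illustrate the call pattern only, here is a toy in-memory stand-in that mirrors some of the method names above. It is not the faceberg implementation; every class and behavior below is a hypothetical sketch:

```python
class ToyCatalog:
    """Toy in-memory stand-in mirroring the catalog method names (illustration only)."""

    def __init__(self):
        self._tables = {}

    def add_dataset(self, identifier, repo, config):
        # Record which HF dataset/config backs this table identifier.
        self._tables[identifier] = {"repo": repo, "config": config}

    def table_exists(self, identifier):
        return identifier in self._tables

    def list_tables(self, namespace):
        # Identifiers are namespace-qualified, e.g. "stanfordnlp.imdb".
        return [t for t in self._tables if t.startswith(namespace + ".")]

    def drop_table(self, identifier):
        # Removes only the catalog entry, never the underlying data.
        self._tables.pop(identifier, None)


cat = ToyCatalog()
cat.add_dataset("stanfordnlp.imdb", repo="stanfordnlp/imdb", config="plain_text")
print(cat.list_tables("stanfordnlp"))       # ['stanfordnlp.imdb']
cat.drop_table("stanfordnlp.imdb")
print(cat.table_exists("stanfordnlp.imdb"))  # False
```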
## How It Works

Faceberg creates lightweight Iceberg metadata that points to the original HuggingFace dataset files:
```
                      HuggingFace Hub
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  ┌─────────────────────┐    ┌─────────────────────────┐ │
│  │ HF Datasets         │    │ HF Spaces (Catalog)     │ │
│  │ (Original Parquet)  │◄───│ • Iceberg metadata      │ │
│  │                     │    │ • REST API endpoint     │ │
│  │ stanfordnlp/imdb/   │    │ • faceberg.yml          │ │
│  │ └── *.parquet       │    │                         │ │
│  └─────────────────────┘    └───────────┬─────────────┘ │
│                                         │               │
└─────────────────────────────────────────┼───────────────┘
                                          │ Iceberg REST API
                                          ▼
                             ┌─────────────────────────┐
                             │      Query Engines      │
                             │  DuckDB, Pandas, Spark  │
                             └─────────────────────────┘
```
No data is copied — only metadata is created. Query with DuckDB, PyIceberg, Spark, or any Iceberg-compatible tool.
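In other words, the catalog holds only pointers. As a rough illustration of the idea (the path layout and dictionary shape below are hypothetical, not actual faceberg or Iceberg output):

```python
# Hypothetical manifest entry: the catalog records a reference to an existing
# Parquet file in the dataset repo rather than copying the data.
entry = {
    "file_path": "hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet",
    "file_format": "PARQUET",
    "partition": {"split": "train"},
}

# The referenced file stays in the HF dataset repo; dropping the table
# removes only metadata like this, never the underlying Parquet.
print(entry["file_path"].rsplit("/", 1)[-1])  # train-00000-of-00001.parquet
```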
## Serve Your Catalog Locally

Run a local REST server for development or testing:

```bash
faceberg user/mycatalog serve --port 8181
```

This starts an Iceberg REST catalog server at http://localhost:8181 that any compatible tool can connect to:

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")
conn.execute("ATTACH 'http://localhost:8181' AS cat (TYPE ICEBERG)")
result = conn.execute("SELECT * FROM cat.stanfordnlp.imdb LIMIT 5").fetchdf()
print(result)
```

## Next Steps
- Local Catalogs — Use local catalogs for testing
- Architecture — Understand how Faceberg works
- DuckDB Integration — Advanced DuckDB queries
- Pandas Integration — Load data into DataFrames