Local Catalogs

Use local catalogs for testing and development

Local catalogs store metadata on your filesystem instead of HuggingFace Hub. They’re useful for:

Create a Local Catalog

from faceberg import catalog

# Create catalog in current directory
cat = catalog("mycatalog")
cat.init()

# Add datasets
cat.add_dataset("default.imdb", "stanfordnlp/imdb", config="plain_text")
cat.add_dataset("default.gsm8k", "openai/gsm8k", config="main")

# List tables
tables = cat.list_tables("default")
print(f"Tables: {[str(t) for t in tables]}")
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Tables: ['default.imdb', 'default.gsm8k']

CLI Usage

The same CLI commands work for local catalogs:

# Initialize
faceberg ./mycatalog init

# Add datasets
faceberg ./mycatalog add default.imdb stanfordnlp/imdb --config plain_text

# List tables
faceberg ./mycatalog list

# Scan data
faceberg ./mycatalog scan default.imdb --limit 5

Catalog Directory Structure

After adding datasets, your catalog looks like this:

mycatalog/
├── faceberg.yml           # Catalog configuration
└── default/               # Namespace
    ├── imdb/
    │   └── metadata/
    │       ├── v1.metadata.json    # Iceberg table metadata
    │       ├── snap-*.avro         # Manifest list (snapshot)
    │       ├── *.avro              # Manifest files
    │       └── version-hint.text   # Current version pointer
    └── gsm8k/
        └── metadata/
            └── ...

faceberg.yml

The configuration file tracks dataset mappings:

from pathlib import Path

config_path = Path("mycatalog/faceberg.yml")
print(config_path.read_text())
default:
  imdb:
    type: dataset
    repo: stanfordnlp/imdb
    config: plain_text
  gsm8k:
    type: dataset
    repo: openai/gsm8k
    config: main

Metadata Files

Each table’s metadata/ directory contains:

File Purpose
v1.metadata.json Main Iceberg table metadata (schema, partitions, snapshots)
snap-*.avro Manifest list pointing to manifest files
*.avro Manifest files with data file references
version-hint.text Current metadata version for quick lookups

Serve via REST API

Expose your local catalog via REST for testing:

# Start server on default port (8181)
faceberg ./mycatalog serve

# Custom port
faceberg ./mycatalog serve --port 9000

# Development mode with auto-reload
faceberg ./mycatalog serve --reload

The server provides:

  • REST catalog at http://localhost:8181
  • API docs at http://localhost:8181/schema/swagger

Connect from Another Process

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")

# Attach local REST catalog
conn.execute("""
    ATTACH 'http://localhost:8181' AS cat (
        TYPE ICEBERG,
        AUTHORIZATION_TYPE 'none'
    )
""")

result = conn.execute("SELECT * FROM cat.default.imdb LIMIT 5").fetchdf()

Interactive DuckDB Shell

The quack command opens DuckDB with the catalog pre-attached:

# Start REST server first (in another terminal)
faceberg ./mycatalog serve

# Open interactive shell
faceberg ./mycatalog quack

Then run SQL:

SHOW ALL TABLES;
SELECT * FROM iceberg_catalog.default.imdb LIMIT 5;

Query Without REST Server

You can also query directly using the metadata file path:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")

# Query using direct metadata path
result = conn.execute("""
    SELECT label, substr(text, 1, 80) as preview
    FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
    LIMIT 3
""").fetchdf()

print(result)
   label                                            preview
0      0  I rented I AM CURIOUS-YELLOW from my video sto...
1      0  "I Am Curious: Yellow" is a risible and preten...
2      0  If only to avoid making this type of film in t...

Testing Pattern

For tests, create isolated catalogs in temporary directories:

import tempfile
from pathlib import Path
from faceberg import LocalCatalog

# Create temp directory for test
with tempfile.TemporaryDirectory() as tmpdir:
    catalog_path = Path(tmpdir) / "test_catalog"
    catalog_path.mkdir()

    cat = LocalCatalog(
        name="test",
        uri=f"file://{catalog_path}"
    )
    cat.init()
    cat.add_dataset("default.imdb", "stanfordnlp/imdb", config="plain_text")

    # Run tests...
    table = cat.load_table("default.imdb")
    df = table.scan(limit=1).to_pandas()
    assert len(df) == 1
    print("Test passed!")
Test passed!

Next Steps