Pandas

Load Faceberg tables into Pandas DataFrames

Faceberg tables are Iceberg tables, so PyIceberg's native Pandas integration works out of the box: any table scan can be materialized as a DataFrame.

Setup

All examples below assume a local catalog at ./mycatalog containing the IMDB dataset as the table default.imdb.

Load Table to DataFrame

from faceberg import catalog

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

# Load entire table (be careful with large datasets!)
# df = table.scan().to_pandas()

# Load with limit
df = table.scan(limit=100).to_pandas()
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
Shape: (100, 3)
Columns: ['split', 'text', 'label']

Select Columns

Only load the columns you need:

from faceberg import catalog

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

# Select specific columns
df = table.scan(limit=10).select("label", "text").to_pandas()
print(df.head())
                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0

Filter Rows

Filter data before loading into memory:

from faceberg import catalog
from pyiceberg.expressions import EqualTo

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

# Filter by split (partition pruning)
df = table.scan(
    row_filter=EqualTo("split", "test"),
    limit=10
).to_pandas()

print(f"Split values: {df['split'].unique()}")
print(f"Rows: {len(df)}")
Split values: <ArrowStringArray>
['test']
Length: 1, dtype: str
Rows: 10
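
EqualTo is just one predicate in pyiceberg.expressions; predicates compose with And/Or, and row_filter also accepts a SQL-like string. A minimal sketch combining two filters on the columns above:

from faceberg import catalog
from pyiceberg.expressions import And, EqualTo

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

# Combine predicates; the filter is pushed down into the scan
df = table.scan(
    row_filter=And(EqualTo("split", "test"), EqualTo("label", 0)),
    limit=10,
).to_pandas()

# The equivalent SQL-like filter string is also accepted
df = table.scan(row_filter="split = 'test' AND label = 0", limit=10).to_pandas()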

Table Schema

Inspect the table schema:

from faceberg import catalog

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

# View schema
print("Schema:")
for field in table.schema().fields:
    print(f"  {field.name}: {field.field_type}")
Schema:
  split: string
  text: string
  label: long
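
To look up a single column programmatically, the schema exposes find_field. A small sketch (assumes the label column exists, as shown above):

from faceberg import catalog

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

# Fetch one field by name; raises an error if the column does not exist
field = table.schema().find_field("label")
print(f"{field.name}: {field.field_type} (required={field.required})")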

Table Statistics

Get table metadata without loading data:

from faceberg import catalog

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

# Current snapshot
snapshot = table.current_snapshot()
if snapshot:
    print(f"Snapshot ID: {snapshot.snapshot_id}")

    # Summary statistics
    summary = snapshot.summary
    if summary:
        print(f"Total records: {summary.get('total-records', 'N/A')}")
        print(f"Total files: {summary.get('total-data-files', 'N/A')}")
Snapshot ID: 1
Total records: 100000
Total files: 3
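
Snapshots also enable time travel: table.history() lists prior snapshots, and scan() accepts a snapshot_id. A minimal sketch, assuming the table has at least one snapshot:

from faceberg import catalog

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

# Walk the snapshot log (oldest first)
for entry in table.history():
    print(f"snapshot {entry.snapshot_id} at {entry.timestamp_ms}")

# Read the table as of a specific snapshot
snapshot = table.current_snapshot()
if snapshot:
    df = table.scan(snapshot_id=snapshot.snapshot_id, limit=10).to_pandas()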

Working with Large Datasets

For large datasets, keep data in Arrow and convert to Pandas only when needed:

from faceberg import catalog

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

# Process in batches using Arrow
scan = table.scan(limit=100)

# Get as Arrow table (more memory efficient)
arrow_table = scan.to_arrow()
print(f"Arrow table: {arrow_table.num_rows} rows, {arrow_table.num_columns} columns")

# Convert to Pandas when needed
df = arrow_table.to_pandas()
Arrow table: 100 rows, 3 columns
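
To avoid materializing even the Arrow table, recent PyIceberg versions can stream record batches with to_arrow_batch_reader(). A sketch:

from faceberg import catalog

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

# Stream pyarrow.RecordBatch objects instead of one big table
reader = table.scan(limit=100).to_arrow_batch_reader()

total = 0
for batch in reader:
    chunk_df = batch.to_pandas()  # convert one batch at a time
    total += len(chunk_df)
print(f"Processed {total} rows batch by batch")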

Data Analysis Example

from faceberg import catalog

cat = catalog("./mycatalog")
table = cat.load_table("default.imdb")

df = table.scan(limit=1000).to_pandas()

# Basic analysis
print("Label distribution:")
print(df['label'].value_counts())

print("\nText length statistics:")
df['text_length'] = df['text'].str.len()
print(df.groupby('label')['text_length'].describe())
Label distribution:
label
0    1000
Name: count, dtype: int64

Text length statistics:
        count      mean         std   min     25%    50%      75%     max
label                                                                    
0      1000.0  1311.289  980.544015  65.0  718.75  985.0  1548.25  6103.0
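
Note that all 1000 sampled rows have label 0: an unfiltered scan returns rows in table order, so a small sample may not be representative. Use a row_filter (or sample per split) when the distribution matters.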

Memory Considerations

Method                                     Use Case
scan(limit=N).to_pandas()                  Exploration, sampling
scan().select("col1", "col2").to_pandas()  Need subset of columns
scan(row_filter=...).to_pandas()           Need subset of rows
scan().to_arrow()                          Memory-efficient processing

For large datasets:

  1. Filter first — Use row_filter to reduce data before loading
  2. Select columns — Only load needed columns with select()
  3. Use Arrow — Process with to_arrow() for better memory efficiency
  4. Stream batches — Process in chunks for very large data (see the batch-reader sketch under Working with Large Datasets above)

Remote Catalogs

For catalogs on HuggingFace:

import os
from faceberg import catalog

cat = catalog("user/mycatalog", hf_token=os.environ.get("HF_TOKEN"))
table = cat.load_table("default.imdb")
df = table.scan(limit=100).to_pandas()

Next Steps