GeoParquet Best Practices

This guide explains the optimizations that make GeoParquet files fast and efficient for spatial queries.

Quick Checklist

Run `gpio check all myfile.parquet` to verify that your file follows these best practices:

[ ] Spatial ordering (Hilbert curve)
[ ] Bbox column with covering metadata
[ ] ZSTD compression
[ ] Appropriate row group sizes

Spatial Ordering

What It Is

Spatial ordering arranges rows so that geographically nearby features are stored together in the file. gpio uses Hilbert curve ordering, which maps 2D space to 1D while preserving locality.
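To make the 2D-to-1D mapping concrete, here is a minimal sketch of a Hilbert index on a square grid, using the classic bit-manipulation formulation. It is an illustration only: the function names, the grid resolution, and the whole-world extent are assumptions, and gpio's own implementation may differ.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Distance along a Hilbert curve of cell (x, y) on a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so every sub-curve has a consistent orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def sort_by_hilbert(points, order=16):
    """Sort (lon, lat) pairs by Hilbert index (assumes a whole-world extent)."""
    cells = (1 << order) - 1

    def key(point):
        lon, lat = point
        gx = int((lon + 180.0) / 360.0 * cells)  # longitude -> grid column
        gy = int((lat + 90.0) / 180.0 * cells)   # latitude  -> grid row
        return hilbert_index(order, gx, gy)

    return sorted(points, key=key)
```

Consecutive index values always belong to neighbouring grid cells, which is why sorting rows by this key tends to keep nearby features together.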

Why It Matters

Without spatial ordering:

Row 1: New York
Row 2: Tokyo
Row 3: London
Row 4: Sydney
...

With Hilbert ordering:

Row 1: New York
Row 2: Boston
Row 3: Philadelphia
Row 4: Washington DC
...

Benefits:

- Spatial queries read fewer row groups
- Better compression (similar coordinates compress well)
- Reduced I/O for bounding box filters

How to Apply

# Sort existing file
gpio sort hilbert input.parquet sorted.parquet

# Convert with automatic Hilbert ordering (default)
gpio convert input.shp output.parquet

# Convert without Hilbert ordering (faster but less optimal)
gpio convert input.shp output.parquet --skip-hilbert

Bounding Box Columns

What They Are

A bbox column stores the bounding box for each feature as a struct:

bbox: {xmin: -122.5, ymin: 37.5, xmax: -122.0, ymax: 38.0}
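As a rough illustration of where those four values come from, the sketch below derives a bbox struct from a feature's coordinate pairs. The helper name `bbox_of` is hypothetical and is not part of gpio.

```python
def bbox_of(coords):
    """Return a bbox struct for an iterable of (x, y) coordinate pairs."""
    xs, ys = zip(*coords)
    return {"xmin": min(xs), "ymin": min(ys), "xmax": max(xs), "ymax": max(ys)}


# Three vertices of a feature; the result matches the struct shown above.
EXAMPLE_BBOX = bbox_of([(-122.5, 37.5), (-122.0, 38.0), (-122.2, 37.8)])
```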

Why They Matter

Spatial queries typically need to check "does this feature intersect my area of interest?"

- Without bbox: must decode WKB geometry and compute intersection (slow)
- With bbox: compare 4 numbers (fast), only decode geometry for candidates

Performance difference: 10-100x faster for spatial filters on large files.
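The "compare 4 numbers" step can be sketched as a cheap pre-filter that runs before any geometry is decoded. This is a simplified model of what a query engine does; the row layout and function names are assumptions made for illustration.

```python
def bbox_intersects(bbox, query):
    """True if two boxes overlap (touching edges count as overlapping)."""
    return (
        bbox["xmin"] <= query["xmax"]
        and bbox["xmax"] >= query["xmin"]
        and bbox["ymin"] <= query["ymax"]
        and bbox["ymax"] >= query["ymin"]
    )


def candidate_rows(rows, query):
    """Keep only rows whose bbox overlaps the query box.

    Only these candidates need their WKB geometry decoded for an exact test.
    """
    return [row for row in rows if bbox_intersects(row["bbox"], query)]


QUERY = {"xmin": -122.5, "ymin": 37.5, "xmax": -122.0, "ymax": 38.0}
ROWS = [
    {"id": 1, "bbox": {"xmin": -122.4, "ymin": 37.7, "xmax": -122.3, "ymax": 37.8}},
    {"id": 2, "bbox": {"xmin": 139.6, "ymin": 35.6, "xmax": 139.8, "ymax": 35.8}},
    {"id": 3, "bbox": {"xmin": -122.6, "ymin": 37.4, "xmax": -122.45, "ymax": 37.55}},
]
```

Here `candidate_rows(ROWS, QUERY)` keeps rows 1 and 3; row 2 is rejected without touching its geometry.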

Covering Metadata

GeoParquet 1.1+ includes "covering" metadata that tells query engines how to use bbox columns:

"covering": {
  "bbox": {
    "xmin": ["bbox", "xmin"],
    "ymin": ["bbox", "ymin"],
    "xmax": ["bbox", "xmax"],
    "ymax": ["bbox", "ymax"]
  }
}

This enables automatic optimization in tools like DuckDB and BigQuery.
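As a sketch of how a reader might use this metadata, the snippet below resolves the covering entry into dotted column paths. The surrounding `geo` document is a trimmed example following the GeoParquet 1.1 layout, and the helper name `bbox_column_paths` is hypothetical.

```python
import json

# Trimmed example of the "geo" file metadata for a single geometry column.
GEO_METADATA = json.loads("""
{
  "version": "1.1.0",
  "primary_column": "geometry",
  "columns": {
    "geometry": {
      "encoding": "WKB",
      "geometry_types": [],
      "covering": {
        "bbox": {
          "xmin": ["bbox", "xmin"],
          "ymin": ["bbox", "ymin"],
          "xmax": ["bbox", "xmax"],
          "ymax": ["bbox", "ymax"]
        }
      }
    }
  }
}
""")


def bbox_column_paths(geo, column=None):
    """Map xmin/ymin/xmax/ymax to dotted column paths, or None if no covering."""
    column = column or geo["primary_column"]
    covering = geo["columns"][column].get("covering", {}).get("bbox")
    if covering is None:
        return None
    return {name: ".".join(path) for name, path in covering.items()}
```

An engine that finds these paths can push a bounding box filter down to the `bbox.xmin` ... `bbox.ymax` columns instead of scanning geometries.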

How to Apply

# Add bbox column with metadata
gpio add bbox input.parquet output.parquet

# Add bbox metadata to existing bbox column
gpio add bbox-metadata myfile.parquet

# Convert with automatic bbox (default)
gpio convert input.shp output.parquet

Compression

Recommendations

Use Case	Compression	Level	Rationale
General purpose	ZSTD	15	Best balance of size and speed
Maximum compression	ZSTD	22	Smaller files, slower write
Fast decompression	LZ4	-	Analytics workloads
Wide compatibility	GZIP	6	Older tools

gpio uses ZSTD level 15 by default.

Why ZSTD?

3-5x faster decompression than GZIP
Similar or better compression ratio
Widely supported in modern tools

How to Apply

# Default ZSTD compression
gpio convert input.shp output.parquet

# Maximum compression
gpio convert input.shp output.parquet --compression ZSTD --compression-level 22

# Fast decompression
gpio convert input.shp output.parquet --compression LZ4

Row Group Sizing

What Row Groups Are

Parquet files are divided into row groups - independent chunks that can be read separately. Each row group has its own statistics (min/max values).
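The following sketch shows how per-row-group min/max statistics on the bbox columns allow whole row groups to be skipped. The statistics layout and the sample values are invented for illustration; real engines read these values from the Parquet footer.

```python
def row_groups_to_read(row_group_stats, query):
    """Return indices of row groups whose statistics overlap the query box.

    query is (xmin, ymin, xmax, ymax); each stats entry holds the minimum of
    bbox.xmin / bbox.ymin and the maximum of bbox.xmax / bbox.ymax in that group.
    """
    qxmin, qymin, qxmax, qymax = query
    keep = []
    for index, stats in enumerate(row_group_stats):
        if stats["xmax_max"] < qxmin or stats["xmin_min"] > qxmax:
            continue  # entirely left or right of the query box
        if stats["ymax_max"] < qymin or stats["ymin_min"] > qymax:
            continue  # entirely below or above the query box
        keep.append(index)
    return keep


# Hypothetical statistics for a spatially sorted file with three row groups.
STATS = [
    {"xmin_min": -77.1, "xmax_max": -71.0, "ymin_min": 38.8, "ymax_max": 42.4},
    {"xmin_min": -0.5, "xmax_max": 13.5, "ymin_min": 48.8, "ymax_max": 52.6},
    {"xmin_min": 135.4, "xmax_max": 139.8, "ymin_min": 34.6, "ymax_max": 35.8},
]
NYC_QUERY = (-74.3, 40.4, -73.6, 41.0)
```

With these sample statistics, `row_groups_to_read(STATS, NYC_QUERY)` selects only the first row group. In an unsorted file each group's statistics would span most of the globe, so few groups could be skipped.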

Optimal Sizes

Metric	Recommendation
Compressed size	50-100 MB per row group
Row count	50,000-150,000 rows (depends on data)
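The two recommendations are linked by the average compressed size of a row. As a back-of-the-envelope sketch (the function name and the 64 MB default are assumptions, and real row sizes vary within a file), a row count can be estimated from a size target like this:

```python
def rows_per_row_group(file_bytes, total_rows, target_mb=64):
    """Estimate the row count that yields roughly target_mb per row group."""
    bytes_per_row = file_bytes / total_rows
    return max(1, round(target_mb * 1024 * 1024 / bytes_per_row))


# A 1 GiB compressed file holding 2,000,000 rows averages about 537 bytes per row.
EXAMPLE_ROWS = rows_per_row_group(1024**3, 2_000_000, target_mb=64)
```

For this example the estimate is 125,000 rows per row group, which falls inside the recommended range.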

Why Size Matters

Too small:

- Excessive metadata overhead
- More seeks for sequential reads
- Reduced compression efficiency

Too large:

- Must read entire row group even for small queries
- Higher memory usage during processing

How to Control

# Target row group size in MB
gpio extract input.parquet output.parquet --row-group-size-mb 64MB

# Exact row count
gpio extract input.parquet output.parquet --row-group-size 100000

Complete Optimization Pipeline

For a new file:

# 1. Convert with all optimizations (default)
gpio convert input.shp optimized.parquet

# 2. Verify optimizations
gpio check all optimized.parquet

For an existing GeoParquet file:

# 1. Check current state
gpio check all existing.parquet

# 2. Add bbox if missing
gpio add bbox existing.parquet with_bbox.parquet

# 3. Apply spatial ordering
gpio sort hilbert with_bbox.parquet optimized.parquet

# 4. Verify
gpio check all optimized.parquet

Or let gpio fix everything:

# Auto-fix all issues
gpio check all existing.parquet --fix --fix-output optimized.parquet

Measuring Improvement

Compare query performance before and after optimization:

# Time a spatial query.
# ST_Intersects comes from the DuckDB spatial extension
# (run INSTALL spatial once beforehand).
time duckdb -c "
  LOAD spatial;
  SELECT COUNT(*)
  FROM 'unoptimized.parquet'
  WHERE ST_Intersects(geometry, ST_GeomFromText('POLYGON(...)'))
"

time duckdb -c "
  LOAD spatial;
  SELECT COUNT(*)
  FROM 'optimized.parquet'
  WHERE ST_Intersects(geometry, ST_GeomFromText('POLYGON(...)'))
"

Typical improvements: 5-20x faster for spatial queries.

Large File Processing

gpio handles larger-than-memory files efficiently through its write strategy system.

Default Behavior

The default `duckdb-kv` strategy uses constant memory regardless of file size:

# Process a 100GB file on an 8GB machine - just works
gpio extract huge_file.parquet filtered.parquet --bbox -122.5,37.5,-122.0,38.0

Memory Configuration

gpio automatically detects available memory and uses 50% of it for streaming operations. This is container-aware, respecting Docker/Kubernetes memory limits.
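The 50% rule with a container limit can be pictured as follows. This is a hypothetical sketch of the arithmetic only: the function name and parameters are assumptions, and it does not show how gpio actually detects memory or reads container limits.

```python
GIB = 1024**3


def streaming_memory_budget(available_bytes, container_limit_bytes=None, fraction=0.5):
    """Return the memory budget: a fraction of the smaller of host and container memory."""
    usable = available_bytes
    if container_limit_bytes is not None:
        # A container limit below host memory takes precedence.
        usable = min(usable, container_limit_bytes)
    return int(usable * fraction)
```

Under these assumptions, an 8 GiB host with no container limit gets a 4 GiB budget, while the same host running a container capped at 2 GiB gets 1 GiB.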

For explicit control:

# Limit memory in constrained environments
gpio extract input.parquet output.parquet --write-memory 512MB

Strategy Selection

Use Case	Recommended Approach
Production workloads	Use default (`duckdb-kv`)
Container deployments	Default + explicit `--write-memory` if needed
Debugging output issues	Use `--write-strategy in-memory` to verify
DuckDB compatibility issues	Try `--write-strategy streaming`

Tips for Large Files

Skip Hilbert for faster writes: pass `--skip-hilbert` during initial conversion
Use a bbox column: it enables row group pruning for spatial queries
Choose appropriate row group sizes: target 50-100 MB compressed per row group
Process locally when possible: download remote files first for very large datasets (>50GB)

See the Write Strategies Guide for detailed information.
