
Python API

gpio provides a fluent Python API for GeoParquet transformations. This API offers the best performance of gpio's interfaces by keeping data in memory as Arrow tables, avoiding intermediate file I/O between steps.

Installation

pip install geoparquet-io
# or: uv add geoparquet-io

Quick Start

import geoparquet_io as gpio

# Read, transform, and write in a fluent chain
gpio.read('input.parquet') \
    .add_bbox() \
    .add_quadkey(resolution=12) \
    .sort_hilbert() \
    .write('output.parquet')

Reading Data

Use gpio.read() to load a GeoParquet file:

import geoparquet_io as gpio

# Read a file
table = gpio.read('places.parquet')

# Access properties
print(f"Rows: {table.num_rows}")
print(f"Columns: {table.column_names}")
print(f"Geometry column: {table.geometry_column}")

Reading from BigQuery

Use Table.from_bigquery() to read directly from BigQuery tables. The table_id parameter accepts fully-qualified "project.dataset.table" format or "dataset.table" when a separate project argument is provided (or when using your default gcloud project). When using bbox, provide coordinates as "minx,miny,maxx,maxy" representing longitude,latitude in EPSG:4326 degrees (e.g., "-122.52,37.70,-122.35,37.82").

import geoparquet_io as gpio

# Basic read
table = gpio.Table.from_bigquery('myproject.geodata.buildings')

# With filtering
table = gpio.Table.from_bigquery(
    'myproject.geodata.buildings',
    where="area_sqm > 1000",
    columns=['id', 'name', 'geography'],
    limit=10000
)

# With spatial filtering (bbox)
table = gpio.Table.from_bigquery(
    'myproject.geodata.buildings',
    bbox="-122.52,37.70,-122.35,37.82"
)

# With explicit credentials
table = gpio.Table.from_bigquery(
    'myproject.geodata.buildings',
    credentials_file='/path/to/service-account.json'
)

# Chain with other operations
gpio.Table.from_bigquery('myproject.geodata.buildings', limit=10000) \
    .add_bbox() \
    .sort_hilbert() \
    .write('output.parquet')

Bbox filtering modes:

When using bbox, control where filtering happens with bbox_mode:

# Server-side filtering (best for large tables)
table = gpio.Table.from_bigquery(
    'myproject.geodata.global_buildings',
    bbox="-122.52,37.70,-122.35,37.82",
    bbox_mode="server"
)

# Local filtering (best for small tables)
table = gpio.Table.from_bigquery(
    'myproject.geodata.city_parks',
    bbox="-122.52,37.70,-122.35,37.82",
    bbox_mode="local"
)

# Custom threshold for auto mode (default: 500000)
table = gpio.Table.from_bigquery(
    'myproject.geodata.buildings',
    bbox="-122.52,37.70,-122.35,37.82",
    bbox_threshold=100000  # Use server for tables > 100K rows
)
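
The bbox string is nothing more than four comma-separated numbers in minx,miny,maxx,maxy order. As a stand-alone sketch (the `parse_bbox` helper below is illustrative, not part of the gpio API), parsing and sanity-checking one looks like this:

```python
def parse_bbox(bbox: str) -> tuple[float, float, float, float]:
    """Parse a "minx,miny,maxx,maxy" string into a tuple of floats."""
    parts = [float(p) for p in bbox.split(",")]
    if len(parts) != 4:
        raise ValueError("bbox must have exactly four values: minx,miny,maxx,maxy")
    minx, miny, maxx, maxy = parts
    if minx > maxx or miny > maxy:
        raise ValueError("bbox min values must not exceed max values")
    return (minx, miny, maxx, maxy)

print(parse_bbox("-122.52,37.70,-122.35,37.82"))
# (-122.52, 37.7, -122.35, 37.82)
```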

See the Extract Guide for detailed tradeoff analysis.

BigQuery Limitations

  • Cannot read views or external tables (Storage Read API limitation)
  • BIGNUMERIC columns are not supported

Reading from ArcGIS Feature Services

Use gpio.extract_arcgis() to download features from ArcGIS REST Feature Services. Server-side filtering is applied for efficient data transfer.

import geoparquet_io as gpio

# Basic read from public service
table = gpio.extract_arcgis(
    'https://services.arcgis.com/.../FeatureServer/0'
)

# With server-side filtering
table = gpio.extract_arcgis(
    'https://services.arcgis.com/.../FeatureServer/0',
    where="STATE_NAME = 'California'",
    bbox=(-122.5, 37.5, -122.0, 38.0),
    include_cols='NAME,POPULATION,STATE_NAME',
    limit=10000
)

# With authentication
table = gpio.extract_arcgis(
    'https://services.arcgis.com/.../FeatureServer/0',
    token='your_arcgis_token'
)

# With username/password authentication
table = gpio.extract_arcgis(
    'https://services.arcgis.com/.../FeatureServer/0',
    username='myuser',
    password='mypassword'
)

# Chain with other operations
gpio.extract_arcgis(
    'https://services.arcgis.com/.../FeatureServer/0',
    limit=10000
).add_bbox().sort_hilbert().write('output.parquet')

Parameters:

Parameter Type Description
service_url str ArcGIS Feature Service URL with layer ID
token str Direct authentication token
token_file str Path to file containing token
username str ArcGIS Online/Enterprise username
password str ArcGIS password (requires username)
portal_url str Enterprise portal URL for token generation
where str SQL WHERE clause (default: "1=1" = all)
bbox tuple Bounding box filter (xmin, ymin, xmax, ymax) in WGS84
include_cols str Comma-separated columns to include
exclude_cols str Comma-separated columns to exclude
limit int Maximum number of features

No automatic Hilbert sorting

Unlike the CLI gpio extract arcgis command, the Python API does NOT apply Hilbert sorting by default. Chain .sort_hilbert() explicitly if you want spatial ordering.

Table Class

The Table class wraps a PyArrow Table and provides chainable transformation methods.

Properties

Property Description
num_rows Number of rows in the table
column_names List of column names
geometry_column Name of the geometry column
crs CRS as PROJJSON dict or string (None = OGC:CRS84 default)
bounds Bounding box tuple (xmin, ymin, xmax, ymax)
schema PyArrow Schema object
geoparquet_version GeoParquet version string (e.g., "1.1")

table = gpio.read('data.parquet')

# Get CRS
print(table.crs)  # e.g., {'id': {'authority': 'EPSG', 'code': 4326}, ...}

# Get bounds
print(table.bounds)  # e.g., (-122.5, 37.5, -122.0, 38.0)

# Get schema
for field in table.schema:
    print(f"{field.name}: {field.type}")

Methods

info(verbose=True)

Print or return summary information about the table.

# Print formatted summary
table.info()
# Table: 766 rows, 6 columns
# Geometry: geometry
# CRS: EPSG:4326
# Bounds: [-122.500000, 37.500000, -122.000000, 38.000000]
# GeoParquet: 1.1

# Get as dictionary
info_dict = table.info(verbose=False)
print(info_dict['rows'])  # 766
print(info_dict['crs'])   # None or CRS dict

head(n=10) / tail(n=10)

Get the first or last N rows.

# First 10 rows (default)
first_rows = table.head()

# First 50 rows
first_50 = table.head(50)

# Last 10 rows (default)
last_rows = table.tail()

# Last 5 rows
last_5 = table.tail(5)

# Chain with other operations
preview = table.head(100).add_bbox()

stats()

Calculate column statistics.

stats = table.stats()

# Access stats for a column
print(stats['population']['min'])     # Minimum value
print(stats['population']['max'])     # Maximum value
print(stats['population']['nulls'])   # Null count
print(stats['population']['unique'])  # Approximate unique count

# Geometry columns have only null counts
print(stats['geometry']['nulls'])

metadata(include_parquet_metadata=False)

Get GeoParquet and schema metadata.

meta = table.metadata()

# Access metadata
print(meta['geoparquet_version'])  # e.g., '1.1.0'
print(meta['geometry_column'])     # e.g., 'geometry'
print(meta['crs'])                 # CRS dict or None
print(meta['bounds'])              # (xmin, ymin, xmax, ymax)
print(meta['columns'])             # List of column info dicts

# Full geo metadata from 'geo' key
geo_meta = meta.get('geo_metadata', {})

# Include raw Parquet schema metadata
full_meta = table.metadata(include_parquet_metadata=True)

to_geojson(output_path=None, precision=7, write_bbox=False, id_field=None)

Convert to GeoJSON.

# Write to file
table.to_geojson('output.geojson')

# With options
table.to_geojson('output.geojson', precision=5, write_bbox=True)

# Get as string (no file output)
geojson_str = table.to_geojson()

add_bbox(column_name='bbox')

Add a bounding box struct column computed from geometry.

table = gpio.read('input.parquet').add_bbox()
# or with custom name
table = gpio.read('input.parquet').add_bbox(column_name='bounds')

add_quadkey(column_name='quadkey', resolution=13, use_centroid=False)

Add a quadkey column based on geometry location.

# Default resolution (13)
table = gpio.read('input.parquet').add_quadkey()

# Custom resolution
table = gpio.read('input.parquet').add_quadkey(resolution=10)

# Force centroid calculation even if bbox exists
table = gpio.read('input.parquet').add_quadkey(use_centroid=True)
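
A quadkey encodes a Web Mercator tile address as a base-4 string whose length equals the resolution, so nearby points share long common prefixes. The classic computation from a point, shown as an illustrative stand-alone sketch (`point_to_quadkey` is not gpio's internal code):

```python
import math

def point_to_quadkey(lon: float, lat: float, resolution: int) -> str:
    """Compute the Bing-style quadkey of the Web Mercator tile containing a point."""
    # Project lon/lat to tile coordinates at the given zoom level.
    sin_lat = math.sin(math.radians(lat))
    x = int((lon + 180.0) / 360.0 * (1 << resolution))
    y = int((0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi))
            * (1 << resolution))
    # Interleave the x/y bits into base-4 digits, most significant first.
    digits = []
    for i in range(resolution, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

print(point_to_quadkey(-122.4, 37.8, 3))  # prints 023
```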

add_h3(column_name='h3_cell', resolution=9)

Add an H3 hexagonal cell column based on geometry location.

# Default resolution (9, ~100m cells)
table = gpio.read('input.parquet').add_h3()

# Lower resolution for larger cells
table = gpio.read('input.parquet').add_h3(resolution=6)

# Custom column name
table = gpio.read('input.parquet').add_h3(column_name='hex_id', resolution=8)

add_s2(column_name='s2_cell', level=13)

Add an S2 spherical cell column based on geometry location.

# Default level (13, ~1.2 km² cells)
table = gpio.read('input.parquet').add_s2()

# Lower level for larger cells
table = gpio.read('input.parquet').add_s2(level=10)

# Custom column name
table = gpio.read('input.parquet').add_s2(column_name='s2_index', level=15)

add_kdtree(column_name='kdtree_cell', iterations=9, sample_size=100000)

Add a KD-tree cell column for data-adaptive spatial partitioning.

# Default settings (512 partitions = 2^9)
table = gpio.read('input.parquet').add_kdtree()

# Fewer partitions
table = gpio.read('input.parquet').add_kdtree(iterations=6)  # 64 partitions

# More partitions with larger sample
table = gpio.read('input.parquet').add_kdtree(iterations=12, sample_size=500000)
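
Each KD-tree iteration splits every cell at the median of its points, alternating the split axis, which is why iterations=k yields 2^k partitions. A toy stand-alone illustration of that idea (`kdtree_cells` is not gpio's implementation):

```python
def kdtree_cells(points, iterations, axis=0):
    """Recursively median-split points, alternating axes; returns the leaf cells."""
    if iterations == 0 or len(points) <= 1:
        return [points]
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    nxt = 1 - axis  # alternate between x (axis 0) and y (axis 1)
    return (kdtree_cells(ordered[:mid], iterations - 1, nxt)
            + kdtree_cells(ordered[mid:], iterations - 1, nxt))

pts = [(x, y) for x in range(4) for y in range(4)]  # 16 points on a grid
cells = kdtree_cells(pts, iterations=2)
print(len(cells))  # 2^2 = 4 cells, each holding a quarter of the points
```

Because splits follow the data's medians rather than a fixed grid, cells end up with roughly equal row counts even for skewed datasets.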

sort_hilbert()

Reorder rows using Hilbert curve ordering for better spatial locality.

table = gpio.read('input.parquet').sort_hilbert()
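
Hilbert sorting maps each geometry's grid position to its distance along a space-filling Hilbert curve and orders rows by that value, so spatial neighbors land in the same row groups. The classic xy-to-index conversion, shown as an illustrative stand-alone sketch (not gpio's internal code):

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Distance along the Hilbert curve for cell (x, y) on an n-by-n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the curve's orientation stays consistent.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Sorting cells by Hilbert index keeps spatial neighbors adjacent in the output.
cells = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(sorted(cells, key=lambda c: hilbert_index(2, *c)))
# [(0, 0), (0, 1), (1, 1), (1, 0)]
```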

sort_column(column_name, descending=False)

Sort rows by a specified column.

# Sort by name ascending
table = gpio.read('input.parquet').sort_column('name')

# Sort by population descending
table = gpio.read('input.parquet').sort_column('population', descending=True)

sort_quadkey(column_name='quadkey', resolution=13, use_centroid=False, remove_column=False)

Sort rows by quadkey for spatial locality. If no quadkey column exists, one is added automatically.

# Sort by quadkey (auto-adds column if needed)
table = gpio.read('input.parquet').sort_quadkey()

# Sort and remove the quadkey column afterward
table = gpio.read('input.parquet').sort_quadkey(remove_column=True)

# Use existing quadkey column
table = gpio.read('input.parquet').sort_quadkey(column_name='my_quadkey')

reproject(target_crs='EPSG:4326', source_crs=None)

Reproject geometry to a different coordinate reference system.

# Reproject to WGS84 (auto-detects source CRS from metadata)
table = gpio.read('input.parquet').reproject(target_crs='EPSG:4326')

# Reproject with explicit source CRS
table = gpio.read('input.parquet').reproject(
    target_crs='EPSG:3857',
    source_crs='EPSG:4326'
)

extract(columns=None, exclude_columns=None, bbox=None, where=None, limit=None)

Filter columns and rows.

# Select specific columns
table = gpio.read('input.parquet').extract(columns=['name', 'address'])

# Exclude columns
table = gpio.read('input.parquet').extract(exclude_columns=['temp_id'])

# Limit rows
table = gpio.read('input.parquet').extract(limit=1000)

# Spatial filter
table = gpio.read('input.parquet').extract(bbox=(-122.5, 37.5, -122.0, 38.0))

# SQL WHERE clause
table = gpio.read('input.parquet').extract(where="population > 10000")

write(path, compression='ZSTD', compression_level=None, row_group_size_mb=None, row_group_rows=None, write_strategy=None, write_memory=None)

Write the table to a GeoParquet file. Returns the output Path for chaining or confirmation.

# Basic write
path = table.write('output.parquet')
print(f"Wrote to {path}")

# With compression options
table.write('output.parquet', compression='GZIP', compression_level=6)

# With row group size
table.write('output.parquet', row_group_size_mb=128)

Write Strategy Options

For large files, you can control memory usage with write strategies:

# Use streaming strategy (constant memory usage)
table.write('output.parquet', write_strategy='streaming')

# Limit DuckDB memory for containerized environments
table.write('output.parquet', write_memory='512MB')

# Combine both options
table.write('output.parquet', write_strategy='streaming', write_memory='1GB')

Parameter Type Description
write_strategy str Write strategy: duckdb-kv (default), streaming, disk-rewrite, or in-memory
write_memory str DuckDB memory limit (e.g., '2GB', '512MB'). Auto-detected if not specified

See the Write Strategies Guide for detailed information on each strategy.

to_arrow()

Get the underlying PyArrow Table for interop with other Arrow-based tools.

arrow_table = table.to_arrow()

Spatial Partitioning Methods

All spatial partitioning methods support automatic resolution calculation via the CLI's --auto flag. The Python API currently requires an explicit resolution; auto-resolution support is planned.

partition_by_quadkey(output_dir, resolution=13, partition_resolution=6, compression='ZSTD', hive=True, overwrite=False)

Partition the table into a Hive-partitioned directory by quadkey.

# Partition to a directory
stats = table.partition_by_quadkey('output/', resolution=12)
print(f"Created {stats['file_count']} files")

# With custom options
stats = table.partition_by_quadkey(
    'output/',
    partition_resolution=4,
    compression='SNAPPY',
    overwrite=True
)

partition_by_h3(output_dir, resolution=9, compression='ZSTD', hive=True, overwrite=False)

Partition the table into a Hive-partitioned directory by H3 cell.

# Partition by H3
stats = table.partition_by_h3('output/', resolution=6)
print(f"Created {stats['file_count']} files")

partition_by_s2(output_dir, level=13, compression='ZSTD', hive=True, overwrite=False)

Partition the table into a Hive-partitioned directory by S2 cell.

# Partition by S2
stats = table.partition_by_s2('output/', level=10)
print(f"Created {stats['file_count']} files")

partition_by_a5(output_dir, resolution=15, compression='ZSTD', hive=True, overwrite=False)

Partition the table into a Hive-partitioned directory by A5 cell.

# Partition by A5
stats = table.partition_by_a5('output/', resolution=12)
print(f"Created {stats['file_count']} files")

partition_by_string(output_dir, column, chars=None, hive=True, overwrite=False)

Partition by string column values or prefixes.

# Partition by full column values
stats = table.partition_by_string('output/', column='category')

# Partition by first 2 characters
stats = table.partition_by_string('output/', column='mgrs_code', chars=2)
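
With chars=N, rows are grouped by the first N characters of the column value rather than the full value. Conceptually (the `partition_keys` helper is a stand-alone sketch, not the library's code):

```python
from collections import defaultdict

def partition_keys(values, chars=None):
    """Group values by full value, or by their first `chars` characters."""
    groups = defaultdict(list)
    for v in values:
        groups[v if chars is None else v[:chars]].append(v)
    return dict(groups)

print(partition_keys(["10SEG", "10SFH", "11TKL"], chars=2))
# {'10': ['10SEG', '10SFH'], '11': ['11TKL']}
```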

partition_by_kdtree(output_dir, iterations=9, hive=True, overwrite=False)

Partition by KD-tree spatial cells.

# Default (512 partitions = 2^9)
stats = table.partition_by_kdtree('output/')

# 64 partitions (2^6)
stats = table.partition_by_kdtree('output/', iterations=6)

partition_by_admin(output_dir, dataset='gaul', levels=None, hive=True, overwrite=False)

Partition by administrative boundaries.

# Partition by country using GAUL dataset
stats = table.partition_by_admin('output/', dataset='gaul', levels=['country'])

# Multi-level hierarchical
stats = table.partition_by_admin(
    'output/',
    dataset='gaul',
    levels=['continent', 'country', 'department'],
    hive=True
)

Sub-Partitioning Utilities

For working with directories of partitioned files, gpio provides utilities to find and sub-partition large files.

find_large_files(directory, min_size_bytes, recursive=True)

Find parquet files exceeding a size threshold.

from geoparquet_io.core.sub_partition import find_large_files

# Find files over 100MB
large_files = find_large_files('/data/partitions/', min_size_bytes=100 * 1024 * 1024)
print(f"Found {len(large_files)} large files")
for file_path in large_files:
    print(f"  {file_path}")

Parameters:

  • directory (str): Directory to search
  • min_size_bytes (int): Minimum file size in bytes
  • recursive (bool): Search subdirectories (default: True)

Returns: List of file paths sorted by size (largest first)
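
The scan is equivalent to a recursive glob filtered by size and sorted largest-first. A stdlib-only sketch of the same logic (`find_large_parquet` is illustrative, not the library's code), handy if you need this outside gpio:

```python
from pathlib import Path

def find_large_parquet(directory: str, min_size_bytes: int, recursive: bool = True):
    """Return parquet file paths at or above min_size_bytes, largest first."""
    pattern = "**/*.parquet" if recursive else "*.parquet"
    files = [p for p in Path(directory).glob(pattern)
             if p.stat().st_size >= min_size_bytes]
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)
```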

sub_partition_directory(directory, partition_type, min_size_bytes, resolution=None, level=None, in_place=False, hive=False, overwrite=False, verbose=False, force=False, skip_analysis=True, compression='ZSTD', compression_level=15, auto=False, target_rows=100000, max_partitions=10000)

Sub-partition large files in a directory using spatial indexing.

from geoparquet_io.core.sub_partition import sub_partition_directory

# Sub-partition all H3-partitioned files over 100MB
result = sub_partition_directory(
    directory='/data/h3_partitions/',
    partition_type='h3',
    min_size_bytes=100 * 1024 * 1024,
    resolution=4,
    in_place=True,  # Replace originals
    verbose=True
)

print(f"Processed: {result['processed']}")
print(f"Errors: {len(result['errors'])}")

# Sub-partition S2 files with auto-resolution
result = sub_partition_directory(
    directory='/data/s2_partitions/',
    partition_type='s2',
    min_size_bytes=50 * 1024 * 1024,
    auto=True,
    target_rows=50000,
    skip_analysis=True  # Skip per-file analysis for speed
)

# Sub-partition quadkey files
result = sub_partition_directory(
    directory='/data/quadkey_partitions/',
    partition_type='quadkey',
    min_size_bytes=200 * 1024 * 1024,
    resolution=8,
    hive=True
)

Parameters:

  • directory (str): Directory containing parquet files
  • partition_type (str): Type of partition ("h3", "s2", "quadkey")
  • min_size_bytes (int): Minimum file size to process
  • resolution (int | None): Resolution for H3/quadkey (0-15 for H3)
  • level (int | None): Level for S2 (alias for resolution)
  • in_place (bool): Delete originals after successful sub-partition (default: False)
  • hive (bool): Use Hive-style partitioning (default: False)
  • overwrite (bool): Overwrite existing output directories (default: False)
  • verbose (bool): Print verbose output (default: False)
  • force (bool): Force operation even with warnings (default: False)
  • skip_analysis (bool): Skip partition analysis for performance (default: True)
  • compression (str): Compression codec (default: "ZSTD")
  • compression_level (int): Compression level (default: 15)
  • auto (bool): Auto-calculate resolution (default: False)
  • target_rows (int): Target rows per partition for auto mode (default: 100000)
  • max_partitions (int): Max partitions for auto mode (default: 10000)

Returns: Dictionary with keys:

  • processed (int): Number of files successfully processed
  • skipped (int): Number of files skipped (below threshold)
  • errors (list): List of dicts with keys file and error

Note: When auto=True, the function automatically calculates the best resolution based on data distribution. Use skip_analysis=True for faster batch processing when you trust the resolution settings.

add_admin_divisions(dataset='overture', levels=None, country_filter=None, use_centroid=False)

Add administrative division columns via spatial join.

# Add country codes
enriched = table.add_admin_divisions(
    dataset='overture',
    levels=['country']
)

# Add multiple levels with country filter
enriched = table.add_admin_divisions(
    dataset='gaul',
    levels=['continent', 'country', 'department'],
    country_filter='US'
)

add_bbox_metadata(bbox_column='bbox')

Add bbox covering metadata to the table schema.

# Add bbox column and metadata in one chain
table_with_bbox = table.add_bbox().add_bbox_metadata()

# Or add metadata to existing bbox column
table_with_meta = table.add_bbox_metadata()

check() / check_spatial() / check_compression() / check_bbox() / check_row_groups()

Run best-practice checks on the table.

# Run all checks
result = table.check()
if result.passed():
    print("All checks passed!")
else:
    for failure in result.failures():
        print(f"Failed: {failure}")

# Individual checks
spatial_result = table.check_spatial()
compression_result = table.check_compression()
bbox_result = table.check_bbox()
row_group_result = table.check_row_groups()

# Access results as dictionary
details = result.to_dict()

validate(version=None)

Validate against GeoParquet specification.

result = table.validate()
if result.passed():
    print(f"Valid GeoParquet {table.geoparquet_version}")

# Validate against specific version
result = table.validate(version='1.1')

upload(destination, compression='ZSTD', profile=None, s3_endpoint=None, ...)

Write and upload the table to cloud object storage (S3, GCS, Azure).

# Upload to S3
gpio.read('input.parquet') \
    .add_bbox() \
    .sort_hilbert() \
    .upload('s3://bucket/data.parquet')

# Upload with AWS profile
table.upload('s3://bucket/data.parquet', profile='my-aws-profile')

# Upload to S3-compatible storage (MinIO, source.coop)
table.upload(
    's3://bucket/data.parquet',
    s3_endpoint='minio.example.com:9000',
    s3_use_ssl=False
)

# Upload to GCS
table.upload('gs://bucket/data.parquet')

Converting Other Formats

Reading Other Formats (to GeoParquet)

Use gpio.convert() to load GeoPackage, Shapefile, GeoJSON, FlatGeobuf, or CSV files:

import geoparquet_io as gpio

# Convert GeoPackage
table = gpio.convert('data.gpkg')

# Convert Shapefile
table = gpio.convert('data.shp')

# Convert GeoJSON
table = gpio.convert('data.geojson')

# Convert CSV with WKT geometry
table = gpio.convert('data.csv', wkt_column='geometry')

# Convert CSV with lat/lon columns
table = gpio.convert('data.csv', lat_column='latitude', lon_column='longitude')

# Convert from S3 with authentication
table = gpio.convert('s3://bucket/data.gpkg', profile='my-aws')

Unlike the CLI convert command, the Python API does NOT apply Hilbert sorting by default. Chain .sort_hilbert() explicitly if you want spatial ordering:

# Full conversion workflow
gpio.convert('data.shp') \
    .add_bbox() \
    .sort_hilbert() \
    .write('output.parquet')

Writing to Other Formats (from GeoParquet)

The Table.write() method supports multiple output formats with automatic format detection:

import geoparquet_io as gpio

# Read GeoParquet
table = gpio.read('data.parquet')

# Write to different formats (auto-detected from extension)
table.write('output.gpkg')      # GeoPackage
table.write('output.fgb')       # FlatGeobuf
table.write('output.csv')       # CSV with WKT
table.write('output.shp')       # Shapefile
table.write('output.geojson')   # GeoJSON

# Or specify format explicitly
table.write('output.dat', format='csv')

Format-Specific Options

GeoPackage:

table.write('output.gpkg',
           layer_name='buildings',  # Custom layer name
           overwrite=True)          # Overwrite existing file

Shapefile:

table.write('output.shp',
           encoding='ISO-8859-1',  # Custom encoding (default: UTF-8)
           overwrite=True)

Shapefile Limitations

Shapefiles have significant limitations:

  • Column names truncated to 10 characters
  • File size limit of 2GB
  • Limited data type support
  • Creates multiple files (.shp, .shx, .dbf, .prj)

Consider using GeoPackage or FlatGeobuf for new projects.

CSV:

table.write('output.csv',
           include_wkt=True,    # Include WKT geometry (default)
           include_bbox=False)  # Exclude bbox column

GeoJSON:

table.write('output.geojson',
           precision=5,             # Coordinate precision (default: 7)
           write_bbox=True,         # Include bbox for each feature
           id_field='osm_id',       # Use field as feature ID
           pretty=True,             # Pretty-print JSON
           keep_crs=False)          # Reproject to WGS84 (default)

Using ops Functions for Format Conversion

For functional-style programming, use ops.convert_to_*() functions:

from geoparquet_io import ops
import pyarrow.parquet as pq

# Read Arrow table
table = pq.read_table('data.parquet')

# Convert to various formats
ops.convert_to_geopackage(table, 'output.gpkg', layer_name='features')
ops.convert_to_flatgeobuf(table, 'output.fgb')
ops.convert_to_csv(table, 'output.csv', include_wkt=True)
ops.convert_to_shapefile(table, 'output.shp', encoding='UTF-8')
ops.convert_to_geojson(table, 'output.geojson', precision=7)

Reading Partitioned Data

Use gpio.read_partition() to read Hive-partitioned datasets:

import geoparquet_io as gpio

# Read from a partitioned directory
table = gpio.read_partition('partitioned_output/')

# Read with glob pattern
table = gpio.read_partition('data/quadkey=*/*.parquet')

# Allow schema differences across partitions
table = gpio.read_partition('output/', allow_schema_diff=True)

Method Chaining

All transformation methods return a new Table, enabling fluent chains:

result = gpio.read('input.parquet') \
    .extract(limit=10000) \
    .add_bbox() \
    .add_quadkey(resolution=12) \
    .sort_hilbert()

result.write('output.parquet')

Pure Functions (ops module)

For integration with other Arrow workflows, use the ops module which provides pure functions:

import pyarrow.parquet as pq
from geoparquet_io.api import ops

# Read with PyArrow
table = pq.read_table('input.parquet')

# Apply transformations
table = ops.add_bbox(table)
table = ops.add_quadkey(table, resolution=12)
table = ops.sort_hilbert(table)

# Write with PyArrow
pq.write_table(table, 'output.parquet')

Note: pq.write_table() may not preserve all GeoParquet metadata (such as the geo key with CRS and geometry column info). For proper metadata preservation, wrap the result in Table(table).write('output.parquet') or use write_parquet_with_metadata() from geoparquet_io.core.common. The fluent API's .write() method is recommended.

Available Functions

Function Description
ops.add_bbox(table, column_name='bbox', geometry_column=None) Add bounding box column
ops.add_quadkey(table, column_name='quadkey', resolution=13, use_centroid=False, geometry_column=None) Add quadkey column
ops.add_h3(table, column_name='h3_cell', resolution=9, geometry_column=None) Add H3 cell column
ops.add_s2(table, column_name='s2_cell', level=13, geometry_column=None) Add S2 cell column
ops.add_kdtree(table, column_name='kdtree_cell', iterations=9, sample_size=100000, geometry_column=None) Add KD-tree cell column
ops.sort_hilbert(table, geometry_column=None) Reorder by Hilbert curve
ops.sort_column(table, column, descending=False) Sort by column(s)
ops.sort_quadkey(table, column_name='quadkey', resolution=13, use_centroid=False, remove_column=False) Sort by quadkey
ops.reproject(table, target_crs='EPSG:4326', source_crs=None, geometry_column=None) Reproject geometry
ops.extract(table, columns=None, exclude_columns=None, bbox=None, where=None, limit=None, geometry_column=None) Filter columns/rows
ops.read_bigquery(table_id, project=None, credentials_file=None, where=None, bbox=None, bbox_mode='auto', bbox_threshold=500000, limit=None, columns=None, exclude_columns=None) Read BigQuery table
ops.from_arcgis(service_url, token=None, where='1=1', bbox=None, include_cols=None, exclude_cols=None, limit=None) Fetch ArcGIS Feature Service

Pipeline Composition

Use pipe() to create reusable transformation pipelines:

from geoparquet_io.api import pipe, read

# Define a reusable pipeline
preprocess = pipe(
    lambda t: t.add_bbox(),
    lambda t: t.add_quadkey(resolution=12),
    lambda t: t.sort_hilbert(),
)

# Apply to any table
result = preprocess(read('input.parquet'))
result.write('output.parquet')

# Or with ops functions
from geoparquet_io.api import ops

transform = pipe(
    lambda t: ops.add_bbox(t),
    lambda t: ops.add_quadkey(t, resolution=10),
    lambda t: ops.extract(t, limit=1000),
)

import pyarrow.parquet as pq
table = pq.read_table('input.parquet')
result = transform(table)
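
Under the hood, pipe is ordinary left-to-right function composition. A stand-alone equivalent (illustrative; not necessarily how the library implements it):

```python
from functools import reduce

def pipe(*funcs):
    """Compose functions left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, fn: fn(acc), funcs, value)

double_then_inc = pipe(lambda x: x * 2, lambda x: x + 1)
print(double_then_inc(5))  # 11
```

Because the composed pipeline is just a function, it can be passed around, reused across tables, or nested inside other pipelines.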

Performance

The Python API provides the best performance because:

  1. No intermediate file I/O: Data stays in memory as Arrow tables between steps
  2. Zero-copy: Arrow's columnar format enables efficient operations
  3. DuckDB backend: Spatial operations use DuckDB's optimized engine

Benchmark comparison (75MB file, 400K rows):

Approach Time Speedup
File-based CLI 34s baseline
Piped CLI 16s 53% faster
Python API 7s 78% faster

Integration with PyArrow

The API integrates seamlessly with PyArrow:

import pyarrow.parquet as pq
import geoparquet_io as gpio
from geoparquet_io.api import Table

# From PyArrow Table
arrow_table = pq.read_table('input.parquet')
table = Table(arrow_table)
result = table.add_bbox().sort_hilbert()

# To PyArrow Table
arrow_result = result.to_arrow()

# Use with PyArrow operations
filtered = arrow_result.filter(arrow_result['population'] > 1000)

Advanced: Direct Core Function Access

For power users who need direct access to core functions (e.g., for custom pipelines or when you need file-based operations without the Table wrapper):

from geoparquet_io.core.add_bbox_column import add_bbox_column
from geoparquet_io.core.hilbert_order import hilbert_order

# File-based operations
add_bbox_column(
    input_parquet="input.parquet",
    output_parquet="output.parquet",
    bbox_name="bbox",
    verbose=True
)

hilbert_order(
    input_parquet="input.parquet",
    output_parquet="sorted.parquet",
    geometry_column="geometry",
    add_bbox=True,
    verbose=True
)

See Core Functions Reference for all available functions.

Note: The fluent API (gpio.read()...) is recommended for most use cases as it provides better ergonomics and in-memory performance. The core API is primarily useful for:

  • Integrating with existing file-based pipelines
  • Fine-grained control over function parameters
  • Building custom tooling around gpio

Standalone Functions

STAC Generation

Generate and validate STAC (SpatioTemporal Asset Catalog) metadata:

from geoparquet_io import generate_stac, validate_stac

# Generate STAC Item for a single file
stac_path = generate_stac(
    'data.parquet',
    bucket='s3://my-bucket/data/'
)

# Generate STAC Collection for a directory
stac_path = generate_stac(
    'partitioned/',
    bucket='s3://my-bucket/data/',
    collection_id='my-dataset'
)

# With all options
stac_path = generate_stac(
    'data.parquet',
    output_path='custom.json',
    bucket='s3://my-bucket/data/',
    item_id='my-item',
    public_url='https://data.example.com/',
    overwrite=True,
    verbose=True
)

# Validate STAC
result = validate_stac('collection.json')
if result.passed():
    print("Valid STAC!")
else:
    for failure in result.failures():
        print(f"Issue: {failure}")

CheckResult Class

All check and validate methods return a CheckResult object:

from geoparquet_io import CheckResult

# Methods
result.passed()          # Returns True if all checks passed
result.failures()        # List of failure messages
result.warnings()        # List of warning messages
result.recommendations() # List of recommendations
result.to_dict()         # Full results as dictionary

# Can be used as boolean
if result:
    print("Passed!")
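
The boolean behavior works because the result object defines its truthiness from the failure list. A minimal stand-alone sketch of such an object (`SimpleCheckResult` is illustrative, not gpio's actual class):

```python
class SimpleCheckResult:
    """Minimal check-result object: truthy when there are no failures."""

    def __init__(self, failures=None, warnings=None):
        self._failures = list(failures or [])
        self._warnings = list(warnings or [])

    def passed(self) -> bool:
        return not self._failures

    def failures(self):
        return list(self._failures)

    def warnings(self):
        return list(self._warnings)

    def __bool__(self) -> bool:
        # Enables `if result:` to mean "all checks passed".
        return self.passed()

ok = SimpleCheckResult()
bad = SimpleCheckResult(failures=["missing bbox covering"])
print(bool(ok), bool(bad))  # True False
```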

See Also