Python API¶
gpio provides a fluent Python API for GeoParquet transformations. This API offers the best performance by keeping data in memory as Arrow tables, avoiding file I/O entirely.
Installation¶
pip install geoparquet-io
# or: uv add geoparquet-io
# or, for the command-line tool only: pipx install geoparquet-io
Quick Start¶
import geoparquet_io as gpio
# Read, transform, and write in a fluent chain
gpio.read('input.parquet') \
.add_bbox() \
.add_quadkey(resolution=12) \
.sort_hilbert() \
.write('output.parquet')
Reading Data¶
Use gpio.read() to load a GeoParquet file:
import geoparquet_io as gpio
# Read a file
table = gpio.read('places.parquet')
# Access properties
print(f"Rows: {table.num_rows}")
print(f"Columns: {table.column_names}")
print(f"Geometry column: {table.geometry_column}")
Reading from BigQuery¶
Use Table.from_bigquery() to read directly from BigQuery tables. The table_id parameter accepts the fully-qualified "project.dataset.table" format, or "dataset.table" when a separate project argument is provided (or when your default gcloud project applies). When using bbox, provide coordinates as "minx,miny,maxx,maxy" in EPSG:4326 longitude/latitude degrees (e.g., "-122.52,37.70,-122.35,37.82").
import geoparquet_io as gpio
# Basic read
table = gpio.Table.from_bigquery('myproject.geodata.buildings')
# With filtering
table = gpio.Table.from_bigquery(
'myproject.geodata.buildings',
where="area_sqm > 1000",
columns=['id', 'name', 'geography'],
limit=10000
)
# With spatial filtering (bbox)
table = gpio.Table.from_bigquery(
'myproject.geodata.buildings',
bbox="-122.52,37.70,-122.35,37.82"
)
# With explicit credentials
table = gpio.Table.from_bigquery(
'myproject.geodata.buildings',
credentials_file='/path/to/service-account.json'
)
# Chain with other operations
gpio.Table.from_bigquery('myproject.geodata.buildings', limit=10000) \
.add_bbox() \
.sort_hilbert() \
.write('output.parquet')
Bbox filtering modes:
When using bbox, control where filtering happens with bbox_mode:
# Server-side filtering (best for large tables)
table = gpio.Table.from_bigquery(
'myproject.geodata.global_buildings',
bbox="-122.52,37.70,-122.35,37.82",
bbox_mode="server"
)
# Local filtering (best for small tables)
table = gpio.Table.from_bigquery(
'myproject.geodata.city_parks',
bbox="-122.52,37.70,-122.35,37.82",
bbox_mode="local"
)
# Custom threshold for auto mode (default: 500000)
table = gpio.Table.from_bigquery(
'myproject.geodata.buildings',
bbox="-122.52,37.70,-122.35,37.82",
bbox_threshold=100000 # Use server for tables > 100K rows
)
See the Extract Guide for detailed tradeoff analysis.
BigQuery Limitations
- Cannot read views or external tables (Storage Read API limitation)
- BIGNUMERIC columns are not supported
Reading from ArcGIS Feature Services¶
Use gpio.extract_arcgis() to download features from ArcGIS REST Feature Services. Server-side filtering is applied for efficient data transfer.
import geoparquet_io as gpio
# Basic read from public service
table = gpio.extract_arcgis(
'https://services.arcgis.com/.../FeatureServer/0'
)
# With server-side filtering
table = gpio.extract_arcgis(
'https://services.arcgis.com/.../FeatureServer/0',
where="STATE_NAME = 'California'",
bbox=(-122.5, 37.5, -122.0, 38.0),
include_cols='NAME,POPULATION,STATE_NAME',
limit=10000
)
# With authentication
table = gpio.extract_arcgis(
'https://services.arcgis.com/.../FeatureServer/0',
token='your_arcgis_token'
)
# With username/password authentication
table = gpio.extract_arcgis(
'https://services.arcgis.com/.../FeatureServer/0',
username='myuser',
password='mypassword'
)
# Chain with other operations
gpio.extract_arcgis(
'https://services.arcgis.com/.../FeatureServer/0',
limit=10000
).add_bbox().sort_hilbert().write('output.parquet')
Parameters:
| Parameter | Type | Description |
|---|---|---|
| service_url | str | ArcGIS Feature Service URL with layer ID |
| token | str | Direct authentication token |
| token_file | str | Path to file containing token |
| username | str | ArcGIS Online/Enterprise username |
| password | str | ArcGIS password (requires username) |
| portal_url | str | Enterprise portal URL for token generation |
| where | str | SQL WHERE clause (default: "1=1" = all) |
| bbox | tuple | Bounding box filter (xmin, ymin, xmax, ymax) in WGS84 |
| include_cols | str | Comma-separated columns to include |
| exclude_cols | str | Comma-separated columns to exclude |
| limit | int | Maximum number of features |
No automatic Hilbert sorting
Unlike the CLI gpio extract arcgis command, the Python API does NOT apply Hilbert sorting by default. Chain .sort_hilbert() explicitly if you want spatial ordering.
Table Class¶
The Table class wraps a PyArrow Table and provides chainable transformation methods.
Properties¶
| Property | Description |
|---|---|
| num_rows | Number of rows in the table |
| column_names | List of column names |
| geometry_column | Name of the geometry column |
| crs | CRS as PROJJSON dict or string (None = OGC:CRS84 default) |
| bounds | Bounding box tuple (xmin, ymin, xmax, ymax) |
| schema | PyArrow Schema object |
| geoparquet_version | GeoParquet version string (e.g., "1.1") |
table = gpio.read('data.parquet')
# Get CRS
print(table.crs) # e.g., {'id': {'authority': 'EPSG', 'code': 4326}, ...}
# Get bounds
print(table.bounds) # e.g., (-122.5, 37.5, -122.0, 38.0)
# Get schema
for field in table.schema:
print(f"{field.name}: {field.type}")
Methods¶
info(verbose=True)¶
Print or return summary information about the table.
# Print formatted summary
table.info()
# Table: 766 rows, 6 columns
# Geometry: geometry
# CRS: EPSG:4326
# Bounds: [-122.500000, 37.500000, -122.000000, 38.000000]
# GeoParquet: 1.1
# Get as dictionary
info_dict = table.info(verbose=False)
print(info_dict['rows']) # 766
print(info_dict['crs']) # None or CRS dict
head(n=10) / tail(n=10)¶
Get the first or last N rows.
# First 10 rows (default)
first_rows = table.head()
# First 50 rows
first_50 = table.head(50)
# Last 10 rows (default)
last_rows = table.tail()
# Last 5 rows
last_5 = table.tail(5)
# Chain with other operations
preview = table.head(100).add_bbox()
stats()¶
Calculate column statistics.
stats = table.stats()
# Access stats for a column
print(stats['population']['min']) # Minimum value
print(stats['population']['max']) # Maximum value
print(stats['population']['nulls']) # Null count
print(stats['population']['unique']) # Approximate unique count
# Geometry columns have only null counts
print(stats['geometry']['nulls'])
metadata(include_parquet_metadata=False)¶
Get GeoParquet and schema metadata.
meta = table.metadata()
# Access metadata
print(meta['geoparquet_version']) # e.g., '1.1.0'
print(meta['geometry_column']) # e.g., 'geometry'
print(meta['crs']) # CRS dict or None
print(meta['bounds']) # (xmin, ymin, xmax, ymax)
print(meta['columns']) # List of column info dicts
# Full geo metadata from 'geo' key
geo_meta = meta.get('geo_metadata', {})
# Include raw Parquet schema metadata
full_meta = table.metadata(include_parquet_metadata=True)
to_geojson(output_path=None, precision=7, write_bbox=False, id_field=None)¶
Convert to GeoJSON.
# Write to file
table.to_geojson('output.geojson')
# With options
table.to_geojson('output.geojson', precision=5, write_bbox=True)
# Get as string (no file output)
geojson_str = table.to_geojson()
add_bbox(column_name='bbox')¶
Add a bounding box struct column computed from geometry.
table = gpio.read('input.parquet').add_bbox()
# or with custom name
table = gpio.read('input.parquet').add_bbox(column_name='bounds')
add_quadkey(column_name='quadkey', resolution=13, use_centroid=False)¶
Add a quadkey column based on geometry location.
# Default resolution (13)
table = gpio.read('input.parquet').add_quadkey()
# Custom resolution
table = gpio.read('input.parquet').add_quadkey(resolution=10)
# Force centroid calculation even if bbox exists
table = gpio.read('input.parquet').add_quadkey(use_centroid=True)
add_h3(column_name='h3_cell', resolution=9)¶
Add an H3 hexagonal cell column based on geometry location.
# Default resolution (9, ~100m cells)
table = gpio.read('input.parquet').add_h3()
# Lower resolution for larger cells
table = gpio.read('input.parquet').add_h3(resolution=6)
# Custom column name
table = gpio.read('input.parquet').add_h3(column_name='hex_id', resolution=8)
add_s2(column_name='s2_cell', level=13)¶
Add an S2 spherical cell column based on geometry location.
# Default level (13, ~1.2 km² cells)
table = gpio.read('input.parquet').add_s2()
# Lower level for larger cells
table = gpio.read('input.parquet').add_s2(level=10)
# Custom column name
table = gpio.read('input.parquet').add_s2(column_name='s2_index', level=15)
add_kdtree(column_name='kdtree_cell', iterations=9, sample_size=100000)¶
Add a KD-tree cell column for data-adaptive spatial partitioning.
# Default settings (512 partitions = 2^9)
table = gpio.read('input.parquet').add_kdtree()
# Fewer partitions
table = gpio.read('input.parquet').add_kdtree(iterations=6) # 64 partitions
# More partitions with larger sample
table = gpio.read('input.parquet').add_kdtree(iterations=12, sample_size=500000)
sort_hilbert()¶
Reorder rows using Hilbert curve ordering for better spatial locality.
table = gpio.read('input.parquet').sort_hilbert()
sort_column(column_name, descending=False)¶
Sort rows by a specified column.
# Sort by name ascending
table = gpio.read('input.parquet').sort_column('name')
# Sort by population descending
table = gpio.read('input.parquet').sort_column('population', descending=True)
sort_quadkey(column_name='quadkey', resolution=13, use_centroid=False, remove_column=False)¶
Sort rows by quadkey for spatial locality. If no quadkey column exists, one is added automatically.
# Sort by quadkey (auto-adds column if needed)
table = gpio.read('input.parquet').sort_quadkey()
# Sort and remove the quadkey column afterward
table = gpio.read('input.parquet').sort_quadkey(remove_column=True)
# Use existing quadkey column
table = gpio.read('input.parquet').sort_quadkey(column_name='my_quadkey')
reproject(target_crs='EPSG:4326', source_crs=None)¶
Reproject geometry to a different coordinate reference system.
# Reproject to WGS84 (auto-detects source CRS from metadata)
table = gpio.read('input.parquet').reproject(target_crs='EPSG:4326')
# Reproject with explicit source CRS
table = gpio.read('input.parquet').reproject(
target_crs='EPSG:3857',
source_crs='EPSG:4326'
)
extract(columns=None, exclude_columns=None, bbox=None, where=None, limit=None)¶
Filter columns and rows.
# Select specific columns
table = gpio.read('input.parquet').extract(columns=['name', 'address'])
# Exclude columns
table = gpio.read('input.parquet').extract(exclude_columns=['temp_id'])
# Limit rows
table = gpio.read('input.parquet').extract(limit=1000)
# Spatial filter
table = gpio.read('input.parquet').extract(bbox=(-122.5, 37.5, -122.0, 38.0))
# SQL WHERE clause
table = gpio.read('input.parquet').extract(where="population > 10000")
write(path, compression='ZSTD', compression_level=None, row_group_size_mb=None, row_group_rows=None, write_strategy=None, write_memory=None)¶
Write the table to a GeoParquet file. Returns the output Path for chaining or confirmation.
# Basic write
path = table.write('output.parquet')
print(f"Wrote to {path}")
# With compression options
table.write('output.parquet', compression='GZIP', compression_level=6)
# With row group size
table.write('output.parquet', row_group_size_mb=128)
Write Strategy Options
For large files, you can control memory usage with write strategies:
# Use streaming strategy (constant memory usage)
table.write('output.parquet', write_strategy='streaming')
# Limit DuckDB memory for containerized environments
table.write('output.parquet', write_memory='512MB')
# Combine both options
table.write('output.parquet', write_strategy='streaming', write_memory='1GB')
| Parameter | Type | Description |
|---|---|---|
| write_strategy | str | Write strategy: duckdb-kv (default), streaming, disk-rewrite, or in-memory |
| write_memory | str | DuckDB memory limit (e.g., '2GB', '512MB'). Auto-detected if not specified |
See the Write Strategies Guide for detailed information on each strategy.
to_arrow()¶
Get the underlying PyArrow Table for interop with other Arrow-based tools.
arrow_table = table.to_arrow()
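Since the result is a plain PyArrow Table, it plugs into any Arrow-compatible tool. A minimal sketch (assuming pandas is installed; the geometry column typically comes through as raw WKB bytes):
# Convert to a pandas DataFrame for non-spatial analysis
df = table.to_arrow().to_pandas()
print(df.head())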
Spatial Partitioning Methods¶
All spatial partitioning methods support automatic resolution calculation in the CLI (via the --auto flag). The Python API currently requires an explicit resolution; auto-resolution support is planned.
partition_by_quadkey(output_dir, resolution=13, partition_resolution=6, compression='ZSTD', hive=True, overwrite=False)¶
Partition the table into a Hive-partitioned directory by quadkey.
# Partition to a directory
stats = table.partition_by_quadkey('output/', resolution=12)
print(f"Created {stats['file_count']} files")
# With custom options
stats = table.partition_by_quadkey(
'output/',
partition_resolution=4,
compression='SNAPPY',
overwrite=True
)
partition_by_h3(output_dir, resolution=9, compression='ZSTD', hive=True, overwrite=False)¶
Partition the table into a Hive-partitioned directory by H3 cell.
# Partition by H3
stats = table.partition_by_h3('output/', resolution=6)
print(f"Created {stats['file_count']} files")
partition_by_s2(output_dir, level=13, compression='ZSTD', hive=True, overwrite=False)¶
Partition the table into a Hive-partitioned directory by S2 cell.
# Partition by S2
stats = table.partition_by_s2('output/', level=10)
print(f"Created {stats['file_count']} files")
partition_by_a5(output_dir, resolution=15, compression='ZSTD', hive=True, overwrite=False)¶
Partition the table into a Hive-partitioned directory by A5 cell.
# Partition by A5
stats = table.partition_by_a5('output/', resolution=12)
print(f"Created {stats['file_count']} files")
partition_by_string(output_dir, column, chars=None, hive=True, overwrite=False)¶
Partition by string column values or prefixes.
# Partition by full column values
stats = table.partition_by_string('output/', column='category')
# Partition by first 2 characters
stats = table.partition_by_string('output/', column='mgrs_code', chars=2)
partition_by_kdtree(output_dir, iterations=9, hive=True, overwrite=False)¶
Partition by KD-tree spatial cells.
# Default (512 partitions = 2^9)
stats = table.partition_by_kdtree('output/')
# 64 partitions (2^6)
stats = table.partition_by_kdtree('output/', iterations=6)
partition_by_admin(output_dir, dataset='gaul', levels=None, hive=True, overwrite=False)¶
Partition by administrative boundaries.
# Partition by country using GAUL dataset
stats = table.partition_by_admin('output/', dataset='gaul', levels=['country'])
# Multi-level hierarchical
stats = table.partition_by_admin(
'output/',
dataset='gaul',
levels=['continent', 'country', 'department'],
hive=True
)
Sub-Partitioning Utilities¶
For working with directories of partitioned files, gpio provides utilities to find and sub-partition large files.
find_large_files(directory, min_size_bytes, recursive=True)¶
Find parquet files exceeding a size threshold.
from geoparquet_io.core.sub_partition import find_large_files
# Find files over 100MB
large_files = find_large_files('/data/partitions/', min_size_bytes=100 * 1024 * 1024)
print(f"Found {len(large_files)} large files")
for file_path in large_files:
print(f" {file_path}")
Parameters:
- directory (str): Directory to search
- min_size_bytes (int): Minimum file size in bytes
- recursive (bool): Search subdirectories (default: True)
Returns: List of file paths sorted by size (largest first)
sub_partition_directory(directory, partition_type, min_size_bytes, resolution=None, level=None, in_place=False, hive=False, overwrite=False, verbose=False, force=False, skip_analysis=True, compression='ZSTD', compression_level=15, auto=False, target_rows=100000, max_partitions=10000)¶
Sub-partition large files in a directory using spatial indexing.
from geoparquet_io.core.sub_partition import sub_partition_directory
# Sub-partition all H3-partitioned files over 100MB
result = sub_partition_directory(
directory='/data/h3_partitions/',
partition_type='h3',
min_size_bytes=100 * 1024 * 1024,
resolution=4,
in_place=True, # Replace originals
verbose=True
)
print(f"Processed: {result['processed']}")
print(f"Errors: {len(result['errors'])}")
# Sub-partition S2 files with auto-resolution
result = sub_partition_directory(
directory='/data/s2_partitions/',
partition_type='s2',
min_size_bytes=50 * 1024 * 1024,
auto=True,
target_rows=50000,
skip_analysis=True # Skip per-file analysis for speed
)
# Sub-partition quadkey files
result = sub_partition_directory(
directory='/data/quadkey_partitions/',
partition_type='quadkey',
min_size_bytes=200 * 1024 * 1024,
resolution=8,
hive=True
)
Parameters:
- directory (str): Directory containing parquet files
- partition_type (str): Type of partition ("h3", "s2", "quadkey")
- min_size_bytes (int): Minimum file size to process
- resolution (int | None): Resolution for H3/quadkey (0-15 for H3)
- level (int | None): Level for S2 (alias for resolution)
- in_place (bool): Delete originals after successful sub-partition (default: False)
- hive (bool): Use Hive-style partitioning (default: False)
- overwrite (bool): Overwrite existing output directories (default: False)
- verbose (bool): Print verbose output (default: False)
- force (bool): Force operation even with warnings (default: False)
- skip_analysis (bool): Skip partition analysis for performance (default: True)
- compression (str): Compression codec (default: "ZSTD")
- compression_level (int): Compression level (default: 15)
- auto (bool): Auto-calculate resolution (default: False)
- target_rows (int): Target rows per partition for auto mode (default: 100000)
- max_partitions (int): Max partitions for auto mode (default: 10000)
Returns: Dictionary with keys:
- processed (int): Number of files successfully processed
- skipped (int): Number of files skipped (below threshold)
- errors (list): List of dicts with keys file and error
Note: When auto=True, the function automatically calculates the best resolution based on data distribution. Use skip_analysis=True for faster batch processing when you trust the resolution settings.
add_admin_divisions(dataset='overture', levels=None, country_filter=None, use_centroid=False)¶
Add administrative division columns via spatial join.
# Add country codes
enriched = table.add_admin_divisions(
dataset='overture',
levels=['country']
)
# Add multiple levels with country filter
enriched = table.add_admin_divisions(
dataset='gaul',
levels=['continent', 'country', 'department'],
country_filter='US'
)
add_bbox_metadata(bbox_column='bbox')¶
Add bbox covering metadata to the table schema.
# Add bbox column and metadata in one chain
table_with_bbox = table.add_bbox().add_bbox_metadata()
# Or add metadata to existing bbox column
table_with_meta = table.add_bbox_metadata()
check() / check_spatial() / check_compression() / check_bbox() / check_row_groups()¶
Run best-practice checks on the table.
# Run all checks
result = table.check()
if result.passed():
print("All checks passed!")
else:
for failure in result.failures():
print(f"Failed: {failure}")
# Individual checks
spatial_result = table.check_spatial()
compression_result = table.check_compression()
bbox_result = table.check_bbox()
row_group_result = table.check_row_groups()
# Access results as dictionary
details = result.to_dict()
validate(version=None)¶
Validate against GeoParquet specification.
result = table.validate()
if result.passed():
print(f"Valid GeoParquet {table.geoparquet_version}")
# Validate against specific version
result = table.validate(version='1.1')
upload(destination, compression='ZSTD', profile=None, s3_endpoint=None, ...)¶
Write and upload the table to cloud object storage (S3, GCS, Azure).
# Upload to S3
gpio.read('input.parquet') \
.add_bbox() \
.sort_hilbert() \
.upload('s3://bucket/data.parquet')
# Upload with AWS profile
table.upload('s3://bucket/data.parquet', profile='my-aws-profile')
# Upload to S3-compatible storage (MinIO, source.coop)
table.upload(
's3://bucket/data.parquet',
s3_endpoint='minio.example.com:9000',
s3_use_ssl=False
)
# Upload to GCS
table.upload('gs://bucket/data.parquet')
Converting Other Formats¶
Reading Other Formats (to GeoParquet)¶
Use gpio.convert() to load GeoPackage, Shapefile, GeoJSON, FlatGeobuf, or CSV files:
import geoparquet_io as gpio
# Convert GeoPackage
table = gpio.convert('data.gpkg')
# Convert Shapefile
table = gpio.convert('data.shp')
# Convert GeoJSON
table = gpio.convert('data.geojson')
# Convert CSV with WKT geometry
table = gpio.convert('data.csv', wkt_column='geometry')
# Convert CSV with lat/lon columns
table = gpio.convert('data.csv', lat_column='latitude', lon_column='longitude')
# Convert from S3 with authentication
table = gpio.convert('s3://bucket/data.gpkg', profile='my-aws')
Unlike the CLI convert command, the Python API does NOT apply Hilbert sorting by default. Chain .sort_hilbert() explicitly if you want spatial ordering:
# Full conversion workflow
gpio.convert('data.shp') \
.add_bbox() \
.sort_hilbert() \
.write('output.parquet')
Writing to Other Formats (from GeoParquet)¶
The Table.write() method supports multiple output formats with automatic format detection:
import geoparquet_io as gpio
# Read GeoParquet
table = gpio.read('data.parquet')
# Write to different formats (auto-detected from extension)
table.write('output.gpkg') # GeoPackage
table.write('output.fgb') # FlatGeobuf
table.write('output.csv') # CSV with WKT
table.write('output.shp') # Shapefile
table.write('output.geojson') # GeoJSON
# Or specify format explicitly
table.write('output.dat', format='csv')
Format-Specific Options¶
GeoPackage:
table.write('output.gpkg',
layer_name='buildings', # Custom layer name
overwrite=True) # Overwrite existing file
Shapefile:
table.write('output.shp',
encoding='ISO-8859-1', # Custom encoding (default: UTF-8)
overwrite=True)
Shapefile Limitations
Shapefiles have significant limitations:
- Column names truncated to 10 characters
- File size limit of 2GB
- Limited data type support
- Creates multiple files (.shp, .shx, .dbf, .prj)
Consider using GeoPackage or FlatGeobuf for new projects.
CSV:
table.write('output.csv',
include_wkt=True, # Include WKT geometry (default)
include_bbox=False) # Exclude bbox column
GeoJSON:
table.write('output.geojson',
precision=5, # Coordinate precision (default: 7)
write_bbox=True, # Include bbox for each feature
id_field='osm_id', # Use field as feature ID
pretty=True, # Pretty-print JSON
keep_crs=False) # Reproject to WGS84 (default)
Using ops Functions for Format Conversion¶
For functional-style programming, use ops.convert_to_*() functions:
from geoparquet_io import ops
import pyarrow.parquet as pq
# Read Arrow table
table = pq.read_table('data.parquet')
# Convert to various formats
ops.convert_to_geopackage(table, 'output.gpkg', layer_name='features')
ops.convert_to_flatgeobuf(table, 'output.fgb')
ops.convert_to_csv(table, 'output.csv', include_wkt=True)
ops.convert_to_shapefile(table, 'output.shp', encoding='UTF-8')
ops.convert_to_geojson(table, 'output.geojson', precision=7)
Reading Partitioned Data¶
Use gpio.read_partition() to read Hive-partitioned datasets:
import geoparquet_io as gpio
# Read from a partitioned directory
table = gpio.read_partition('partitioned_output/')
# Read with glob pattern
table = gpio.read_partition('data/quadkey=*/*.parquet')
# Allow schema differences across partitions
table = gpio.read_partition('output/', allow_schema_diff=True)
Method Chaining¶
All transformation methods return a new Table, enabling fluent chains:
result = gpio.read('input.parquet') \
.extract(limit=10000) \
.add_bbox() \
.add_quadkey(resolution=12) \
.sort_hilbert()
result.write('output.parquet')
Pure Functions (ops module)¶
For integration with other Arrow workflows, use the ops module which provides pure functions:
import pyarrow.parquet as pq
from geoparquet_io.api import ops
# Read with PyArrow
table = pq.read_table('input.parquet')
# Apply transformations
table = ops.add_bbox(table)
table = ops.add_quadkey(table, resolution=12)
table = ops.sort_hilbert(table)
# Write with PyArrow
pq.write_table(table, 'output.parquet')
Note: pq.write_table() may not preserve all GeoParquet metadata (such as the geo key with CRS and geometry column info). For proper metadata preservation, wrap the result in Table(table).write('output.parquet') or use write_parquet_with_metadata() from geoparquet_io.core.common. The fluent API's .write() method is recommended.
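As a minimal sketch of the wrapping approach described above (using only pieces shown elsewhere on this page), you can run the pure functions on a PyArrow table and hand the result back to Table for a metadata-preserving write:
import pyarrow.parquet as pq
from geoparquet_io.api import Table, ops
# Transform with pure functions, then wrap so the GeoParquet metadata is written
table = pq.read_table('input.parquet')
table = ops.sort_hilbert(ops.add_bbox(table))
Table(table).write('output.parquet')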
Available Functions¶
| Function | Description |
|---|---|
| ops.add_bbox(table, column_name='bbox', geometry_column=None) | Add bounding box column |
| ops.add_quadkey(table, column_name='quadkey', resolution=13, use_centroid=False, geometry_column=None) | Add quadkey column |
| ops.add_h3(table, column_name='h3_cell', resolution=9, geometry_column=None) | Add H3 cell column |
| ops.add_s2(table, column_name='s2_cell', level=13, geometry_column=None) | Add S2 cell column |
| ops.add_kdtree(table, column_name='kdtree_cell', iterations=9, sample_size=100000, geometry_column=None) | Add KD-tree cell column |
| ops.sort_hilbert(table, geometry_column=None) | Reorder by Hilbert curve |
| ops.sort_column(table, column, descending=False) | Sort by column(s) |
| ops.sort_quadkey(table, column_name='quadkey', resolution=13, use_centroid=False, remove_column=False) | Sort by quadkey |
| ops.reproject(table, target_crs='EPSG:4326', source_crs=None, geometry_column=None) | Reproject geometry |
| ops.extract(table, columns=None, exclude_columns=None, bbox=None, where=None, limit=None, geometry_column=None) | Filter columns/rows |
| ops.read_bigquery(table_id, project=None, credentials_file=None, where=None, bbox=None, bbox_mode='auto', bbox_threshold=500000, limit=None, columns=None, exclude_columns=None) | Read BigQuery table |
| ops.from_arcgis(service_url, token=None, where='1=1', bbox=None, include_cols=None, exclude_cols=None, limit=None) | Fetch ArcGIS Feature Service |
Pipeline Composition¶
Use pipe() to create reusable transformation pipelines:
from geoparquet_io.api import pipe, read
# Define a reusable pipeline
preprocess = pipe(
lambda t: t.add_bbox(),
lambda t: t.add_quadkey(resolution=12),
lambda t: t.sort_hilbert(),
)
# Apply to any table
result = preprocess(read('input.parquet'))
result.write('output.parquet')
# Or with ops functions
from geoparquet_io.api import ops
transform = pipe(
lambda t: ops.add_bbox(t),
lambda t: ops.add_quadkey(t, resolution=10),
lambda t: ops.extract(t, limit=1000),
)
import pyarrow.parquet as pq
table = pq.read_table('input.parquet')
result = transform(table)
Performance¶
The Python API provides the best performance because:
- No file I/O: Data stays in memory as Arrow tables
- Zero-copy: Arrow's columnar format enables efficient operations
- DuckDB backend: Spatial operations use DuckDB's optimized engine
Benchmark comparison (75MB file, 400K rows):
| Approach | Time | Speedup |
|---|---|---|
| File-based CLI | 34s | baseline |
| Piped CLI | 16s | 53% faster |
| Python API | 7s | 78% faster |
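Timings vary with hardware and data, so treat the table as indicative. A rough sketch for timing the in-memory pipeline on your own file (filenames here are placeholders):
import time
import geoparquet_io as gpio
# Time a full read-transform-write chain
start = time.perf_counter()
gpio.read('input.parquet') \
    .add_bbox() \
    .sort_hilbert() \
    .write('output.parquet')
print(f"Python API pipeline: {time.perf_counter() - start:.1f}s")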
Integration with PyArrow¶
The API integrates seamlessly with PyArrow:
import pyarrow.parquet as pq
import geoparquet_io as gpio
from geoparquet_io.api import Table
# From PyArrow Table
arrow_table = pq.read_table('input.parquet')
table = Table(arrow_table)
result = table.add_bbox().sort_hilbert()
# To PyArrow Table
arrow_result = result.to_arrow()
# Use with PyArrow compute functions
import pyarrow.compute as pc
filtered = arrow_result.filter(pc.greater(arrow_result['population'], 1000))
Advanced: Direct Core Function Access¶
For power users who need direct access to core functions (e.g., for custom pipelines or when you need file-based operations without the Table wrapper):
from geoparquet_io.core.add_bbox_column import add_bbox_column
from geoparquet_io.core.hilbert_order import hilbert_order
# File-based operations
add_bbox_column(
input_parquet="input.parquet",
output_parquet="output.parquet",
bbox_name="bbox",
verbose=True
)
hilbert_order(
input_parquet="input.parquet",
output_parquet="sorted.parquet",
geometry_column="geometry",
add_bbox=True,
verbose=True
)
See Core Functions Reference for all available functions.
Note: The fluent API (gpio.read()...) is recommended for most use cases as it provides better ergonomics and in-memory performance. The core API is primarily useful for:
- Integrating with existing file-based pipelines
- When you need fine-grained control over function parameters
- Building custom tooling around gpio
Standalone Functions¶
STAC Generation¶
Generate and validate STAC (SpatioTemporal Asset Catalog) metadata:
from geoparquet_io import generate_stac, validate_stac
# Generate STAC Item for a single file
stac_path = generate_stac(
'data.parquet',
bucket='s3://my-bucket/data/'
)
# Generate STAC Collection for a directory
stac_path = generate_stac(
'partitioned/',
bucket='s3://my-bucket/data/',
collection_id='my-dataset'
)
# With all options
stac_path = generate_stac(
'data.parquet',
output_path='custom.json',
bucket='s3://my-bucket/data/',
item_id='my-item',
public_url='https://data.example.com/',
overwrite=True,
verbose=True
)
# Validate STAC
result = validate_stac('collection.json')
if result.passed():
print("Valid STAC!")
else:
for failure in result.failures():
print(f"Issue: {failure}")
CheckResult Class¶
All check and validate methods return a CheckResult object:
from geoparquet_io import CheckResult
# Methods
result.passed() # Returns True if all checks passed
result.failures() # List of failure messages
result.warnings() # List of warning messages
result.recommendations() # List of recommendations
result.to_dict() # Full results as dictionary
# Can be used as boolean
if result:
print("Passed!")
See Also¶
- Command Piping - CLI piping for shell workflows
- Core API Reference - Low-level function reference
- Spatial Performance Guide - Understanding bbox, sorting, and partitioning