Remote Files¶

Commands accept remote URLs (s3://, gs://, az://, https://) anywhere they accept local paths. The one exception is listed under Exceptions below.

How Remote Access Works¶

gpio uses different libraries for reads and writes:

Reads: All commands read remote files via DuckDB's httpfs extension. This supports S3, GCS, Azure, and HTTPS URLs transparently.
Writes: All commands write to remote destinations using obstore. When you specify a remote output path, gpio writes to a local temp file first, then uploads via obstore automatically.
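
The write path described above (local temp file, upload, cleanup) can be sketched as follows. This is a minimal illustration of the pattern, not gpio's actual implementation: `write_then_upload`, `write_local`, and `upload` are hypothetical names, and the two callables stand in for the Parquet writer and the obstore upload step.

```python
import os
import tempfile
from typing import Callable


def write_then_upload(
    write_local: Callable[[str], None],
    upload: Callable[[str, str], None],
    remote_url: str,
) -> None:
    """Write output to a temp file, upload it, then remove the temp file.

    write_local(path) and upload(local_path, remote_url) are hypothetical
    callables standing in for the Parquet writer and the obstore upload.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".parquet")
    os.close(fd)  # the writer reopens the path itself
    try:
        write_local(tmp_path)
        upload(tmp_path, remote_url)
    finally:
        # Clean up even if the write or the upload fails
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Because the full output exists on local disk before the upload starts, the machine running the command needs enough free temporary space for it.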

The --aws-profile global flag is available on all commands for AWS authentication. See also --s3-endpoint, --s3-region, and --s3-no-ssl for S3-compatible storage.

gpio publish upload¶

For more control over uploads, use gpio publish upload, which provides:

Parallel multipart uploads for large files
Custom S3-compatible endpoints (MinIO, Ceph, etc.)
Directory uploads with pattern filtering
Progress tracking and error handling options

For simple remote outputs, give any command a remote output path and it writes there directly. For batch uploads or more control over S3-compatible storage, use gpio publish upload.

Authentication¶

geoparquet-io uses standard cloud provider authentication. Configure your credentials once using your cloud provider's standard tools; no CLI flags are needed for basic usage.

AWS S3¶

Credentials are automatically discovered in this order:

1. Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
2. AWS profile: ~/.aws/credentials via AWS_PROFILE env var or --aws-profile flag
3. IAM role: EC2/ECS/EKS instance metadata (when running on AWS infrastructure)
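
To check which source would apply in a given environment, the discovery order above can be mirrored in a small helper. This is only a sketch: `aws_credential_source` is a hypothetical function, not part of geoparquet_io, and the real resolution is performed by the underlying libraries.

```python
from typing import Mapping, Optional


def aws_credential_source(
    env: Mapping[str, str], profile_flag: Optional[str] = None
) -> str:
    """Return which source in the documented order would supply credentials."""
    # 1. Explicit key pair in the environment
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"
    # 2. Named profile, from the flag or from AWS_PROFILE
    profile = profile_flag or env.get("AWS_PROFILE")
    if profile:
        return f"profile '{profile}'"
    # 3. Nothing explicit: default profile or instance metadata
    return "default profile or IAM role"
```

Calling `aws_credential_source(os.environ)` before a long job is a quick way to confirm that the intended credentials are the ones in effect.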

Examples:

CLI:

# Use default credentials (from ~/.aws/credentials [default] or IAM role)
gpio add bbox s3://bucket/input.parquet s3://bucket/output.parquet

# Use environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
gpio add bbox s3://bucket/input.parquet s3://bucket/output.parquet

# Use a named AWS profile (convenient CLI flag)
gpio add bbox s3://bucket/input.parquet s3://bucket/output.parquet --aws-profile production

# Or set AWS_PROFILE environment variable (equivalent to --aws-profile)
export AWS_PROFILE=production
gpio add bbox s3://bucket/input.parquet s3://bucket/output.parquet

Python:

import os
import geoparquet_io as gpio

# Use default credentials (from ~/.aws/credentials [default] or IAM role)
gpio.read('s3://bucket/input.parquet').add_bbox().write('output.parquet')

# Use a named AWS profile
gpio.read('s3://bucket/input.parquet').add_bbox().upload(
    's3://bucket/output.parquet',
    profile='production'
)

# Or set AWS_PROFILE environment variable
os.environ['AWS_PROFILE'] = 'production'
gpio.read('s3://bucket/input.parquet').add_bbox().upload('s3://bucket/output.parquet')

Note: The --aws-profile flag is available on all commands and sets AWS_PROFILE for you.

Azure Blob Storage¶

Azure credentials are discovered automatically when reading files:

# Set account credentials via environment variables
export AZURE_STORAGE_ACCOUNT_NAME=myaccount
export AZURE_STORAGE_ACCOUNT_KEY=mykey

# Or use SAS token
export AZURE_STORAGE_SAS_TOKEN=mytoken

# Then use Azure URLs
gpio add bbox az://container/input.parquet az://container/output.parquet

Note: Azure read support is currently limited. If a remote read fails, download the file and process it locally.

Google Cloud Storage¶

Reading from GCS requires HMAC keys (not service account JSON):

# Generate HMAC keys at: https://console.cloud.google.com/storage/settings
export GCS_ACCESS_KEY_ID=your_access_key
export GCS_SECRET_ACCESS_KEY=your_secret_key

gpio add bbox gs://bucket/input.parquet gs://bucket/output.parquet

Note: DuckDB's GCS support requires HMAC keys, which differs from standard GCP authentication. For writes, obstore can use service account JSON via GOOGLE_APPLICATION_CREDENTIALS. For reads, use HMAC keys or process files locally.

S3-Compatible Storage¶

All commands support S3-compatible endpoints (MinIO, Cloudflare R2, source.coop, Ceph) via global flags:

CLI:

# Read from source.coop
gpio --s3-endpoint data.source.coop inspect summary s3://bucket/file.parquet

# MinIO without SSL
gpio --s3-endpoint minio.local:9000 --s3-no-ssl \
  extract geoparquet s3://bucket/input.parquet output.parquet

# Upload to custom endpoint
gpio --s3-endpoint storage.example.com --s3-region eu-west-1 \
  publish upload data.parquet s3://bucket/file.parquet

Python:

import geoparquet_io as gpio

# Read from source.coop
table = gpio.read_partition(
    's3://bucket/data/',
    s3_endpoint='data.source.coop'
)

# Upload to MinIO
gpio.read('data.parquet').upload(
    's3://bucket/file.parquet',
    s3_endpoint='minio.example.com:9000',
    s3_use_ssl=False
)

Environment Variables¶

Instead of flags, you can set standard AWS environment variables:

| Variable | Equivalent Flag |
| --- | --- |
| `AWS_ENDPOINT_URL` | `--s3-endpoint` |
| `AWS_REGION` / `AWS_DEFAULT_REGION` | `--s3-region` |
| `AWS_PROFILE` | `--aws-profile` |

export AWS_ENDPOINT_URL=https://data.source.coop
gpio inspect summary s3://bucket/file.parquet
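
When gpio is launched from a script, the same mapping can be applied programmatically. The helper below is a sketch under that assumption: `s3_flags_to_env` is a hypothetical name, and it only builds a dictionary of the variables listed in the table.

```python
from typing import Dict, Optional


def s3_flags_to_env(
    s3_endpoint: Optional[str] = None,
    s3_region: Optional[str] = None,
    aws_profile: Optional[str] = None,
) -> Dict[str, str]:
    """Translate flag values into the equivalent environment variables."""
    mapping = {
        "AWS_ENDPOINT_URL": s3_endpoint,
        "AWS_REGION": s3_region,
        "AWS_PROFILE": aws_profile,
    }
    # Leave out anything that was not provided
    return {name: value for name, value in mapping.items() if value is not None}
```

The result can be merged into a child process environment, for example `subprocess.run(["gpio", "inspect", "summary", "s3://bucket/file.parquet"], env={**os.environ, **s3_flags_to_env(s3_endpoint="https://data.source.coop")})`.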

SSL Detection¶

SSL is auto-detected from the endpoint URL:

http:// → SSL off
https:// or no scheme → SSL on
--s3-no-ssl forces SSL off in either case
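
These rules amount to a small decision function. The sketch below is one reading of them, assuming the flag always turns SSL off when given; `resolve_use_ssl` is a hypothetical name, not a gpio function.

```python
def resolve_use_ssl(endpoint: str, no_ssl: bool = False) -> bool:
    """Decide whether to use SSL for an S3-compatible endpoint."""
    if no_ssl:
        # Assumed behavior of --s3-no-ssl: it wins over the URL scheme
        return False
    if endpoint.startswith("http://"):
        return False
    # https:// or a bare host[:port] defaults to SSL on
    return True
```

Under this reading, a bare endpoint such as `minio.local:9000` needs the flag (or an explicit http:// scheme) to be reached without SSL.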

Piping to Upload¶

For efficient workflows, chain processing steps with pipes, write the result to a local file, and then upload it. The piped stages exchange data as Arrow IPC streams, which adds minimal overhead:

# Process and upload in one pipeline
gpio extract --bbox "-122.5,37.5,-122.0,38.0" input.parquet | \
  gpio add bbox - | \
  gpio sort hilbert - local_output.parquet && \
  gpio publish upload local_output.parquet s3://bucket/output.parquet --aws-profile prod

Or use the Python API to run the same steps in memory and upload the result:

import geoparquet_io as gpio

# Process in memory, then upload
table = gpio.read('input.parquet') \
    .extract(bbox=(-122.5, 37.5, -122.0, 38.0)) \
    .add_bbox() \
    .sort_hilbert()

# Upload directly (writes temp file, uploads, cleans up)
table.upload('s3://bucket/output.parquet', profile='prod')

See Command Piping for more streaming patterns.

Exceptions¶

STAC generation (gpio publish stac) requires local files because asset paths reference local storage.

Notes¶

Remote writes use temporary local storage (~2× output file size required)
HTTPS wildcards (*.parquet) are not supported
For very large files (>10 GB), consider processing locally for better performance
S3-compatible endpoints work with all commands via --s3-endpoint
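
The first note above implies a free-space check before a large remote write. A minimal sketch follows, assuming the temporary file is created in the system temp directory (an assumption; the actual location gpio uses is not stated here). `has_temp_space_for` is a hypothetical helper.

```python
import shutil
import tempfile


def has_temp_space_for(output_bytes: int, factor: float = 2.0) -> bool:
    """Check that the temp directory has about factor x the output size free."""
    free = shutil.disk_usage(tempfile.gettempdir()).free
    return free >= output_bytes * factor
```

For example, `has_temp_space_for(5 * 1024**3)` checks for roughly 10 GiB of free temporary space before writing a 5 GiB output to a remote path.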