Remote Files¶
All commands work with remote URLs (s3://, gs://, az://, https://). Use them anywhere you'd use local paths.
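As a rough illustration of what "remote" means here, the sketch below classifies a path by its URL scheme. The helper name `is_remote` is hypothetical and not part of gpio; it only checks the four schemes listed above and assumes everything else is a local path.

```python
from urllib.parse import urlparse

# Schemes named on this page; treating anything else as local is an assumption.
REMOTE_SCHEMES = {"s3", "gs", "az", "https"}


def is_remote(path: str) -> bool:
    """Return True when the path uses one of the remote URL schemes."""
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


if __name__ == "__main__":
    for candidate in ("s3://bucket/input.parquet", "data/input.parquet"):
        print(candidate, "->", "remote" if is_remote(candidate) else "local")
```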
How Remote Access Works¶
gpio uses different libraries for reads and writes:
- Reads: All commands read remote files via DuckDB's httpfs extension. This supports S3, GCS, Azure, and HTTPS URLs transparently.
- Writes: All commands write to remote destinations using obstore. When you specify a remote output path, gpio writes to a local temp file first, then uploads via obstore automatically.
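The write path described above (local temp file first, then upload) can be sketched with the standard library alone. This is not gpio's implementation: `write_then_upload`, `write_local`, and `upload` are hypothetical names, and the uploader is a plain callable standing in for the obstore step.

```python
import os
import shutil
import tempfile
from typing import Callable


def write_then_upload(
    write_local: Callable[[str], None],
    upload: Callable[[str, str], None],
    destination: str,
) -> None:
    """Write to a temp file, hand it to an uploader, then clean up."""
    tmp_dir = tempfile.mkdtemp(prefix="gpio-sketch-")
    tmp_path = os.path.join(tmp_dir, "output.parquet")
    try:
        write_local(tmp_path)          # step 1: produce the file locally
        upload(tmp_path, destination)  # step 2: push it to the remote URL
    finally:
        # step 3: remove the temp directory even if a step above failed
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Running the cleanup in a `finally` block means a failed upload does not leave the temporary copy behind, which matters given the extra local disk space a remote write needs.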
The --aws-profile global flag is available on all commands for AWS authentication. See also --s3-endpoint, --s3-region, and --s3-no-ssl for S3-compatible storage.
gpio publish upload¶
For more control over uploads, use gpio publish upload which provides:
- Parallel multipart uploads for large files
- Custom S3-compatible endpoints (MinIO, Ceph, etc.)
- Directory uploads with pattern filtering
- Progress tracking and error handling options
For simple remote outputs, commands write directly. For batch uploads or S3-compatible storage, use gpio publish upload.
Authentication¶
geoparquet-io uses standard cloud provider authentication. Configure your credentials once using your cloud provider's standard tools - no CLI flags needed for basic usage.
AWS S3¶
Credentials are automatically discovered in this order:
- Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
- AWS profile: ~/.aws/credentials via AWS_PROFILE env var or --aws-profile flag
- IAM role: EC2/ECS/EKS instance metadata (when running on AWS infrastructure)
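The discovery order above can be made concrete with a small sketch. The function `credential_source` is hypothetical and only inspects environment variables and an optional profile name; it does not read ~/.aws/credentials or query instance metadata, so the last branch is a label for "whatever the underlying libraries find next", not a real lookup.

```python
from typing import Mapping, Optional


def credential_source(env: Mapping[str, str], aws_profile: Optional[str] = None) -> str:
    """Report which credential source the documented order would pick first."""
    # 1. Explicit keys in the environment take priority.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"
    # 2. A named profile, from the --aws-profile flag or AWS_PROFILE.
    if aws_profile or env.get("AWS_PROFILE"):
        return "AWS profile"
    # 3. Nothing explicit: fall through to the default profile or an IAM role.
    return "default profile or IAM role"


if __name__ == "__main__":
    import os

    print(credential_source(os.environ))
```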
Examples:
# Use default credentials (from ~/.aws/credentials [default] or IAM role)
gpio add bbox s3://bucket/input.parquet s3://bucket/output.parquet
# Use environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
gpio add bbox s3://bucket/input.parquet s3://bucket/output.parquet
# Use a named AWS profile (convenient CLI flag)
gpio add bbox s3://bucket/input.parquet s3://bucket/output.parquet --aws-profile production
# Or set AWS_PROFILE environment variable (equivalent to --aws-profile)
export AWS_PROFILE=production
gpio add bbox s3://bucket/input.parquet s3://bucket/output.parquet
import os
import geoparquet_io as gpio
# Use default credentials (from ~/.aws/credentials [default] or IAM role)
gpio.read('s3://bucket/input.parquet').add_bbox().write('output.parquet')
# Use a named AWS profile
gpio.read('s3://bucket/input.parquet').add_bbox().upload(
    's3://bucket/output.parquet',
    profile='production'
)
# Or set AWS_PROFILE environment variable
os.environ['AWS_PROFILE'] = 'production'
gpio.read('s3://bucket/input.parquet').add_bbox().upload('s3://bucket/output.parquet')
Note: The --aws-profile flag is available on all commands and sets AWS_PROFILE for you.
Azure Blob Storage¶
Azure credentials are discovered automatically when reading files:
# Set account credentials via environment variables
export AZURE_STORAGE_ACCOUNT_NAME=myaccount
export AZURE_STORAGE_ACCOUNT_KEY=mykey
# Or use SAS token
export AZURE_STORAGE_SAS_TOKEN=mytoken
# Then use Azure URLs
gpio add bbox az://container/input.parquet az://container/output.parquet
Note: Azure support for reads is currently limited. For full Azure support, process files locally.
Google Cloud Storage¶
GCS support requires HMAC keys (not service account JSON):
# Generate HMAC keys at: https://console.cloud.google.com/storage/settings
export GCS_ACCESS_KEY_ID=your_access_key
export GCS_SECRET_ACCESS_KEY=your_secret_key
gpio add bbox gs://bucket/input.parquet gs://bucket/output.parquet
Note: DuckDB's GCS support requires HMAC keys, which differs from standard GCP authentication. For writes, obstore can use service account JSON via GOOGLE_APPLICATION_CREDENTIALS. For reads, use HMAC keys or process files locally.
S3-Compatible Storage¶
All commands support S3-compatible endpoints (MinIO, Cloudflare R2, source.coop, Ceph) via global flags:
# Read from source.coop
gpio --s3-endpoint data.source.coop inspect summary s3://bucket/file.parquet
# MinIO without SSL
gpio --s3-endpoint minio.local:9000 --s3-no-ssl \
    extract geoparquet s3://bucket/input.parquet output.parquet
# Upload to custom endpoint
gpio --s3-endpoint storage.example.com --s3-region eu-west-1 \
    publish upload data.parquet s3://bucket/file.parquet
import geoparquet_io as gpio
# Read from source.coop
table = gpio.read_partition(
    's3://bucket/data/',
    s3_endpoint='data.source.coop'
)
# Upload to MinIO
gpio.read('data.parquet').upload(
    's3://bucket/file.parquet',
    s3_endpoint='minio.example.com:9000',
    s3_use_ssl=False
)
Environment Variables¶
Instead of flags, you can set standard AWS environment variables:
| Variable | Equivalent Flag |
|---|---|
| AWS_ENDPOINT_URL | --s3-endpoint |
| AWS_REGION / AWS_DEFAULT_REGION | --s3-region |
| AWS_PROFILE | --aws-profile |
export AWS_ENDPOINT_URL=https://data.source.coop
gpio inspect summary s3://bucket/file.parquet
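To make the variable-to-flag mapping explicit, here is a minimal sketch that reads the three documented variables into a dictionary keyed by their equivalent flags. `s3_settings_from_env` is a hypothetical helper, and preferring AWS_REGION over AWS_DEFAULT_REGION when both are set is an assumption; the page does not state which one wins.

```python
from typing import Dict, Mapping, Optional


def s3_settings_from_env(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Map the documented AWS environment variables to their equivalent flags."""
    return {
        "--s3-endpoint": env.get("AWS_ENDPOINT_URL"),
        # Assumption: AWS_REGION is checked before AWS_DEFAULT_REGION.
        "--s3-region": env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION"),
        "--aws-profile": env.get("AWS_PROFILE"),
    }


if __name__ == "__main__":
    import os

    for flag, value in s3_settings_from_env(os.environ).items():
        print(flag, "=", value)
```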
SSL Detection¶
SSL is auto-detected from the endpoint URL:
- http:// → SSL off
- https:// or no scheme → SSL on
- --s3-no-ssl overrides in either case
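The detection rules above fit in a few lines. The function below is an illustrative sketch with a hypothetical name (`use_ssl`), not gpio's own code; its `no_ssl` argument plays the role of the --s3-no-ssl flag.

```python
def use_ssl(endpoint: str, no_ssl: bool = False) -> bool:
    """Decide whether SSL is on for an endpoint, following the rules above."""
    if no_ssl:
        # --s3-no-ssl wins regardless of the scheme.
        return False
    # Only an explicit http:// scheme turns SSL off; https:// or a bare
    # host:port keeps it on.
    return not endpoint.lower().startswith("http://")


if __name__ == "__main__":
    for ep in ("http://minio.local:9000", "https://data.source.coop", "minio.local:9000"):
        print(ep, "->", "SSL on" if use_ssl(ep) else "SSL off")
```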
Piping to Upload¶
For efficient workflows, process data locally and pipe to upload. This uses Arrow IPC streaming with minimal overhead:
# Process and upload in one pipeline
gpio extract --bbox "-122.5,37.5,-122.0,38.0" input.parquet | \
    gpio add bbox - | \
    gpio sort hilbert - local_output.parquet && \
    gpio publish upload local_output.parquet s3://bucket/output.parquet --aws-profile prod
Or use the Python API for zero-copy streaming:
import geoparquet_io as gpio
# Process in memory, then upload
table = gpio.read('input.parquet') \
    .extract(bbox=(-122.5, 37.5, -122.0, 38.0)) \
    .add_bbox() \
    .sort_hilbert()
# Upload directly (writes temp file, uploads, cleans up)
table.upload('s3://bucket/output.parquet', profile='prod')
See Command Piping for more streaming patterns.
Exceptions¶
STAC generation (gpio publish stac) requires local files because asset paths reference local storage.
Notes¶
- Remote writes use temporary local storage (~2× output file size required)
- HTTPS wildcards (*.parquet) not supported
- For very large files (>10 GB), consider processing locally for better performance
- S3-compatible endpoints work with all commands via --s3-endpoint
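Because remote writes need roughly twice the output size in temporary local storage, it can be worth checking free space before a large job. The sketch below uses only the standard library. `has_room_for_remote_write` is a hypothetical helper, and it assumes the temp file lands in the system temp directory unless another directory is passed.

```python
import shutil
import tempfile
from typing import Optional


def has_room_for_remote_write(output_bytes: int, tmp_dir: Optional[str] = None) -> bool:
    """Check that the temp directory has about 2x the expected output size free."""
    tmp_dir = tmp_dir or tempfile.gettempdir()
    free_bytes = shutil.disk_usage(tmp_dir).free
    return free_bytes >= 2 * output_bytes


if __name__ == "__main__":
    expected = 5 * 1024**3  # e.g. an output expected to be about 5 GiB
    print("enough temp space:", has_room_for_remote_write(expected))
```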