extract Command¶
For detailed usage and examples, see the Extract User Guide.
Quick Reference¶
gpio extract --help
This will show all available options for the extract command.
Subcommands¶
The extract command supports multiple data sources:
extract geoparquet (default)¶
Extract from GeoParquet files. This is the default when no subcommand is specified.
gpio extract input.parquet output.parquet --bbox -122,37,-121,38
gpio extract geoparquet input.parquet output.parquet # Explicit
extract bigquery¶
Extract from BigQuery tables to GeoParquet.
gpio extract bigquery PROJECT.DATASET.TABLE output.parquet
Options:
--project- GCP project ID (overrides project in TABLE_ID)--credentials-file- Path to service account JSON file--include-cols- Comma-separated columns to include--exclude-cols- Comma-separated columns to exclude--where- SQL WHERE clause (BigQuery SQL syntax)--bbox- Bounding box filter:minx,miny,maxx,maxy--bbox-mode- Filter mode:auto(default),server, orlocal--bbox-threshold- Row count threshold for auto mode (default: 500000)--limit- Maximum rows to extract--geography-column- GEOGRAPHY column name (auto-detected if not set)--dry-run- Show SQL without executing--show-sql- Print SQL during execution
Authentication (in order of precedence):
--credentials-file: Path to service account JSONGOOGLE_APPLICATION_CREDENTIALSenvironment variablegcloud auth application-defaultcredentials
Bbox Filtering Modes:
The --bbox-mode option controls where spatial filtering occurs:
auto(default): Uses table row count to decide. Tables ≥ threshold use server-side, smaller tables use localserver: Always push filter to BigQuery usingST_INTERSECTS()- best for large tableslocal: Always filter locally in DuckDB after fetch - best for small tables
The --bbox-threshold sets the row count where auto switches to server-side filtering. Default is 500,000 rows. See the User Guide for detailed tradeoff analysis.
Limitations
Cannot read BigQuery views or external tables - this is a limitation of the BigQuery Storage Read API. BIGNUMERIC columns are not supported (exceeds DuckDB's precision).
Options¶
Column Selection¶
--include-cols COLS- Comma-separated columns to include (geometry and bbox auto-added unless in --exclude-cols)--exclude-cols COLS- Comma-separated columns to exclude (can be used with --include-cols to exclude geometry/bbox)
Spatial Filtering¶
--bbox BBOX- Bounding box filter:xmin,ymin,xmax,ymax--geometry GEOM- Geometry filter: GeoJSON, WKT, @filepath, or - for stdin--use-first-geometry- Use first geometry if FeatureCollection contains multiple
SQL Filtering¶
--where CLAUSE- DuckDB WHERE clause for filtering rows--limit N- Maximum number of rows to extract
Output Options¶
# Compression settings
--compression [ZSTD|GZIP|BROTLI|LZ4|SNAPPY|UNCOMPRESSED]
--compression-level [1-22]
# Row group sizing
--row-group-size [exact row count]
--row-group-size-mb [target size like '256MB' or '1GB']
# Workflow options
--dry-run # Preview SQL without executing
--verbose # Detailed output
--preview # Preview results (partition commands)
--hive # Use Hive-style partitioning
--overwrite # Overwrite existing files
--profile NAME # AWS profile for S3 operations
Examples¶
# Extract all data
gpio extract input.parquet output.parquet
# Extract specific columns
gpio extract data.parquet output.parquet --include-cols id,name,area
# Exclude columns
gpio extract data.parquet output.parquet --exclude-cols internal_id,temp
# Filter by bounding box
gpio extract data.parquet output.parquet --bbox -122.5,37.5,-122.0,38.0
# Filter by geometry from file
gpio extract data.parquet output.parquet --geometry @boundary.geojson
# SQL WHERE filter
gpio extract data.parquet output.parquet --where "population > 10000"
# Combined filters
gpio extract data.parquet output.parquet \
--include-cols id,name \
--bbox -122.5,37.5,-122.0,38.0 \
--where "status = 'active'"
# Extract from remote file
gpio extract s3://bucket/data.parquet output.parquet --bbox 0,0,10,10
# Preview query with dry run
gpio extract data.parquet output.parquet \
--where "name LIKE '%Hotel%'" \
--dry-run
Column Selection Behavior¶
- include-cols only: Select specified columns + geometry + bbox (if exists)
- exclude-cols only: Select all columns except specified
- Both: Select include-cols, but exclude-cols can remove geometry/bbox
- Geometry and bbox always included unless explicitly excluded
Spatial Filtering Details¶
--bbox: Uses bbox column for fast filtering when available (bbox covering), otherwise calculates from geometry--geometry: Supports inline GeoJSON/WKT, file reference (@filepath), or stdin (-)- CRS warning: Tool warns if bbox looks like lat/long but data uses projected CRS
WHERE Clause Notes¶
- Accepts any valid DuckDB SQL WHERE expression
- Column names with special characters need double quotes in SQL:
"crop:name" - Shell escaping varies by platform - see User Guide for examples
- Dangerous SQL keywords (DROP, DELETE, etc.) are blocked for safety