This document provides an overview of the utility scripts available in the project.
The scripts/ directory contains various utility scripts for managing the application's data, testing functionality, and performing maintenance tasks.
Purpose: Processes and generates Allmaps annotations for resources in the database.
Key Features:
- Processes individual resources or all resources in the database
- Generates Allmaps IDs and annotations
- Updates item records with Allmaps attributes
- Supports reprocessing of existing resources
- Implements logging and error handling
Usage:
# Process a specific item
python process_allmaps.py --item-id "9139578d-7803-4f4f-9ed3-a62ab810a256"
# Process all items
python process_allmaps.py --all
Requirements:
- Database connection must be properly configured
- Item must have a valid manifest URL
- Item must not already have Allmaps attributes (unless reprocessing)
Output:
- Updates the `item_allmaps` table with generated Allmaps data
- Logs processing status and any errors encountered
Purpose: Manages and populates relationship data between documents in the database.
Key Features:
- Processes various types of document relationships (isPartOf, hasMember, isVersionOf, etc.)
- Maintains bidirectional relationships
- Clears existing relationships before populating new ones
- Implements logging to both console and file
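As a hedged illustration of how bidirectional population can work, clearing first and then writing each relationship together with its inverse, here is an in-memory sketch; the inverse-name table and storage shape are invented stand-ins for the real database:

```python
# Illustrative inverse-name mapping; the real script may support more
# predicate types than shown here.
INVERSE = {"isPartOf": "hasMember", "isVersionOf": "hasVersion"}

def populate(relationships, store=None):
    """Clear the store, then insert each relationship and its inverse."""
    store = {} if store is None else store
    store.clear()  # existing relationships are cleared before repopulating
    for subject, predicate, obj in relationships:
        store.setdefault(subject, set()).add((predicate, obj))
        inverse = INVERSE.get(predicate)
        if inverse:
            # maintain the bidirectional link from the object back to the subject
            store.setdefault(obj, set()).add((inverse, subject))
    return store
```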
Usage:
python scripts/populate_relationships.py
From the project root you can run:
make populate-relationships
(See relationships.md for how relationships interact with search and how to query the DB.)
Purpose: Generates and stores embeddings for FAST gazetteer data using OpenAI's API.
Key Features:
- Uses OpenAI's text-embedding-3-small model
- Processes records in batches
- Stores embeddings in the database
- Implements error handling and logging
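Batch processing of this kind usually rests on a small chunking helper. The sketch below is generic Python; the batch size and the embedding API call itself are assumptions, not the script's actual values:

```python
from itertools import islice

def batched(records, size=100):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch

# A real run would then call something like the OpenAI embeddings API once
# per batch and store the vectors, e.g. (pseudocode, not executed here):
#   for batch in batched(rows):
#       vectors = embed(batch)  # text-embedding-3-small
#       save_embeddings(batch, vectors)
```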
Requirements:
- OpenAI API key must be set in environment variables
Usage:
python scripts/generate_fast_embeddings.py
Purpose: Executes database migrations.
Key Features:
- Supports multiple migration types
- Implements command-line argument parsing
- Provides logging of migration progress
Available Migrations:
add_fast_gazetteer: Adds FAST gazetteer data to the database
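A minimal sketch of how such a named-migration dispatcher can be wired with argparse; the migration table and return value here are illustrative:

```python
import argparse

# Hypothetical migration registry; the real script's functions differ.
MIGRATIONS = {"add_fast_gazetteer": lambda: "added FAST gazetteer data"}

def main(argv=None):
    parser = argparse.ArgumentParser(description="Run a named migration.")
    # `choices` makes unknown migration names fail with a usage message
    parser.add_argument("migration", choices=sorted(MIGRATIONS))
    args = parser.parse_args(argv)
    return MIGRATIONS[args.migration]()
```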
Usage:
python scripts/run_migration.py add_fast_gazetteer
Purpose: Imports OCLC FAST Dataset Geographic entries into the database.
Key Features:
- Asynchronous data import
- Progress tracking and reporting
- Error handling and logging
- Performance metrics (records processed, elapsed time)
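An asynchronous import loop with this kind of metrics reporting might look like the following sketch, where `insert` stands in for real database writes:

```python
import asyncio
import time

async def import_records(records, insert):
    """Await `insert` for each record and report simple performance metrics."""
    start = time.monotonic()
    processed = 0
    for record in records:
        await insert(record)  # real code would also catch and log errors here
        processed += 1
    return {
        "records_processed": processed,
        "elapsed_seconds": time.monotonic() - start,
    }
```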
Usage:
python scripts/import_fast.py
Purpose: Clears the Redis cache used by the application.
Key Features:
- Clears all Redis databases
- Reports memory usage after clearing
- Configurable Redis connection parameters
- Error handling and logging
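The flush-and-report flow can be sketched against a duck-typed client; the method names follow redis-py, but the wrapper itself is hypothetical, which keeps the logic testable without a live server:

```python
def clear_and_report(client):
    """Flush every Redis database, then report memory usage afterwards."""
    client.flushall()  # clears all databases on the server
    # INFO's memory section includes a human-readable used-memory figure
    return client.info("memory").get("used_memory_human")
```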
Usage:
python scripts/clear_cache.py
Purpose: Clears cache by type using tag-based invalidation. Used by `make kamal-clear-cache` when the exec container cannot reach the public API URL (avoids curl/HTTP).
Key Features:
- Tag-based invalidation (search, resource, suggest, map, all)
- Connects directly to Redis; no HTTP required
- Same behavior as the admin cache-clear endpoint
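One plausible shape for tag-based invalidation, with plain dicts standing in for Redis and an index mapping each tag to the cache keys it covers (the storage layout is an assumption, not the script's actual schema):

```python
# Tag names mirror the CLI; "all" expands to every tag.
TAGS = ("search", "resource", "suggest", "map")

def invalidate(cache, tag_index, tag="search"):
    """Delete every cached entry registered under `tag`; return the count removed."""
    tags = TAGS if tag == "all" else (tag,)
    removed = 0
    for t in tags:
        for key in tag_index.pop(t, set()):
            if cache.pop(key, None) is not None:
                removed += 1
    return removed
```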
Usage:
python scripts/clear_cache_by_type.py [search|resource|suggest|map|all]
python scripts/clear_cache_by_type.py search # default
python scripts/clear_cache_by_type.py all
Purpose: Tests the functionality of gazetteer API endpoints.
Key Features:
- Tests multiple gazetteer sources (GeoNames, Who's on First, BTAA)
- Provides detailed output of test results
- Configurable base URL for testing different environments
- Pretty-prints JSON responses
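Pretty-printing JSON responses typically reduces to `json.dumps` with an indent; a minimal helper, sorting keys for deterministic output (which the real script may not do), might be:

```python
import json

def pretty(payload) -> str:
    """Render a decoded JSON payload with two-space indentation."""
    return json.dumps(payload, indent=2, sort_keys=True)
```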
Usage:
python scripts/test_gazetteer_api.py [--base-url URL]
Purpose: Bootstraps the shared deploy SSH account used for Kamal deployments on remote hosts.
Key Features:
- Creates the `deploy` group and user if needed
- Adds `deploy` to the `docker` group
- Seeds `/home/deploy/.ssh/authorized_keys` from the current remote operator
- Prepares `/var/lib/btaa-geospatial-api` for shared bind mounts
- Pre-creates the Elasticsearch bind-mount directory with GID 0 write access for fresh hosts
Usage:
backend/scripts/bootstrap_kamal_deploy_user.sh \
--host lib-geoportal-dev-web-01.oit.umn.edu \
--ssh-user your_existing_admin_user
Requirements:
- Run from the repo root on your local machine
- The existing remote SSH user must already have passwordless `sudo`
- Docker should already be installed on the target host
See also docs/backend/kamal_deployment.md for the full Kamal runbook.
Purpose: Maintains analytics partitions, rollups, and retention.
See also Analytics Program for the full analytics architecture and operating model.
Key Features:
- Ensures monthly partitions exist for raw `analytics_*` tables
- Rolls up completed daily analytics into compact summary tables
- Drops expired raw partitions only after rollups have caught up
- Prints relation sizes for analytics parents, partitions, and rollups
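Ensuring monthly partitions usually starts from a deterministic naming helper. The `parent_yYYYYmMM` pattern below is an assumed naming scheme, not necessarily the one the script uses:

```python
from datetime import date

def monthly_partitions(parent: str, start: date, months: int) -> list[str]:
    """Return partition names for `months` consecutive months from `start`."""
    names = []
    year, month = start.year, start.month
    for _ in range(months):
        names.append(f"{parent}_y{year}m{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            month, year = 1, year + 1
    return names
```

The maintenance mode would create any of these that do not yet exist, and retention would drop the oldest ones once their rollups are complete.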
Usage:
cd backend
python scripts/manage_analytics_storage.py --mode maintenance
python scripts/manage_analytics_storage.py --mode size-report
python scripts/manage_analytics_storage.py --mode ensure
From the project root you can run:
make analytics-maintenance
make analytics-size-report
All scripts share some common features:
- Logging configuration
- Error handling
- Environment variable support
- Python path configuration for module imports
Several scripts require specific environment variables:
- `OPENAI_API_KEY`: Required by `generate_fast_embeddings.py`
- `REDIS_HOST` and `REDIS_PORT`: Used by `clear_cache.py`
Rate limiting for the public API is enforced by middleware backed by Redis:
- Documentation shell routes (`/api/docs`, `/api/redoc`, `/api/openapi.json`, and docs branding assets) bypass throttling so users can always load the API reference. Interactive requests made from those docs still hit the normal API endpoints and remain rate limited.
- Tables and seed data are created by the standard migration script:
.venv/bin/python scripts/run_migrations.py
This will ensure:
- `api_service_tiers` – service tier definitions and per-minute limits
- `api_keys` – hashed API keys associated with tiers
- `analytics_api_usage_logs` – analytics for incoming requests
- `analytics_searches` – normalized Geoportal search analytics
- `analytics_search_impressions` – result impressions with rank/page/view
- `analytics_events` – resource views, result clicks, downloads, and outbound link events
- `analytics_daily_api_usage_metrics`, `analytics_daily_search_metrics`, `analytics_daily_resource_metrics` – compact daily rollups
- `analytics_maintenance_state` – rollup checkpoint state
- Runtime configuration is controlled via environment variables (see also ../development.md):
RATE_LIMIT_ENABLED=true
RATE_LIMIT_REDIS_DB=2
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_PASSWORD=optional_password
ANALYTICS_RETENTION_API_USAGE_DAYS=30
ANALYTICS_RETENTION_SEARCH_DAYS=90
ANALYTICS_RETENTION_IMPRESSION_DAYS=30
ANALYTICS_RETENTION_EVENT_DAYS=90
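One common implementation of a per-minute limit is a fixed-window counter keyed by API key and minute. This sketch uses a plain dict where the middleware would use Redis `INCR`/`EXPIRE`; the key shape and limit values are assumptions, not the middleware's actual design:

```python
import time

def allow(counters, key, limit, now=None):
    """Return True if `key` is still under `limit` requests in the current minute."""
    window = int((now if now is not None else time.time()) // 60)
    count = counters.get((key, window), 0)
    if count >= limit:
        return False  # over the per-minute limit for this window
    counters[(key, window)] = count + 1
    return True
```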
To create and manage API keys from the command line, you can call the admin endpoints with basic auth, for example:
curl -u "$ADMIN_USERNAME:$ADMIN_PASSWORD" \
-X POST "http://localhost:8000/api/v1/admin/api-keys" \
-H "Content-Type: application/json" \
-d '{"tier_name": "anonymous", "name": "local test key"}'
The response will include the plaintext `api_key` (shown once) and the numeric `key_id`. You can then use the key in API requests via:
- `X-API-Key` header
- `Authorization: Bearer <api_key>` header
- `api_key=<api_key>` query parameter
`LOG_PATH`: Optional path for log files.
All scripts implement logging with the following characteristics:
- Log level: INFO by default
- Format: Timestamp, logger name, level, and message
- Output: Console and/or file depending on the script
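A logging setup matching those defaults could look like this sketch; the `LOG_PATH`-style file handler is optional, and the exact format string is an assumption:

```python
import logging

def configure_logging(log_path=None):
    """INFO-level logging with timestamp, logger name, level, and message."""
    handlers = [logging.StreamHandler()]  # console output
    if log_path:  # optional file output, e.g. taken from LOG_PATH
        handlers.append(logging.FileHandler(log_path))
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        handlers=handlers,
        force=True,  # replace any handlers configured earlier
    )
```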