moderneinc/mass-ingest-example

Mass ingest

Production-ready examples for ingesting large numbers of repositories into Moderne using the Moderne CLI.

Choose your deployment stage

This repository provides three progressive deployment examples. Each stage is completely independent and self-contained - you can start at any stage based on your needs.

1-quickstart: Get started quickly

Best for:

  • Quick proof of concept
  • Small repository counts (< 1,000 repos)
  • Development and testing
  • Learning how mass-ingest works

What's included:

  • Single Docker container
  • Manual docker commands
  • Basic monitoring via CLI metrics endpoint

Resources needed:

  • 2 CPU cores
  • 16 GB RAM
  • 32+ GB disk

→ Start with 1-quickstart


2-observability: Add monitoring and visibility

Best for:

  • Production use on a single host
  • Small repository counts (< 1,000 repos)
  • Medium repository counts with manual scaling (< 10,000 repos)
  • Need for operational visibility
  • Continuous ingestion workflows

What's included:

  • Docker Compose orchestration
  • Integrated Grafana dashboards
  • Prometheus metrics collection
  • Automated restarts and scheduling

Resources needed:

  • 3 CPU cores (2 for mass-ingest, 1 for monitoring)
  • 18 GB RAM (16 for mass-ingest, 2 for monitoring)
  • 50+ GB disk

→ Start with 2-observability


3-scalability: Scale to production

Best for:

  • Large repository counts (> 10,000 repos)
  • Parallel processing requirements
  • Production deployment with automatic scaling
  • Enterprise environments

What's included:

  • Cloud-native batch services (AWS Batch, GCP Batch)
  • Terraform infrastructure as code
  • Scheduled automation (daily/weekly)
  • Auto-scaling compute — scales to zero when idle
  • Production monitoring and cost optimization

Resources needed:

  • Cloud account (AWS or GCP)
  • Terraform >= 1.0
  • VPC with internet access
  • Configurable compute (scales from 0 to 256+ vCPUs)

→ Start with 3-scalability


Repository structure

mass-ingest-example/
├── Dockerfile            # Container image definition (used by all stages)
├── Dockerfile.fips       # FIPS 140-2/140-3 compliant variant (UBI 9)
├── publish.sh            # Main ingestion script
├── publish.ps1           # PowerShell version
├── repos.csv             # Example repository list
│
├── 1-quickstart/         # Single container deployment
│   └── README.md
│
├── 2-observability/      # Docker Compose with monitoring
│   ├── docker-compose.yml
│   ├── .env.example
│   ├── observability/    # Grafana and Prometheus configs
│   └── README.md
│
├── 3-scalability/        # Cloud-native batch deployment (multi-cloud)
│   ├── README.md          # Platform comparison and architecture overview
│   ├── aws-batch/         # AWS Batch + EventBridge + Secrets Manager
│   │   ├── chunk.sh
│   │   ├── terraform/
│   │   └── README.md
│   ├── gcp-batch/         # GCP Batch + Cloud Scheduler + Secret Manager
│   │   ├── task.sh
│   │   ├── terraform/
│   │   └── README.md
│
└── diagnostics/          # Comprehensive diagnostic system
    ├── diagnose.sh       # Main orchestration script
    ├── lib/              # Shared libraries
    │   ├── core.sh       # Colors, output formatting, utilities
    │   └── latency.sh    # Latency and throughput testing
    └── checks/           # Modular check scripts
        ├── system.sh     # CPUs, memory, disk space
        ├── tools.sh      # git, curl, jq, etc.
        ├── docker.sh     # Container detection, CPU arch, emulation
        ├── threads.sh    # Cgroup PID limits, ulimit, kernel threads-max
        ├── java.sh       # JDKs, JAVA_HOME
        ├── cli.sh        # mod CLI version, config
        ├── config.sh     # Env vars, credentials
        ├── repos-csv.sh  # File validation, columns, origins
        ├── network.sh    # Connectivity to all hosts
        ├── ssl.sh        # SSL handshakes, cert expiry
        ├── auth-publish.sh # Write/read/delete test
        ├── auth-scm.sh   # .git-credentials validation
        ├── publish-latency.sh # Publish URL latency and throttling
        ├── maven-repos.sh # Maven repos from settings.xml
        ├── dependency-repos.sh # User-specified repos (Gradle, etc.)
        └── scm-repos.sh  # SCM connectivity per origin

Prerequisites (all stages)

Before starting with any stage, you'll need:

  1. Repository list: Create repos.csv with repositories to ingest

    cloneUrl,branch,origin,path
    https://github.com/org/repo1,main,github.com,org/repo1
    https://github.com/org/repo2,main,github.com,org/repo2
  2. Artifact repository: Maven-formatted repository for publishing LSTs

    • Artifactory, Nexus, or similar
    • Dedicated repository recommended (separate from other artifacts)
    • Credentials with publish permissions
  3. Source control access: If repositories require authentication

    • Service account with read access to all repositories
    • Personal access token or credentials
  4. Docker: Installed and running (for stages 1 and 2)

  5. Bash: Required in the container image (Alpine users: apk add bash)

  6. Cloud account: AWS or GCP account (required only for stage 3)

Quick comparison

| Feature | 1-quickstart | 2-observability | 3-scalability |
|---|---|---|---|
| Deployment | Single container | Docker Compose | Cloud-native batch + Terraform |
| Monitoring | CLI metrics endpoint | Grafana + Prometheus | Cloud-native logging + optional Grafana |
| Scaling | Manual | Single host | Auto-scaling parallel workers |
| Scheduling | Manual/cron | Docker restart policy | Cloud-native scheduler |
| Cost | Lowest | Low | Scales with usage |
| Setup time | 15 minutes | 30 minutes | 1-2 hours |
| Ideal repo count | < 100 | 100-1000 | 1000+ |
| Parallel processing | No | No | Yes |

Common configuration

All stages share the same core configuration needs:

Environment variables

  • PUBLISH_URL - Artifact repository URL (e.g., https://artifactory.example.com/artifactory/moderne-ingest/)
  • PUBLISH_USER - Repository username
  • PUBLISH_PASSWORD - Repository password
  • PUBLISH_TOKEN - Alternative to user/password for JFrog
  • MODERNE_TENANT - Your Moderne tenant URL (optional)
  • MODERNE_TOKEN - Moderne API token (optional)
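A pre-flight check can catch missing publish settings before a container starts. The sketch below is illustrative only: `check_publish_env` is a hypothetical helper, not part of this repository, and it assumes that either `PUBLISH_TOKEN` or the `PUBLISH_USER`/`PUBLISH_PASSWORD` pair is enough to authenticate.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not one of the repository's scripts.
# Returns 0 when the publish settings look complete, 1 otherwise.
check_publish_env() {
  local status=0
  if [ -z "${PUBLISH_URL:-}" ]; then
    echo "PUBLISH_URL is not set" >&2
    status=1
  fi
  # Assumption: a token OR a user/password pair is sufficient.
  if [ -z "${PUBLISH_TOKEN:-}" ] && { [ -z "${PUBLISH_USER:-}" ] || [ -z "${PUBLISH_PASSWORD:-}" ]; }; then
    echo "Set PUBLISH_TOKEN, or both PUBLISH_USER and PUBLISH_PASSWORD" >&2
    status=1
  fi
  return "$status"
}
```

Calling `check_publish_env || exit 1` at the top of a wrapper script stops early with a readable message instead of failing later during publishing.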

Repository authentication

For private repositories, credentials are mounted at runtime (never baked into images):

  • .git-credentials file for HTTPS
  • .ssh directory for SSH

See each stage's README for specific mounting instructions.
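As a rough illustration of preparing the HTTPS credentials file, the sketch below writes a single entry in the standard git credential-store format (`https://USER:TOKEN@HOST`) and makes the file read-only. `write_git_credentials` is a hypothetical helper and the host, user, and token values are placeholders; the stage READMEs remain the reference for where to mount the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes one git credential-store entry and locks the file down.
# Note: user names or tokens containing special characters must be URL-encoded first.
write_git_credentials() {
  local file="$1" host="$2" user="$3" token="$4"
  rm -f "$file"   # an existing read-only file could not be overwritten
  # Create the file with owner-only permissions from the start.
  ( umask 077 && printf 'https://%s:%s@%s\n' "$user" "$token" "$host" > "$file" )
  chmod 400 "$file"   # read-only for the owner
}

# Example (placeholder values):
#   write_git_credentials ./.git-credentials github.com svc-user "$SCM_TOKEN"
```

Keeping the file outside the image and mounting it at runtime matches the approach described above, where credentials are never baked into images.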

Repository list format

The repos.csv file columns:

  • cloneUrl (required) - Full git clone URL
  • origin (required) - Source identifier (e.g., github.com)
  • path (required) - Repository path/identifier
  • branch (optional) - Branch to build (uses remote default if not specified)
  • gradleVersion (optional) - Selects a specific Gradle version for repos without a wrapper (must match an installation registered via mod config build gradle installation edit)

See repos.csv documentation for advanced options.
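The required columns listed above can be checked with a few lines of shell before a run. This is a minimal sketch, assuming a comma-separated header on the first line; `validate_repos_header` is a hypothetical name and is separate from the repository's own `repos-csv.sh` check.

```shell
#!/usr/bin/env bash
# Hypothetical validator: checks that the header row names every required column.
validate_repos_header() {
  local file="$1" header col missing=0
  # Read the first line and strip a trailing carriage return (Windows line endings).
  header="$(head -n 1 "$file" | tr -d '\r')"
  for col in cloneUrl origin path; do
    case ",$header," in
      *",$col,"*) ;;   # column present
      *) echo "missing required column: $col" >&2; missing=1 ;;
    esac
  done
  return "$missing"
}
```

For example, `validate_repos_header repos.csv && echo "header OK"` passes for the header `cloneUrl,branch,origin,path` regardless of column order.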

Dependency repositories (optional)

Create dependency-repos.csv to test connectivity to Maven/Gradle dependency repositories during diagnostics:

url,username,password,token
https://nexus.example.com/releases,${NEXUS_USER},${NEXUS_PASSWORD},
https://artifactory.example.com/libs,,,${ARTIFACTORY_TOKEN}
https://repo.spring.io/release,,,
  • Use username + password for basic auth
  • Use token for bearer auth (leave username/password empty)
  • Leave all auth fields empty for anonymous access
  • Use ${ENV_VAR} syntax to reference environment variables

See dependency-repos.csv.example for a template.
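The auth rules above can be read as a small decision per row. The sketch below is an illustration, not the logic in `dependency-repos.sh`: `auth_mode` and `list_auth_modes` are hypothetical names, and giving the token precedence when both kinds of credentials are present is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classifies one row's auth fields.
auth_mode() {
  local username="$1" password="$2" token="$3"
  if [ -n "$token" ]; then
    echo bearer          # assumption: a token wins over username/password
  elif [ -n "$username" ] && [ -n "$password" ]; then
    echo basic
  else
    echo anonymous
  fi
}

# Prints "<url> <mode>" for every data row, skipping the header line.
# ${ENV_VAR} references are left unexpanded; a non-empty field is enough here.
list_auth_modes() {
  local file="$1" url username password token
  tail -n +2 "$file" | tr -d '\r' |
    while IFS=, read -r url username password token || [ -n "$url" ]; do
      [ -n "$url" ] || continue
      printf '%s %s\n' "$url" "$(auth_mode "$username" "$password" "$token")"
    done
}
```

Running `list_auth_modes dependency-repos.csv` on the example file above would label the three rows basic, bearer, and anonymous respectively.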

Build arguments

All Dockerfiles support:

  • MODERNE_CLI_VERSION - Specific CLI version (defaults to latest release)
  • MODERNE_CLI_STAGE - release (default) for latest release from Maven Central, snapshot for latest snapshot
  • MODERNE_CLI_RELEASES_REPO - Maven repository for release CLI artifacts (defaults to https://repo1.maven.org/maven2)
  • MODERNE_CLI_SNAPSHOTS_REPO - Maven repository for snapshot CLI artifacts (defaults to https://central.sonatype.com/repository/maven-snapshots)
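Build arguments are passed with `docker build --build-arg NAME=value`. The sketch below only prints the command so it can be reviewed first; `build_cmd` is a hypothetical helper, and the version number and the `mass-ingest:<version>` tag are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) a docker build command that pins the CLI version.
# $1 = CLI version, $2 = stage (release or snapshot; defaults to release)
build_cmd() {
  local version="$1" stage="${2:-release}"
  printf 'docker build --build-arg MODERNE_CLI_VERSION=%s --build-arg MODERNE_CLI_STAGE=%s -t mass-ingest:%s .\n' \
    "$version" "$stage" "$version"
}

# Example (illustrative version number):
#   build_cmd 3.56.0
```

Pinning `MODERNE_CLI_VERSION` keeps image rebuilds reproducible, whereas the default resolves to whatever the latest release is at build time.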

FIPS-compliant image

A separate Dockerfile.fips is provided for environments that require FIPS 140-2/140-3 compliance. It uses Red Hat UBI 9 with the FIPS crypto policy enabled, which restricts all cryptographic operations to FIPS-approved algorithms.

Build:

docker build -f Dockerfile.fips -t mass-ingest:fips .

Build arguments (in addition to MODERNE_CLI_VERSION):

| Argument | Default | Description |
|---|---|---|
| MAVEN_REPO_URL | https://repo1.maven.org/maven2 | Maven repository for CLI and Maven |
| GRADLE_DIST_URL | https://services.gradle.org/distributions | Gradle distribution download URL |
| GRADLE_VERSION | 8.14 | Primary Gradle version to install |
| GRADLE_EXTRA_VERSIONS | (empty) | Comma-separated additional Gradle versions (e.g., 6.9.4,5.6.4) |
| MAVEN_VERSION | 3.9.11 | Maven version to install |

Using internal mirrors:

Public download servers (Maven Central, Gradle services) may not support FIPS-compliant TLS cipher suites. The Dockerfile uses a separate download stage without FIPS restrictions to handle this. To make the entire build FIPS-compliant end to end, point the download URLs at internal mirrors that support FIPS-compliant TLS:

docker build -f Dockerfile.fips \
  --build-arg MAVEN_REPO_URL=https://nexus.internal/repository/maven-central \
  --build-arg GRADLE_DIST_URL=https://nexus.internal/repository/gradle-dist \
  -t mass-ingest:fips .

When using internal mirrors, you can remove the downloader stage from the Dockerfile and move its ARG and RUN commands into the base stage (after the dnf install that provides curl). This makes the entire build FIPS-compliant.

Run: All docker run commands from the stage READMEs work unchanged — just substitute the image name:

docker run --rm \
  -p 8080:8080 \
  -v $(pwd)/data:/var/moderne \
  -e PUBLISH_URL=https://your-artifactory.com/artifactory/moderne-ingest/ \
  -e PUBLISH_USER=your-username \
  -e PUBLISH_PASSWORD=your-password \
  mass-ingest:fips

JDK 8 and 11 TLS 1.3 workaround:

RHEL 9 backported TLS 1.3 into JDK 8 and 11, but the backported P11AEADCipher has a bug in AES-GCM decryption that causes TLS 1.3 handshakes to fail with CKR_ENCRYPTED_DATA_INVALID when running through NSS in FIPS mode. JDK 17+ has the fix. The Dockerfile disables TLS 1.3 for JDK 8 and 11, forcing them to use TLS 1.2, which works correctly. This is strictly more restrictive than stock FIPS: the same algorithm restrictions apply, plus TLS 1.3 is disabled. JDK 17+ is unaffected and uses TLS 1.3 normally.

Key differences from the standard image:

| Aspect | Standard (Dockerfile) | FIPS (Dockerfile.fips) |
|---|---|---|
| Base image | Eclipse Temurin (Ubuntu) | Red Hat UBI 9 |
| JDK provider | Adoptium Temurin | Red Hat OpenJDK |
| JDK versions | 8, 11, 17, 21, 25 | 8, 11, 17, 21, 25 |
| Crypto policy | Default (unrestricted) | FIPS (update-crypto-policies --set) |
| Certificate mgmt | Per-JDK keytool | System trust store (update-ca-trust) |
| Package manager | apt-get | dnf |

Note

For full kernel-level FIPS compliance, the host OS must also be running in FIPS mode. The container enforces FIPS-approved algorithms at the userspace level (OpenSSL, Java security providers) regardless of host configuration.

Generating repository lists

We provide scripts to generate repos.csv from various sources:

Diagnostics

The diagnostics/ directory contains a comprehensive diagnostic system to validate your mass-ingest setup before starting ingestion.

Diagnostic mode (full validation)

Run comprehensive diagnostics without starting ingestion:

DIAGNOSE=true docker compose up

This validates the entire setup and produces a detailed report:

  • System (CPUs, memory, disk space)
  • Required tools (git, curl, jq, unzip, tar)
  • Runtime environment (container detection, CPU architecture, emulation)
  • Thread/process limits (cgroup PID limits, ulimit, kernel threads-max)
  • Java/JDKs (available JDKs, JAVA_HOME)
  • Moderne CLI (version, build config, proxy, trust store, tenant)
  • Configuration (env vars, credentials, git credentials)
  • repos.csv (file validation, columns, origins, sample entries)
  • Network (Maven Central, Gradle plugins, publish URL, SCM hosts)
  • SSL/Certificates (handshakes, expiry warnings)
  • Authentication (publish write/read/delete test, SCM credentials validation)
  • Publish latency (throughput testing, rate limit detection)
  • Maven repositories (dependency repo connectivity from settings.xml)
  • Dependency repositories (user-specified repos from dependency-repos.csv)
  • SCM repositories (connectivity testing per origin from repos.csv)

The container exits with code 0 if all checks pass, or 1 if any failures are detected.
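Because the result is reported through the exit code, a deployment script can gate on it. The sketch below wraps any diagnostics command; `run_gate` is a hypothetical helper. With Docker Compose, the container's exit code is returned by `docker compose up` only when `--exit-code-from SERVICE` is given, and the service name `mass-ingest` in the usage comment is an assumption, so check `docker-compose.yml` for the actual name.

```shell
#!/usr/bin/env bash
# Hypothetical gate: runs the given command and reports whether diagnostics passed.
run_gate() {
  local rc=0
  "$@" || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "diagnostics passed"
  else
    echo "diagnostics failed (exit $rc)" >&2
    return 1
  fi
}

# Example (service name is an assumption):
#   run_gate env DIAGNOSE=true docker compose up --exit-code-from mass-ingest
```

In a CI job, `run_gate ... || exit 1` stops the pipeline before any ingestion is attempted against a broken setup.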

Use cases:

  • Initial setup validation before first real run
  • After configuration changes before deploying
  • Troubleshooting when something stops working
  • Generating diagnostic output to send to Moderne support

Diagnostics at startup

Set DIAGNOSE_ON_START=true to run diagnostics before ingestion starts:

docker run -e DIAGNOSE_ON_START=true ...

This runs all diagnostic checks and then proceeds to normal ingestion regardless of the results. Use this to capture diagnostic output in your logs while still attempting ingestion.

Running diagnostics directly

You can run the main diagnostic script or individual checks:

# Full diagnostics
./diagnostics/diagnose.sh

# Individual checks can be run directly
./diagnostics/checks/docker.sh
./diagnostics/checks/network.sh
./diagnostics/checks/auth-publish.sh

Example output

Mass-ingest Diagnostics
Generated: 2025-01-20 14:32 UTC

=== System ===
[PASS] CPUs: 4
[PASS] Memory: 12.5GB / 16.0GB available
[PASS] Disk (data): 45.2GB / 100.0GB available

=== Required tools ===
[PASS] git: 2.39.3
[PASS] curl: 8.4.0
[PASS] jq: 1.7
[PASS] unzip: 6.00
[PASS] tar: 1.35

=== Runtime environment ===
[PASS] Running inside Docker
       Base image: Ubuntu 24.04.1 LTS
[PASS] Architecture: x86_64 (no emulation detected)

=== Thread and process limits ===
       Java builds use many threads. Low PID/thread limits cause 'pthread_create' errors.
       Expect: unlimited or 8192+ for cgroup PID limit and ulimit.
[PASS] Cgroup PID limit: unlimited (3 currently used)
[PASS] Max user processes (ulimit -u): unlimited
       Kernel threads-max: 127733

=== Java/JDKs ===
[PASS] JAVA_HOME: /opt/java/openjdk
       Detected JDKs (mod config java jdk list):
         21.0.1-tem   $JAVA_HOME     /opt/java/openjdk
         17.0.9-tem   OS directory   /usr/lib/jvm/temurin-17
[PASS] 5 JDK(s) available in /usr/lib/jvm/

=== Moderne CLI ===
[PASS] CLI installed: v3.56.0
       Configuration:
         Trust store: default JVM
         Proxy: not configured
         LST artifacts: Maven (https://artifactory.company.com/moderne)
         Build timeouts: default

=== Configuration ===
[PASS] DATA_DIR: /var/moderne (writable)
[PASS] PUBLISH_URL: https://artifactory.company.com/moderne
[PASS] Publish credentials: PUBLISH_USER/PASSWORD set
       Git credentials:
[PASS] HTTPS credentials: /root/.git-credentials (2 entries)

=== repos.csv ===
[PASS] File: /app/repos.csv (exists)
[PASS] Repositories: 427
[PASS] Required columns: cloneUrl, origin, path (present)
[PASS] Additional column: branch (present)
       Repositories by origin:
         github.com: 412 repos
         gitlab.internal.com: 15 repos
       Sample entries (first 3):
         https://github.com/company/repo-one (main)
         https://github.com/company/repo-two (main)

=== Network ===
[PASS] Maven Central: reachable (45ms)
[PASS] Gradle plugins: reachable (52ms)
[PASS] PUBLISH_URL: reachable (23ms)
[PASS] github.com: reachable (31ms)
[FAIL] gitlab.internal.com: unreachable

=== SSL/Certificates ===
[PASS] artifactory.company.com: SSL OK (expires in 285 days)
[PASS] github.com: SSL OK (expires in 180 days)
[PASS] repo1.maven.org: SSL OK (expires in 340 days)

=== Authentication - Publish ===
[PASS] Write test: succeeded (HTTP 201)
[PASS] Read test: succeeded (HTTP 200)
[PASS] Overwrite test: succeeded (HTTP 201)
[PASS] Delete test: succeeded (HTTP 204)

=== SCM credentials ===
[PASS] .git-credentials: found 2 credential(s)
[PASS] .git-credentials: file is read-only (mode 400)

=== Publish latency ===
       Testing PUBLISH_URL (10 sequential requests)...
       Sequential: min=23ms avg=45ms max=89ms
[PASS] PUBLISH_URL: average latency 45ms
       Testing PUBLISH_URL (3 × 20 concurrent)...
       Parallel batches: 850ms, 820ms, 890ms
[PASS] PUBLISH_URL: parallel throughput 42ms/request

=== Maven repositories ===
       Using: /root/.m2/settings.xml
       Testing central (10 sequential requests)...
       Sequential: min=38ms avg=42ms max=67ms
[PASS] central: average latency 42ms
       Testing central (3 × 20 concurrent)...
       Parallel batches: 920ms, 880ms, 950ms
[PASS] central: parallel throughput 45ms/request
       Testing internal-nexus (via mirror: nexus-mirror) (10 sequential requests)...
       Sequential: min=15ms avg=18ms max=24ms
[PASS] internal-nexus (via mirror: nexus-mirror): average latency 18ms
       Testing internal-nexus (via mirror: nexus-mirror) (3 × 20 concurrent)...
       Parallel batches: 380ms, 350ms, 390ms
[PASS] internal-nexus (via mirror: nexus-mirror): parallel throughput 18ms/request

=== Dependency repositories ===
       Using: ./dependency-repos.csv
       Testing nexus.example.com (10 sequential requests)...
       Sequential: min=19ms avg=23ms max=31ms
[PASS] nexus.example.com: average latency 23ms
       Testing nexus.example.com (3 × 20 concurrent)...
       Parallel batches: 480ms, 450ms, 510ms
[PASS] nexus.example.com: parallel throughput 24ms/request

========================================
RESULT: 1 failure(s), 0 warning(s), 24 passed
========================================
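When the report is captured in logs, the status prefixes make it easy to summarize. The sketch below counts lines by prefix; `summarize_report` is a hypothetical helper, and the `[WARN]` prefix is an assumption inferred from the "warning(s)" count in the result line, since the example above contains no warning lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a diagnostics report on stdin and prints counts per status.
# Only lines that begin with a status prefix are counted; indented detail lines are ignored.
summarize_report() {
  awk '
    /^\[PASS\]/ { pass++ }
    /^\[FAIL\]/ { fail++ }
    /^\[WARN\]/ { warn++ }   # assumed prefix for warnings
    END { printf "pass=%d fail=%d warn=%d\n", pass, fail, warn }
  '
}

# Example: list only the failing checks from a saved report
#   grep '^\[FAIL\]' diagnostics.log
```

Piping a saved report through `summarize_report` gives a one-line summary suitable for alerting or a support ticket.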

Support and documentation

License

This example code is provided as-is for use with Moderne products.
