Dashboard - Job Allocator


What this is

The Longevity.Omics Job Allocator watches the Firestore reports collection for documents with status == "Data Ready". For each one it runs the appropriate bash pipeline (Gen-Decoder, Epi-Insight, or Omni-Health) on this server, captures logs, retries on failure, and reports outcomes back to Firestore. This dashboard is the operator-facing window into all of that.

Architecture (one allocator process, several threads)

Firestore listener — provided by the Firebase SDK. Subscribes to reports where status == "Data Ready"; pushes every change into an internal event queue.
Dispatcher (1 thread) — drains the event queue. For each Data Ready doc, dedup-checks SQLite and inserts a new job_runs row in waiting_cooldown (Gen-Decoder gets a 24h hold) or queued (other services).
Scheduler (1 thread, tick-based) — runs five periodic tasks: heartbeat, cooldown promote, Omni dependency watcher, control-command processor, daily cleanup.
Worker pool (N threads, default 1) — pops jobs off ready_queue, executes the bash subprocess with a wall-clock timeout, classifies the outcome (success / temp_fail / retry / permanent), updates SQLite + Firestore.

Data flow

New Data Ready doc arrives in Firestore → listener emits ADDED → dispatcher inserts SQLite row (waiting_cooldown or queued) → scheduler's cooldown promoter moves cooldown rows to queued when their scheduled_run_at is up → worker pops from ready_queue, runs bash run_*.sh, captures exit code + log → outcome routed (succeeded / failed_retrying / temp_failed / failed_sync / failed_permanent), Firestore updated where appropriate.

Source of truth

SQLite (/data/auto-job/allocator/data/allocator.db) — the allocator's own source of truth. Every job attempt is one row in job_runs. It survives crashes: recovery on the next start rescues rows left in running. Active states (received, waiting_cooldown, queued, running) are never auto-cleaned; terminal states are cleaned after 14 / 30 / 90 days.
Firestore — communicates with the rest of the business: Data Ready means "please process me", Failed means "we tried and gave up", Available means "downstream wrote results, ready to display". Allocator reads Data Ready, writes Failed; downstream writes Available.

Two-step write order matters

Whenever the allocator marks a row terminal, it writes Firestore first, then SQLite. This closes a race window: the listener filter (status == "Data Ready") stops emitting events for the doc the moment Firestore commits, so any mid-flight reconnect/replay is silenced before SQLite changes. Skip this order and the listener can resurrect already-failed rows during the few-millisecond gap.
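A minimal sketch of that ordering, assuming an in-memory dict stands in for Firestore and a job_runs table with hypothetical id, doc_id, and state columns (the real schema and Firestore client calls are not shown in this help text):

```python
import sqlite3

def mark_failed_permanent(db, firestore_docs, run_id, doc_id, log):
    # Step 1: Firestore first. Once status leaves "Data Ready", the listener
    # query no longer matches this doc, so replays stop mentioning it.
    firestore_docs[doc_id]["status"] = "Failed"
    log.append("firestore")
    # Step 2: only now flip the SQLite row to its terminal state.
    db.execute("UPDATE job_runs SET state = 'failed_permanent' WHERE id = ?", (run_id,))
    db.commit()
    log.append("sqlite")

# Demo with stand-ins for both stores.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job_runs (id INTEGER PRIMARY KEY, doc_id TEXT, state TEXT)")
db.execute("INSERT INTO job_runs (doc_id, state) VALUES ('doc-1', 'running')")
firestore_docs = {"doc-1": {"status": "Data Ready"}}
write_log = []
mark_failed_permanent(db, firestore_docs, 1, "doc-1", write_log)
```

Between the two steps the SQLite row is still active, so dedup keeps blocking any replayed ADDED event for the doc.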

Terminology cheat sheet

attempt — one execution try. Numbered from 1. Counts toward MAX_RETRIES.
cooldown — a delay before running. Used by Gen-Decoder (24h grace), retry scheduling (60s), temp-fail rescheduling (1h), and failed_sync auto-retry (24h).
temp-fail — bash returned exit 75 ("data not ready on OSS"). Doesn't consume a retry — pipeline can wait.
replay — Firestore SDK behavior: on reconnect/refresh, it re-emits ADDED for every doc currently in the result set. Dedup logic in the dispatcher exists to handle this.
transient — in the Daily throughput chart, "transient" = failed_retrying + temp_failed (probably resolved themselves). "Failed" = failed_permanent only (needs a human).

The 9 states

Every row in job_runs is in exactly one state. The state determines what happens next (or whether anything will). Click a Jobs-by-state card on the dashboard to see all rows in a given state.

Active (in-flight, never auto-cleaned)

received — dispatcher just inserted the row. Transient (a few ms before transitioning to waiting_cooldown or queued).
waiting_cooldown — has a scheduled_run_at in the future. Four sources:
- Gen-Decoder 24h hold — fresh Data Ready doc, gives the human time to upload data to OSS.
- Retry delay — after a regular failure (60s by default).
- Temp-fail reschedule — after exit 75, retries in 1h.
- Failed-sync auto-retry — after Firestore-side write failure, retries in 24h with attempt reset to 1. Tagged notes='sync_retry'; shows up with light orange background in the cooldown queue.
queued — sitting on the in-memory ready_queue, waiting for a worker. Cooldown promoter atomically moved it from waiting_cooldown when scheduled_run_at passed.
running — worker has a subprocess actively executing the bash script. Subprocess timeout = SUBPROCESS_TIMEOUT_SECONDS (default 2h).
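The atomic waiting_cooldown → queued promotion can be sketched with a guarded UPDATE. This is an illustration only: the id, state, and scheduled_run_at (epoch seconds) columns and the promote_due_rows name are assumptions, not the allocator's actual code.

```python
import sqlite3

def promote_due_rows(db, now):
    """Move due waiting_cooldown rows to queued; return the ids this call promoted."""
    due = [row[0] for row in db.execute(
        "SELECT id FROM job_runs WHERE state = 'waiting_cooldown' "
        "AND scheduled_run_at <= ? ORDER BY id", (now,))]
    promoted = []
    for run_id in due:
        # The state check in WHERE makes the move atomic: if a Skip click or
        # another tick already promoted the row, rowcount is 0 and we skip it.
        cur = db.execute(
            "UPDATE job_runs SET state = 'queued' "
            "WHERE id = ? AND state = 'waiting_cooldown'", (run_id,))
        if cur.rowcount == 1:
            promoted.append(run_id)  # safe to push onto ready_queue exactly once
    db.commit()
    return promoted

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job_runs (id INTEGER PRIMARY KEY, state TEXT, scheduled_run_at INTEGER)")
db.executemany("INSERT INTO job_runs (state, scheduled_run_at) VALUES (?, ?)",
               [("waiting_cooldown", 50), ("waiting_cooldown", 500), ("queued", None)])
first_tick = promote_due_rows(db, now=100)
second_tick = promote_due_rows(db, now=100)
```

Only rows whose UPDATE actually changed a row are enqueued, which is what prevents a double-enqueue when two promoters race.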

Terminal — succeeded

succeeded — bash returned exit 0. Allocator does not write Firestore Available — that's a downstream program's job. Auto-cleaned after 30 days.

Non-terminal failure (will retry)

failed_retrying — bash returned non-zero, retry budget remains. Worker inserted a new row at attempt+1 in waiting_cooldown. Auto-cleaned after 14 days.
temp_failed — bash returned exit 75. Worker inserted a new row at the same attempt (no retry consumed) in waiting_cooldown + 1h. Auto-cleaned after 14 days.

Terminal failure (will not auto-retry)

failed_sync — pipeline finished (or exhausted retries) but allocator could NOT write Firestore Failed for this specific doc (per-doc problem like missing doc / permission / quota; Firestore network is fine). Audit row. Worker also inserts a fresh waiting_cooldown row at attempt=1 with FAILED_SYNC_RETRY_DELAY_SECONDS (default 24h) — so this is "auto-retry after a long pause"; operators can also Skip from the cooldown queue to release immediately.
failed_permanent — final negative outcome. Producers:
- Worker exhausted MAX_RETRIES (default 3) and Firestore Failed write succeeded.
- Dashboard Kill command.
- Dashboard Cancel cooldown command.
- Dashboard Mark failed command.
- Dispatcher withdraw: admin changed Firestore status away from Data Ready (e.g. "Cancelled", "On Hold").
- Recovery: row was running when allocator crashed and the next attempt would exceed MAX_RETRIES.
Auto-cleaned after 90 days.

Listener-replay dedup

Firestore SDK occasionally re-emits ADDED for every doc currently matching the query (reconnect, internal refresh). Without dedup the allocator would create a duplicate run row each time. The dispatcher blocks creating a new row when there's already a row for this doc_id in:

Any active state (received / waiting_cooldown / queued / running)
succeeded — already done, don't redo
failed_retrying — short window inside the worker between marking failed_retrying and inserting the next-attempt row
temp_failed — same window analysis
failed_sync — has its own auto-retry row, don't double up

Not blocked: failed_permanent. This is on purpose — the dashboard's "Reset Firestore → Data Ready" button sets Firestore back to Data Ready and relies on dedup NOT blocking the resulting ADDED, so a fresh attempt 1 row gets created. If the only existing row is failed_permanent, listener replay creates a new attempt — exactly the manual-retry workflow.
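The dedup rule can be sketched as a single predicate. This sketch assumes the check looks at the state of the latest row for the doc_id (the function name and signature are hypothetical):

```python
# States that block a new row for the same doc_id, per the list above.
BLOCKING_STATES = frozenset({
    "received", "waiting_cooldown", "queued", "running",  # active
    "succeeded", "failed_retrying", "temp_failed", "failed_sync",
})

def should_create_row(latest_state):
    """Decide whether an ADDED event may create a new job_runs row.

    latest_state is the state of the newest row for this doc_id,
    or None when the doc has never been seen.
    """
    return latest_state not in BLOCKING_STATES
```

For example, should_create_row("failed_permanent") allows the manual-retry path, while should_create_row("queued") silences a replay.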

Why state changes always go: Firestore → SQLite

For any path that ends a row in failed_permanent, the allocator writes Firestore Failed FIRST. Once Firestore is Failed, the listener stops emitting events for this doc, and SQLite is still running (active, dedup-blocked) so any in-flight replay is harmless. THEN we flip SQLite to terminal. The old code did SQLite first and lost a race window of ~30ms in which listener replay could create a parallel attempt (the "extra retry round" bug).

The only exception is dispatcher withdraw: admin changed Firestore on their own, so allocator just mirrors to SQLite without touching Firestore back.

Header

+ Manual Add Job — adds an order to the Process Queue when its lab data won't arrive automatically (e.g. hospital uses its own lab). See the Manual Add Job section below for the full workflow.
? Help — opens this modal.
⚙ Config — edit runtime tunables (see below).
⟳ Refresh — re-fetches every section. Auto-refresh runs anyway, this is for impatience.
⏻ Restart — issues a restart command. Allocator exits cleanly; systemd respawns within ~10s. In-flight subprocesses are NOT killed immediately: they're given up to 30s to finish, then SIGKILL.

Jobs by state (5 cards)

Click any card to see all jobs in that state in a modal (the Pending card is the exception, as noted below). Click a row in the modal to drill into the drawer.

Running — count of running rows. With MAX_WORKERS=1 this is at most 1.
Queued — queued rows. Should drain quickly unless workers are stuck.
Pending — sum of three things waiting on something: waiting_cooldown (timed hold) + failed_sync (Firestore-write retry hold) + manual_hold (Manual Add Job, awaiting operator release). The card sub-label shows the breakdown. Not clickable — sources span two SQLite tables, so go to the Process Queue panel below to see the rows.
Permanent fails — terminal failures still in retention window (90 days).

System status

Allocator ALIVE/DEAD — based on heartbeat to Firestore (every 30s by default). Stale > 60s = DEAD.
CPU / Memory / Disk — server-level metrics.
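As an illustration of the ALIVE/DEAD rule, here is a small sketch that compares the last heartbeat timestamp with the current time. The function name is hypothetical; only the 60s threshold comes from the text above.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=60)

def allocator_status(last_heartbeat, now):
    """Return "DEAD" when the heartbeat is older than 60s, else "ALIVE"."""
    return "DEAD" if now - last_heartbeat > STALE_AFTER else "ALIVE"

now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = allocator_status(now - timedelta(seconds=30), now)
stale = allocator_status(now - timedelta(seconds=61), now)
```

With the default 30s heartbeat, one missed write is tolerated before the badge flips to DEAD.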

Connectivity

Three on-demand checks, results cached:

Firestore — does a tiny read against the reports collection. Used by the worker too: when a per-doc Firestore write fails, worker calls this to decide whether the whole network is down (→ outage mode) or just the one doc has trouble (→ failed_sync).
Gemini API — does a real generate-content call with the API key from /data/auto-job/pipelines/Omni-Health/.env. Picks the first generateContent-capable model dynamically (avoids hardcoding model names that Google deprecates).
OCR API — pings http://127.0.0.1:18000/health.

Analytics

Four charts. Top-right toggle switches all four between Week (7d) and Month (30d):

Daily throughput — succeeded / transient / failed per day (stacked).
Service distribution — share of recent jobs by service (Gen-Decoder / Epi-Insight / Omni-Health).
Average duration by service — successful runs only.
Top failure reasons — failure_message buckets, top 5.

Click a chart legend to hide/show that series.

Manual Add Job

For orders where the lab data won't arrive automatically (hospital uses its own ordering system or its own non-partner lab). The order already exists in Firestore but its status is whatever the hospital set — NOT "Data Ready". Manual Add tells the dashboard to track it and email the lab team to upload data.

Adding (click + Manual Add Job in the header):

Single mode: paste a Firestore report_id, click Validate. The dashboard reads the doc, refuses if it's already Data Ready or its service isn't allocator-managed, and adds it to the pending list with all fields auto-filled.
CSV mode: upload a CSV with one column report_id (header optional). Each row is validated independently; failures are listed under the upload box and skipped from the pending list.
The extraFiles editor is shown for every row. Optional for Gen-Decoder / Epi-Insight (reserved for future use). Required for Omni-Health, where it lists the dependency report doc_ids (Gen-Decoder / Epi-Insight / Body Test). Each dep is validated lazily on blur and shows ✓+service or ✗+error.
Submit All & Send Email writes the holds and sends ONE aggregated email to the lab team listing every Gen-Decoder / Epi-Insight upload path. Omni-Health holds appear in the email but say (no upload required). Submission is all-or-nothing — if anything fails (e.g. setup_patient_info can't write the directory), the whole batch is rolled back, no email sent.

Behind the scenes when Submit lands: for each (patient_id, report_id, service), the dashboard creates two directory trees and a patient_info.json in each, sourcing patient + report fields from Firestore. The patient_info.json is what downstream pipelines will read once the lab uploads data. Idempotent: re-submitting after a cancel reuses the existing files (does NOT re-fetch from Firestore — if the data changed in Firestore, force a rewrite via the setup_patient_info.py --force-rewrite CLI).

After submit, holds appear in the Process Queue (purple H# rows, no timer). The operator's job from there: wait until the lab uploads. When the lab confirms, click Release on that row. See Process Queue below for per-row actions.

What "Release" does differently from a normal Data Ready: writes Firestore status=Data Ready PLUS a manualHoldRelease=true field. The dispatcher reads that flag and skips Gen-Decoder's 24h cooldown (lab already confirmed upload, no need to wait) and skips the upload-reminder email. Other services unaffected.
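A rough sketch of how the dispatcher's placement decision might look with the manualHoldRelease flag taken into account. The function name and return shape are assumptions made for illustration:

```python
def initial_placement(service, manual_hold_release=False, cooldown_seconds=86400):
    """Return (state, delay_seconds, send_upload_reminder) for a fresh Data Ready doc."""
    if service == "Gen-Decoder" and not manual_hold_release:
        # Normal Gen-Decoder path: 24h hold plus the upload-reminder email.
        return ("waiting_cooldown", cooldown_seconds, True)
    # Released manual holds and all other services go straight to the queue.
    return ("queued", 0, False)
```

For a released Gen-Decoder hold, initial_placement("Gen-Decoder", manual_hold_release=True) skips both the cooldown and the reminder.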

Process Queue

One unified table of work that's blocked on something. Two row sources are interleaved:

Manual hold rows (top, ID prefix H# in purple): orders added via + Manual Add Job that are waiting for the operator to click Release. No timer — they sit indefinitely. See the Manual Add Job section above.
Cooldown rows (below, ID prefix #): timed holds — fresh Gen-Decoder 24h cooldowns, retry delays, sync-retry holds (light orange background) and OCR rerun rows (light blue background).

Per-row actions on cooldown rows:

Skip — atomically promote to queued NOW. For a Gen-Decoder hold this means "data is already on OSS, skip the 24h wait". For a sync-retry hold it means "I fixed the Firestore issue, run it now". Atomic — concurrent cooldown promoter ticks won't double-enqueue.
Cancel — abandon. Marks failed_permanent + writes Firestore Failed. Acts after a 1.5s grace period (defends against accidental clicks racing a worker that just succeeded).
Remind (Gen-Decoder cooldowns only, NOT sync-retry) — re-sends the upload reminder email to the data team. Useful when the human hasn't uploaded yet and you need to nag.

Per-row actions on manual hold rows:

Release — flips the doc's Firestore status to Data Ready (with a manualHoldRelease=true flag). Listener picks it up. Gen-Decoder skips the 24h cooldown (lab confirmed upload, no need to wait). Pre-checks Firestore current status — if a doc is somehow already Data Ready, the release refuses and keeps the hold in place so you can investigate.
Resend email (NOT Omni-Health, which doesn't need lab data) — sends a single-row reminder email to the lab team for this one hold.
Cancel — drops the hold from the Process Queue WITHOUT touching Firestore. The order in Firestore stays as-is (it's still in whatever non-Data-Ready state it was in).
(N deps) link in the Doc cell — only on Omni-Health holds. Click to see each linked report's service / patient / current Firestore status.

Bulk actions on manual holds: tick the checkbox in any hold row's ID cell to enable the toolbar above the table. The select-all checkbox in the header selects every visible hold (cooldown rows are not selectable). Toolbar offers Release all, Cancel all, Clear selection. Best-effort: any hold that fails (e.g. Firestore is already Data Ready) stays in the queue and the others are still processed; failures are summarised in an alert.

Recent jobs

Search box — substring match across doc_id and patient_id (case-insensitive). 300ms debounce. Enter = search now. Esc = clear.
Service / state filters — combine with the search box (AND).
Per-page — 50 / 100 / 200.
Pager — first / prev / page-input / next / last + total count.
Click any row → drawer opens with attempt history + log viewer.

Drawer (per-doc detail)

Opens when you click a row. Shows:

Meta — service / state / attempts so far / patient / latest run id.
Actions — vary by current state:
- Kill (running): writes Firestore Failed + SQLite failed_permanent + SIGTERMs subprocess. 1.5s grace period — if worker finishes first, kill is abandoned.
- Mark failed (active states except running): no SIGTERM; just terminate the row.
- Cancel cooldown (waiting_cooldown): same as the Cancel button on the cooldown queue.
- Reset Firestore → Data Ready (failed_permanent): sets Firestore back to Data Ready; listener will create a fresh attempt 1.
- Regenerate report with edited OCR (Omni-Health, succeeded only): re-runs the Omni-Health pipeline using either the previously-cached OCR job_id or a new one pasted from Quantum Life Medical Report Manager. Use this when OCR data was edited externally and you want to regenerate the Omni report based on the edits.
- ⚠ Mark failed for retry (succeeded only): destructive — overwrites the succeeded SQLite row in place to failed_permanent and writes Firestore Failed. Use this when the pipeline ran fine but the code/config had a bug, and you want to re-run with the fix. Two-step workflow: click this, wait ~3s for it to land, then click Retry which appears in its place. The drawer auto-refreshes after the first step. NOT for "the run actually failed" — that's just Retry on its own when the row is already failed_permanent.
Attempt history — every row for this doc_id, newest first. Click any past attempt to load that attempt's log into the panel below. With per-attempt log files (run.log.<N>) this works precisely; with old single run.log files all attempts share the latest log.
Log panel — dark terminal-style viewer. Auto-scrolls to bottom on load.

Recent control commands

Audit trail of every command issued from the dashboard (kill / cancel / restart / etc), with timestamp and result. Useful when something seems weird and you want to know what was clicked recently.

Notifications log

Every email send attempt. Status = ok / failed / skipped (toggle off / no recipients). Lets you confirm an email was actually sent vs got dropped.

⚙ Config (runtime tunables)

Editable without restart unless marked. Click Save to apply; if any restart-required key changed, you get prompted to restart now.

MAX_WORKERS ⚠ restart required — concurrent subprocess workers. Requires SANDBOX_ENABLED=1 to use values > 1 safely; allocator forces it back to 1 if sandbox is off.
SANDBOX_ENABLED ⚠ restart required — master switch for the per-run bubblewrap sandbox. 1 = isolated (safe for concurrent workers). 0 = legacy direct-to-disk (only safe with MAX_WORKERS=1). See the Sandbox & concurrency help tab.
MAX_RETRIES — retries before failed_permanent (default 3).
RETRY_DELAY_SECONDS — wait between regular retries (default 60).
TEMP_FAIL_RETRY_DELAY_SECONDS — wait after exit 75 (default 3600).
FAILED_SYNC_RETRY_DELAY_SECONDS — wait after a Firestore-write-failed (default 86400).
SUBPROCESS_TIMEOUT_SECONDS — max bash runtime (default 7200).
COOLDOWN_GEN_DECODER_SECONDS — fresh Gen-Decoder cooldown (default 86400).
COOLDOWN_CHECK_INTERVAL_SECONDS — how often to scan due cooldowns.
OMNI_DEPENDENCY_CHECK_INTERVAL_SECONDS — how often to scan In-Progress Omni docs for dependencies-Available promotion.
HEARTBEAT_INTERVAL_SECONDS — Firestore heartbeat write cadence.
EMAIL_RECIPIENTS_UPLOAD_REMINDER — comma-separated.
EMAIL_RECIPIENTS_PERMANENT_FAILURE — comma-separated. Also receives outage alerts.
NOTIFY_ON_UPLOAD_REMINDER / NOTIFY_ON_PERMANENT_FAILURE — 0 or 1.
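The way saved tunables overlay the defaults, and the way MAX_WORKERS is forced back to 1 without the sandbox, can be sketched like this. The helper names are hypothetical, and the SANDBOX_ENABLED default of 1 is an assumption; the other defaults are the ones listed above.

```python
DEFAULTS = {
    "MAX_WORKERS": 1,
    "SANDBOX_ENABLED": 1,  # assumed default; the help text does not state it
    "MAX_RETRIES": 3,
    "RETRY_DELAY_SECONDS": 60,
    "TEMP_FAIL_RETRY_DELAY_SECONDS": 3600,
    "FAILED_SYNC_RETRY_DELAY_SECONDS": 86400,
    "SUBPROCESS_TIMEOUT_SECONDS": 7200,
    "COOLDOWN_GEN_DECODER_SECONDS": 86400,
}

def load_tunables(saved):
    """Overlay saved values (strings or ints) on the defaults; ignore unknown keys."""
    merged = dict(DEFAULTS)
    for key, value in saved.items():
        if key in DEFAULTS:
            merged[key] = int(value)
    return merged

def effective_max_workers(tunables):
    """More than one worker is only honoured when the sandbox is on."""
    if tunables["SANDBOX_ENABLED"] != 1:
        return 1
    return max(1, tunables["MAX_WORKERS"])
```

So a saved MAX_WORKERS=4 only takes effect when SANDBOX_ENABLED=1.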

1. Pipeline returns non-zero (regular failure)

Worker creates a new waiting_cooldown row at attempt+1 with delay = RETRY_DELAY_SECONDS. Cooldown promoter picks it up after the delay, worker runs again. Repeat up to MAX_RETRIES.

After MAX_RETRIES attempts all fail, the last row goes failed_permanent, Firestore Failed gets written, permanent-failure email goes out.

2. Pipeline returns exit 75 (data not ready on OSS)

This is the Gen-Decoder convention with the bash team: "I checked OSS and the input files aren't there yet". Worker creates a new waiting_cooldown row at the SAME attempt number (no retry consumed) with delay = TEMP_FAIL_RETRY_DELAY_SECONDS (default 1h). The doc stays in this loop until either:

data finally arrives → next attempt succeeds → succeeded
operator gives up → Cancel from cooldown queue → failed_permanent
admin changes Firestore status away from Data Ready → withdraw → failed_permanent

There is no upper bound on temp_failed loops — a doc whose data never arrives will retry forever. If this is a problem, cancel from dashboard or fix the Firestore state.
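Items 1 and 2 together amount to a small decision table on the exit code and attempt number. A sketch, with a hypothetical function name and return shape, and without timeout handling (which this text does not describe):

```python
def next_step(exit_code, attempt, max_retries=3, retry_delay=60, temp_fail_delay=3600):
    """Map a finished run to its row state and the follow-up row, if any."""
    if exit_code == 0:
        return {"state": "succeeded", "next_attempt": None, "delay": None}
    if exit_code == 75:
        # Data not on OSS yet: same attempt number, no retry consumed.
        return {"state": "temp_failed", "next_attempt": attempt, "delay": temp_fail_delay}
    if attempt < max_retries:
        return {"state": "failed_retrying", "next_attempt": attempt + 1, "delay": retry_delay}
    # Retry budget exhausted: terminal, no follow-up row.
    return {"state": "failed_permanent", "next_attempt": None, "delay": None}
```

Because exit 75 never increments the attempt number, a doc can stay in the temp_failed loop indefinitely, as noted above.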

3. Pipeline failed but Firestore won't accept the Failed write

The case in question: the pipeline failed and used up its retries (so the row would normally go failed_permanent), BUT the Firestore Failed write itself errors. A successful pipeline never writes Failed, so this path only applies to failures. Worker calls check_connectivity to disambiguate:

Firestore is up overall, this one doc has trouble (missing doc / permission / quota): row goes failed_sync with [FIRESTORE_SYNC_FAILED] in error_message; worker also inserts a fresh waiting_cooldown row at attempt=1 with FAILED_SYNC_RETRY_DELAY_SECONDS (default 24h). Permanent-failure email goes out. After 24h the cooldown promoter retries the whole pipeline; you can also Skip on the cooldown queue once you've fixed the Firestore-side issue.
Firestore is unreachable: worker sets the global outage event. See item 4.
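The disambiguation can be sketched as follows, assuming check_connectivity is passed in as a callable that returns a truthy value when Firestore is reachable (the routing function and its return shape are hypothetical):

```python
def route_failed_write(check_connectivity, sync_retry_delay=86400):
    """Decide what to do after a per-doc Firestore Failed write errored."""
    try:
        reachable = bool(check_connectivity())
    except Exception:
        # A failing connectivity check is treated as "network is down".
        reachable = False
    if not reachable:
        # Outage mode: leave the row untouched and raise the global event.
        return {"outage": True, "row_state": None, "retry_row": None}
    return {
        "outage": False,
        "row_state": "failed_sync",
        "retry_row": {
            "state": "waiting_cooldown",
            "attempt": 1,
            "delay": sync_retry_delay,
            "notes": "sync_retry",
        },
    }
```

The retry row restarts at attempt 1, matching the "auto-retry after a long pause" behaviour of failed_sync.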

4. Firestore network outage

If check_connectivity itself fails, the allocator concludes the network is down (vs a per-doc issue) and goes into outage mode:

Worker sets _firestore_outage_event.
Main thread detects it, sends a system-alert email (best effort — Aliyun mail might also be down), unsubscribes the listener, joins threads.
Sleeps 5 minutes. This is deliberate — exiting immediately would have systemd respawn in 10s and we'd hammer-loop. With a 5-minute sleep the effective retry interval is ~5 minutes.
Exits with code 1. systemd's Restart=always respawns.
On respawn, allocator's first action is check_connectivity again. If it passes, normal startup. If not, repeat the sleep+exit cycle.

In-flight rows are not touched in outage mode — they stay in running in SQLite. Recovery on the next successful start handles them like any other crashed-while-running row.

5. Allocator process crashes mid-run

SIGKILL, OOM, server reboot, etc. SQLite rows in running are stranded. On next start, recovery scans them and:

If attempt < MAX_RETRIES: original row → failed_retrying, new queued row inserted at attempt+1.
If attempt >= MAX_RETRIES: row marked failed_permanent + Firestore Failed written + permanent-failure email.
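A sketch of the recovery scan under the two rules above. The schema and function names are assumptions, and the permanent-failure email is left out; write_firestore_failed is a stand-in callback, invoked before the SQLite update to follow the Firestore-first order.

```python
import sqlite3

def recover_running_rows(db, write_firestore_failed, max_retries=3):
    """Rescue rows stranded in running; return how many were handled."""
    rows = db.execute(
        "SELECT id, doc_id, attempt FROM job_runs WHERE state = 'running' ORDER BY id"
    ).fetchall()
    for run_id, doc_id, attempt in rows:
        if attempt < max_retries:
            db.execute("UPDATE job_runs SET state = 'failed_retrying' WHERE id = ?", (run_id,))
            db.execute(
                "INSERT INTO job_runs (doc_id, attempt, state) VALUES (?, ?, 'queued')",
                (doc_id, attempt + 1))
        else:
            write_firestore_failed(doc_id)  # Firestore first, then SQLite
            db.execute("UPDATE job_runs SET state = 'failed_permanent' WHERE id = ?", (run_id,))
    db.commit()
    return len(rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job_runs (id INTEGER PRIMARY KEY, doc_id TEXT, attempt INTEGER, state TEXT)")
db.executemany("INSERT INTO job_runs (doc_id, attempt, state) VALUES (?, ?, 'running')",
               [("doc-a", 1), ("doc-b", 3)])
firestore_failed = []
handled = recover_running_rows(db, firestore_failed.append)
final_rows = db.execute("SELECT doc_id, attempt, state FROM job_runs ORDER BY id").fetchall()
```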

6. Listener replay (Firestore SDK reconnect)

The SDK silently re-emits ADDED for every doc in the result set every few minutes. Dedup in the dispatcher (see State machine tab) prevents duplicate work — a row in active / succeeded / failed_retrying / temp_failed / failed_sync blocks new inserts. Only failed_permanent passes through dedup, and that's only when the operator resets Firestore to Data Ready (the manual-retry path).

7. Admin changes Firestore status while a doc is mid-flight

Listener filter status == "Data Ready" stops matching the doc; SDK emits REMOVED. Dispatcher handles REMOVED by withdrawing the SQLite row:

If row is queued / waiting_cooldown / received: marked failed_permanent with "Withdrawn from Firestore". Firestore is NOT touched (admin already chose the new status).
If row is running: NOT touched. Letting the bash finish is safer than yanking it mid-write. The new Firestore status will only matter if the run fails (allocator would then write Failed, overwriting admin's choice — narrow race window).
If row is terminal: no-op.
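The three REMOVED cases reduce to a lookup on the row's current state. A sketch with hypothetical action names:

```python
WITHDRAWABLE = {"received", "waiting_cooldown", "queued"}

def withdraw_action(state):
    """Pick the dispatcher's reaction to REMOVED for a row in the given state."""
    if state in WITHDRAWABLE:
        return "mark_failed_permanent"  # SQLite only; Firestore is not touched
    if state == "running":
        return "leave_running"          # let the bash script finish
    return "no_op"                      # terminal rows are left alone
```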

8. Dashboard kill races a worker that just succeeded

The kill command sleeps 1.5s before acting (grace period — clicks are human-paced). After the sleep it re-reads SQLite. If state moved to terminal during the sleep, kill is abandoned with a log message. Otherwise: write Firestore Failed → write SQLite failed_permanent (guarded by WHERE state IN active_states, so if the worker raced past, the UPDATE is a no-op and we log a warning) → SIGTERM the subprocess.
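The kill sequence can be sketched as below. The schema, the callbacks, and the return labels are assumptions; in particular, this sketch only sends SIGTERM when the guarded UPDATE actually changed the row, which the text above does not spell out.

```python
import sqlite3
import time

ACTIVE_STATES = ("received", "waiting_cooldown", "queued", "running")

def kill_run(db, run_id, write_firestore_failed, send_sigterm, grace=1.5, sleep=time.sleep):
    sleep(grace)  # grace period: give a nearly finished worker time to land
    row = db.execute("SELECT doc_id, state FROM job_runs WHERE id = ?", (run_id,)).fetchone()
    if row is None or row[1] not in ACTIVE_STATES:
        return "abandoned"  # the worker finished during the grace period
    write_firestore_failed(row[0])  # Firestore first
    cur = db.execute(
        "UPDATE job_runs SET state = 'failed_permanent' "
        "WHERE id = ? AND state IN (?, ?, ?, ?)", (run_id, *ACTIVE_STATES))
    db.commit()
    if cur.rowcount == 0:
        return "raced"  # worker moved the row after our re-read; UPDATE was a no-op
    send_sigterm(run_id)
    return "killed"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job_runs (id INTEGER PRIMARY KEY, doc_id TEXT, state TEXT)")
db.executemany("INSERT INTO job_runs (doc_id, state) VALUES (?, ?)",
               [("doc-a", "running"), ("doc-b", "succeeded")])
failed_docs, signalled = [], []
no_sleep = lambda seconds: None  # skip the real 1.5s wait in the demo
result_running = kill_run(db, 1, failed_docs.append, signalled.append, sleep=no_sleep)
result_finished = kill_run(db, 2, failed_docs.append, signalled.append, sleep=no_sleep)
```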

9. Daily 03:00 cleanup

Once per day at server-local 03:00, scheduler runs cleanup_old_rows: deletes terminal rows older than their retention (succeeded 30d / failed_retrying 14d / temp_failed 14d / failed_permanent 90d), prunes ocr_cache (30d) and notifications_log (60d). Active rows are NEVER cleaned up regardless of age.
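A sketch of the job_runs part of the cleanup, using the retention periods listed above. The finished_at column (epoch seconds) is an assumption, and ocr_cache / notifications_log pruning is omitted:

```python
import sqlite3

DAY = 86400
RETENTION_DAYS = {
    "succeeded": 30,
    "failed_retrying": 14,
    "temp_failed": 14,
    "failed_permanent": 90,
}

def cleanup_old_rows(db, now):
    """Delete terminal rows past their retention; active states are never listed."""
    deleted = 0
    for state, days in RETENTION_DAYS.items():
        cur = db.execute(
            "DELETE FROM job_runs WHERE state = ? AND finished_at < ?",
            (state, now - days * DAY))
        deleted += cur.rowcount
    db.commit()
    return deleted

now = 1000 * DAY
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job_runs (id INTEGER PRIMARY KEY, state TEXT, finished_at INTEGER)")
db.executemany("INSERT INTO job_runs (state, finished_at) VALUES (?, ?)", [
    ("succeeded", now - 60 * DAY),         # past 30d: deleted
    ("succeeded", now - 10 * DAY),         # kept
    ("failed_permanent", now - 60 * DAY),  # within 90d: kept
    ("temp_failed", now - 20 * DAY),       # past 14d: deleted
    ("running", now - 400 * DAY),          # active: never cleaned
])
deleted_count = cleanup_old_rows(db, now)
remaining = [r[0] for r in db.execute("SELECT id FROM job_runs ORDER BY id")]
```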

10. Operator-initiated re-run of a succeeded job

Use case: pipeline ran to completion, but afterwards the team realized there was a bug in the pipeline code or a bad config. Want to re-run with the fix without manually re-creating the order in Firestore.

Workflow (in the drawer for the succeeded doc):

Click ⚠ Mark failed for retry. Confirmation dialog explains the trade-offs. After ~3s the row is forcibly transitioned succeeded → failed_permanent (overwriting the succeeded row in place — this is destructive to audit history) and Firestore is set to Failed.
Drawer auto-refreshes; Retry button appears. Click it.
Retry resets Firestore back to Data Ready. Listener fires, dispatcher's stage-1 dedup sees the latest row is failed_permanent (which is the explicit signal "operator wants a retry") and creates a fresh attempt 1. Worker picks it up and runs with the current code.

Why two clicks instead of one: the /api/jobs/.../retry endpoint requires Firestore status=Failed (it's the existing manual-retry path for genuinely-failed jobs). Mark-failed-for-retry is the prep step that gets the system into that state. Combining them into one button would hide the destructive nature of overwriting succeeded → failed_permanent; the two-step click is a deliberate friction.

Caveat: between step 1 and step 2, the doc shows status=Failed in Firestore and any downstream consumer (patient app / clinical UI) will display it as failed. Keep the gap short.

Not for: jobs that genuinely failed (Retry alone handles those, no Mark-failed-for-retry needed). Not for: Omni-Health re-runs that just need OCR re-edited (use Regenerate report with edited OCR instead — it doesn't touch Firestore status).

Why this exists

Pipeline code from teammates can write intermediate files inside their own repo directories — sometimes without including a patient ID in the filename. With one worker that's fine. With two or more workers running at the same time, two concurrent jobs of the same service would overwrite each other's intermediates, producing wrong or corrupted reports.

The sandbox solves this by giving every job a private filesystem view: a per-run scratch directory that the pipeline writes into, with the real patient base copied in at the start and produced outputs copied back out at the end. The rest of the filesystem is mounted read-only, so any rogue write outside the planned area fails fast with EROFS instead of silently colliding with another worker.

How a sandboxed run looks (the "data courier" model)

Allocator mints a fresh scratch dir at /var/tmp/lomi-scratch/run-<run_id>/ with three empty subdirs: base/, user-data/, repo-writes/.
Allocator copies the pipeline's inputs (patient_info, raw data, CSVs, dep outputs from other services, clinical PDFs) from the real patient base into scratch/base/ and scratch/user-data/.
Allocator launches the bash script via bwrap (bubblewrap) with: root filesystem read-only; scratch read-write; /tmp as a fresh per-run tmpfs; ~/.cache rw-bound to the host's real cache directory (so R sesame / BiocFileCache / HuggingFace / pip caches stay visible AND writable, since SQLite-backed caches like BiocFileCache need to update access metadata even on read); and the per-attempt run.log single-file-bound to the real on-disk log so the dashboard can still tail it live.
Inside the sandbox the bash script sees the env vars LOMI_BASE_DIR=<scratch>/base and LOMI_USER_DIR=<scratch>/user-data. It reads inputs and writes outputs to these paths.
If exit 0: allocator copies the produced outputs out of scratch/ back to the real patient base and the real user-data dir.
Either way (success or failure): the scratch dir is rm -rf'd.
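To make the mount layout concrete, here is a sketch that only builds a bwrap argument list matching the description above; it does not run anything. The function name, parameter names, and example paths are hypothetical, and the allocator's real flag set may differ. The options used (--ro-bind, --bind, --tmpfs, --dev, --proc, --setenv) are standard bubblewrap options.

```python
def build_bwrap_argv(scratch, cache_dir, run_log, script):
    """Assemble a bwrap command line for one sandboxed run (illustrative only)."""
    return [
        "bwrap",
        "--ro-bind", "/", "/",               # whole root filesystem read-only
        "--dev", "/dev",
        "--proc", "/proc",
        "--tmpfs", "/tmp",                   # fresh per-run /tmp
        "--bind", scratch, scratch,          # scratch dir stays writable
        "--bind", cache_dir, cache_dir,      # shared caches visible AND writable
        "--bind", run_log, run_log,          # single-file bind for live log tailing
        "--setenv", "LOMI_BASE_DIR", f"{scratch}/base",
        "--setenv", "LOMI_USER_DIR", f"{scratch}/user-data",
        "bash", script,
    ]

argv = build_bwrap_argv(
    scratch="/var/tmp/lomi-scratch/run-42",
    cache_dir="/home/operator/.cache",       # hypothetical home directory
    run_log="/data/example/run.log.1",       # hypothetical log path
    script="/data/example/run_pipeline.sh",  # hypothetical script path
)
```

Later bind options override earlier ones for the same path, which is why the read-only root comes first and the writable binds follow.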

What changed in the bash scripts

The three pipeline runner scripts now read LOMI_BASE_DIR / LOMI_USER_DIR / LOMI_MINICONDA_BASE with sensible defaults:

BASE_DIR="${LOMI_BASE_DIR:-/data/ql-patient-base/Patients}"
USER_DIR="${LOMI_USER_DIR:-/data/ql-user-data/Patients}"
CONDA_BASE="${LOMI_MINICONDA_BASE:-$HOME/miniconda3}"

Inside a sandbox, the allocator sets the first two to point at the scratch dir; outside the sandbox they fall back to the real paths (matching the pre-sandbox behaviour exactly). The conda one is independent of the sandbox — it just removes the hardcoded /home/lingzhen/miniconda3 so any operator in the auto-job group can run the system on their own home dir.

The toggle (turn it off in an emergency)

The sandbox can be disabled at runtime from the Config modal: find the SANDBOX_ENABLED tunable, set it to 0, save, then restart the allocator (the Sysadmin tab has a ready-to-copy restart command). When disabled, every job runs on the legacy direct-to-disk path — byte-for-byte identical to how the system ran before the sandbox was added. In that mode the allocator also caps MAX_WORKERS at 1 automatically (data corruption with multiple un-sandboxed workers is worse than slow throughput).

Alternative: set the environment variable LOMI_SANDBOX_ENABLED=0 in the systemd unit file. The dashboard toggle takes precedence on every restart, so the env var only matters as a startup default; if both are set, the dashboard value wins. Use the dashboard toggle for "I want to flip this now", use the env var for "this host can never run bwrap".

Whitelist maintenance (what to do when a new pipeline is added)

Most pipeline writes go into the patient base, which is already handled by the data-courier flow. But some teammates' code writes into the pipeline repo itself — for example epi-insight-auto-pipeline-v2/output/debug_report.html is written without a patient ID qualifier. Those writes can't be allowed into the read-only root, so they get redirected into the scratch dir via an explicit whitelist in config.SANDBOX_REPO_WRITE_WHITELIST.

When you add a new pipeline (or notice an existing pipeline starts failing with EROFS or Read-only file system in the run.log), here's the procedure:

Reproduce the failure: kick off a run of the affected service, wait for it to fail, open the run.log from the dashboard.
Look for the error line. Two typical patterns:
- OSError: [Errno 30] Read-only file system: '/data/auto-job/pipelines/<repo>/<path>'
- open(...) failed: Read-only file system
If the path is repo-internal (under /data/auto-job/pipelines/<repo>/) and the write is genuinely necessary (don't whitelist a write that should have gone to patient base — fix the pipeline code for that), add a tuple to SANDBOX_REPO_WRITE_WHITELIST in config.py:
```
SANDBOX_REPO_WRITE_WHITELIST: list[tuple[str, str]] = [
    ...
    ("<repo_name>", "<path_relative_to_repo>"),
]
```
For a directory entry, pass the directory path ("output"). For a single file entry, pass the file path ("style.css"). The allocator handles both correctly.
Restart the allocator. Verify the run now completes.
For repo-internal writes the allocator did NOT cause — for example a python __pycache__ — the right fix is usually an environment variable, not a whitelist entry. For __pycache__ we set PYTHONDONTWRITEBYTECODE=1 by default; analogous patterns exist for R, Java, etc. Prefer env-var solutions when possible — they're cleaner than per-file binds.

Troubleshooting

Allocator log says "sandbox off (bwrap not found in PATH)" — bwrap isn't installed. On Alibaba Cloud Linux 3 / RHEL / CentOS: sudo yum install bubblewrap. On Ubuntu / Debian: sudo apt install bubblewrap. Restart allocator.
Sandbox available but a specific job fails with EROFS — new repo-internal write path. Follow the whitelist procedure above.
Job succeeds inside bwrap but allocator logs "Harvest failed" — the post-run copy from scratch back to patient base ran into a filesystem error (disk full, perms changed, etc). The job is reported as FAILED even though exit was 0; the scratch has been wiped. Check disk space, fix the perms, retry.
Real-time run.log tail in dashboard shows nothing — the single-file bind didn't take. Check the allocator log for "bwrap" related errors. As a quick test, set SANDBOX_ENABLED=0 in config, restart, and verify the log appears in the legacy path; if yes, sandbox is the culprit.
Orphan scratch dirs piling up under /var/tmp/lomi-scratch — the allocator does an orphan sweep at startup based on which run IDs are still in flight. If you see scratch dirs whose run IDs are NOT in the dashboard, that's a sign the allocator was killed hard (e.g. kill -9) and then didn't restart cleanly. They'll be swept on next restart; or you can safely rm -rf them by hand if disk pressure is tight.
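The startup orphan sweep might look roughly like this, assuming scratch dirs are named run-<run_id> and that the set of in-flight run IDs is available as strings (the function name is hypothetical). The demo runs against a temporary directory rather than /var/tmp/lomi-scratch.

```python
import shutil
import tempfile
from pathlib import Path

def sweep_orphans(scratch_root, in_flight_run_ids):
    """Remove run-<id> dirs whose id is not in flight; return the removed names."""
    removed = []
    for child in sorted(Path(scratch_root).glob("run-*")):
        run_id = child.name[len("run-"):]
        if child.is_dir() and run_id not in in_flight_run_ids:
            shutil.rmtree(child)
            removed.append(child.name)
    return removed

with tempfile.TemporaryDirectory() as root:
    for name in ("run-7", "run-8", "notes"):
        (Path(root) / name).mkdir()
    removed = sweep_orphans(root, in_flight_run_ids={"7"})
    remaining = sorted(p.name for p in Path(root).iterdir())
```

Directories that do not match the run-* pattern are left alone, so unrelated files under the scratch root are not touched.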

History — why bubblewrap (and not our previous overlay setup)

The first sandbox implementation used unshare + per-repo overlayfs mounts. It worked in PoC on a clean ext4 filesystem, but failed in production with EOVERFLOW during overlay copy_up of root-owned directories on this kernel (Linux 5.10 + Alibaba Cloud Linux 3 + unprivileged user namespace). After eight unsuccessful workaround attempts — each pinned to a different hypothesis (32-bit stat, xino, ext4 64bit, owner mapping, sub-uid, etc.) — we switched to bubblewrap, which side-steps overlayfs entirely. The PoC suite for bwrap passed 27/27 isolation and transparency checks on the same kernel.

This feature batch-reruns jobs: it first applies Mark failed for retry to the selected jobs, then applies Retry to them. It is intended for use after a pipeline bug fix or update. Before continuing, check that the pipeline scripts have been updated to the required version.

If you are authorized and want to continue, select the jobs below and click Continue. Do not close or refresh the page while the batch is running.

	Doc	Service	State	Patient PID	Status
loading…

0 selected

Edit numeric tunables. Changes are saved to SQLite and read by the allocator on its next periodic tick (or after restart for items marked restart-required). Static config (paths, URLs) must be edited in config.py.

Key	Default	Current	Restart?	Last updated
loading…
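The read path for these tunables could look roughly like the sketch below: stored values override built-in defaults each time the allocator re-reads them. The table name, column names, and key names are hypothetical; only the default values (one worker, a 24h cooldown) come from the description elsewhere on this page.

```python
import sqlite3

# Hypothetical key names; defaults taken from the help text.
DEFAULTS: dict[str, float] = {"worker_count": 1, "cooldown_hours": 24}


def load_tunables(conn: sqlite3.Connection, defaults: dict[str, float]) -> dict[str, float]:
    """Merge stored overrides over the defaults; unknown keys are ignored."""
    values = dict(defaults)
    for key, value in conn.execute("SELECT key, value FROM tunables"):
        if key in values:
            values[key] = float(value)
    return values


def demo() -> dict[str, float]:
    """Build an in-memory table with one override and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tunables (key TEXT PRIMARY KEY, value REAL NOT NULL)")
    conn.execute("INSERT INTO tunables VALUES (?, ?)", ("cooldown_hours", 12))
    try:
        return load_tunables(conn, DEFAULTS)
    finally:
        conn.close()
```

Calling load_tunables on every scheduler tick is what would make an edit take effect without a restart; a restart-required key would instead be read once at startup.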

Confirm to force update — repo version? This discards all local changes in the repo (staged, unstaged, and tracked-but-modified files), and the operation cannot be undone. If you are authorized and want to continue, choose the target branch below and click Hard Pull.

Target branch
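For orientation, a hard pull of this kind is commonly implemented with two git commands: fetch the branch, then force-checkout it at the remote's commit. The sketch below only builds the command lines and does not run them. It is an assumption about what Hard Pull does, not a description of the allocator's implementation; the function name and the default remote name are hypothetical.

```python
def hard_pull_commands(repo_dir: str, branch: str, remote: str = "origin") -> list[list[str]]:
    """Return git commands that move repo_dir to remote/branch, discarding local changes."""
    if not branch or branch.startswith("-"):
        # Refuse values that git could misread as an option.
        raise ValueError(f"invalid branch name: {branch!r}")
    return [
        # Download the latest commits for the target branch.
        ["git", "-C", repo_dir, "fetch", remote, branch],
        # Switch to the branch, reset it to the remote commit, and
        # throw away staged and unstaged changes to tracked files.
        ["git", "-C", repo_dir, "checkout", "--force", "-B", branch, f"{remote}/{branch}"],
    ]
```

Each inner list can be passed to subprocess.run(cmd, check=True). Untracked files are left alone by these commands, which matches the list of discarded changes in the confirmation text.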

Job Allocator Dashboard

Jobs Status

System Status

Connectivity

Pipeline Repositories

Analytics

Daily throughput (7 days)

Service distribution (7 days)

Average duration by service (7 days, successful only)

Top failure reasons (7 days)

Process Queue

Recent Jobs

Recent Control Commands

Notifications Records

—

Actions

Attempt history (click a row to load that attempt's log)

Jobs

Regenerate report with edited OCR

Manual Add Job

Pending list (0)

Batch Rerun

Runtime configuration

Analytics