Report OCR API (localhost:18000)

| Run | Doc | Service | State | Attempt | Patient | pId | Collection date | Received | Duration |
|---|---|---|---|---|---|---|---|---|---|
| loading… | |||||||||
The Longevity.Omics Job Allocator watches the Firestore
reports collection for documents with
status == "Data Ready". For each one it runs the
appropriate bash pipeline (Gen-Decoder, Epi-Insight, or
Omni-Health) on this server, captures logs, retries on failure,
and reports outcomes back to Firestore. This dashboard is the
operator-facing window into all of that.
Listener: watches reports where status == "Data Ready"; pushes every change into an internal event queue.
Dispatcher: inserts a job_runs row in waiting_cooldown (Gen-Decoder gets a 24h hold) or queued (other services).
New Data Ready doc arrives in Firestore →
listener emits ADDED →
dispatcher inserts SQLite row (waiting_cooldown or queued) →
scheduler's cooldown promoter moves cooldown rows to queued when
their scheduled_run_at is up →
worker pops from ready_queue, runs bash run_*.sh,
captures exit code + log →
outcome routed (succeeded / failed_retrying / temp_failed /
failed_sync / failed_permanent), Firestore updated where
appropriate.
SQLite (/data/auto-job/allocator/data/allocator.db): the allocator's own source of truth. Every job attempt is one row in job_runs. Survives crashes: recovery on the next start rescues rows left in running. Active states (received, waiting_cooldown, queued, running) are never auto-cleaned; terminal states are cleaned after 14 / 30 / 90 days.
Firestore status: Data Ready means "please process me", Failed means "we tried and gave up", Available means "downstream wrote results, ready to display". The allocator reads Data Ready and writes Failed; downstream writes Available.
Whenever the allocator marks a row terminal, it writes
Firestore first, then SQLite. This closes a race window:
the listener filter (status == "Data Ready") stops
emitting events for the doc the moment Firestore commits, so any
mid-flight reconnect/replay is silenced before SQLite changes.
Skip this order and the listener can resurrect already-failed
rows during the few-millisecond gap.
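The ordering rule above can be sketched as follows. This is a minimal illustration, not the allocator's actual code: `mark_failed_permanent` and the two writer callables are hypothetical stand-ins for the real Firestore and SQLite writes, and simply returning False on a Firestore error is an assumption made to keep the example short.

```python
from typing import Callable


def mark_failed_permanent(
    doc_id: str,
    write_firestore_failed: Callable[[str], None],  # hypothetical Firestore writer
    write_sqlite_terminal: Callable[[str], None],   # hypothetical SQLite writer
) -> bool:
    """Write Firestore first, then SQLite. Return True if both writes ran."""
    try:
        # Step 1: once this commits, the listener filter stops matching the doc.
        write_firestore_failed(doc_id)
    except Exception:
        # Firestore did not commit, so the SQLite row is left active and
        # dedup keeps blocking any replayed ADDED event.
        return False
    # Step 2: only now is the SQLite row flipped to a terminal state.
    write_sqlite_terminal(doc_id)
    return True


calls = []
ok = mark_failed_permanent(
    "doc-1",
    lambda d: calls.append(("firestore", d)),
    lambda d: calls.append(("sqlite", d)),
)
```

The point of the sketch is only the order of the two calls: SQLite is never touched unless the Firestore write has already succeeded.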
failed_retrying + temp_failed (probably resolved themselves). "Failed" = failed_permanent only (needs a human).
Every row in job_runs is in exactly one state. The
state determines what happens next (or whether anything will).
Click a Jobs-by-state card on the dashboard to see all rows in a
given state.
scheduled_run_at in the future. Three sources:
notes='sync_retry'; shows up with light orange background in the cooldown queue.
ready_queue, waiting for a worker. Cooldown promoter atomically moved it from waiting_cooldown when scheduled_run_at passed.
waiting_cooldown. Auto-cleaned after 14 days.
waiting_cooldown + 1h. Auto-cleaned after 14 days.
waiting_cooldown row at attempt=1 with FAILED_SYNC_RETRY_DELAY_SECONDS (default 24h), so this is "auto-retry after a long pause"; operators can also Skip from the cooldown queue to release immediately.
The Firestore SDK occasionally re-emits ADDED for every doc currently matching the query (reconnect, internal refresh). Without dedup the allocator would create a duplicate run row each time. The dispatcher blocks creating a new row when there's already a row for this doc_id in:
succeeded: already done, don't redo
failed_retrying: short window inside the worker between marking failed_retrying and inserting the next-attempt row
temp_failed: same window analysis
failed_sync: has its own auto-retry row, don't double up
Not blocked: failed_permanent. This is on
purpose — the dashboard's "Reset Firestore → Data Ready" button
sets Firestore back to Data Ready and relies on dedup NOT
blocking the resulting ADDED, so a fresh attempt 1 row gets
created. If the only existing row is failed_permanent, listener
replay creates a new attempt — exactly the manual-retry workflow.
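A minimal sketch of the dedup rule described above. The state names come from this page; the function name `should_create_row` and the idea of passing in the list of existing states for a doc_id are assumptions for illustration.

```python
# Active states as listed on this page.
ACTIVE_STATES = {"received", "waiting_cooldown", "queued", "running"}

# States that block a new row for the same doc_id.
# failed_permanent is deliberately absent (manual-retry path).
DEDUP_BLOCKING_STATES = ACTIVE_STATES | {
    "succeeded",
    "failed_retrying",
    "temp_failed",
    "failed_sync",
}


def should_create_row(existing_states: list[str]) -> bool:
    """Return True when an ADDED event may create a fresh attempt-1 row."""
    return not any(state in DEDUP_BLOCKING_STATES for state in existing_states)
```

With this shape, a doc whose only rows are failed_permanent passes through, while a single active or succeeded row is enough to block the replay.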
For any path that ends a row in failed_permanent,
the allocator writes Firestore Failed FIRST. Once Firestore is
Failed, the listener stops emitting events for this doc, and
SQLite is still running (active, dedup-blocked) so
any in-flight replay is harmless. THEN we flip SQLite to
terminal. The old code did SQLite first and lost a race window
of ~30ms in which listener replay could create a parallel
attempt (the "extra retry round" bug).
The only exception is dispatcher withdraw: admin changed Firestore on their own, so allocator just mirrors to SQLite without touching Firestore back.
Click any card to see all jobs in that state in a modal. Click a row in the modal to drill into the drawer.
running rows. With MAX_WORKERS=1 this is at most 1.
queued rows. Should drain quickly unless workers are stuck.
waiting_cooldown (timed hold) + failed_sync (Firestore-write retry hold) + manual_hold (Manual Add Job, awaiting operator release). The card sub-label shows the breakdown. Not clickable: sources span two SQLite tables, so go to the Process Queue panel below to see the rows.
Three on-demand checks, results cached:
reports collection. Used by the worker too: when a per-doc Firestore write fails, worker calls this to decide whether the whole network is down (→ outage mode) or just the one doc has trouble (→ failed_sync).
/data/auto-job/pipelines/Omni-Health/.env. Picks the first generateContent-capable model dynamically (avoids hardcoding model names that Google deprecates).
http://127.0.0.1:18000/health.
Four charts. Top-right toggle switches all four between Week (7d) and Month (30d):
Click a chart legend to hide/show that series.
For orders where the lab data won't arrive automatically (hospital uses its own ordering system or its own non-partner lab). The order already exists in Firestore but its status is whatever the hospital set — NOT "Data Ready". Manual Add tells the dashboard to track it and email the lab team to upload data.
Adding (click + Manual Add Job in the header):
report_id, click Validate. The dashboard reads the doc, refuses if it's already Data Ready or its service isn't allocator-managed, and adds it to the pending list with all fields auto-filled.
report_id (header optional). Each row is validated independently; failures are listed under the upload box and skipped from the pending list.
Behind the scenes when Submit lands: for each (patient_id, report_id, service), the dashboard creates two directory trees and a patient_info.json in each, sourcing patient + report fields from Firestore. The patient_info.json is what downstream pipelines will read once the lab uploads data. Idempotent: re-submitting after a cancel reuses the existing files (does NOT re-fetch from Firestore; if the data changed in Firestore, force a rewrite via the setup_patient_info.py --force-rewrite CLI).
After submit, holds appear in the Process Queue (purple H# rows, no timer). The operator's job from there: wait until the lab uploads. When the lab confirms, click Release on that row. See Process Queue below for per-row actions.
What "Release" does differently from a normal Data Ready: writes Firestore status=Data Ready PLUS a manualHoldRelease=true field. The dispatcher reads that flag and skips Gen-Decoder's 24h cooldown (lab already confirmed upload, no need to wait) and skips the upload-reminder email. Other services unaffected.
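The effect of the release flag on a newly arrived doc can be sketched like this. The 24h Gen-Decoder hold and the manualHoldRelease flag are from this page; the function `initial_state`, its parameters, and the assumption that a released Gen-Decoder doc goes straight to queued are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GEN_DECODER_COOLDOWN = timedelta(hours=24)


def initial_state(
    service: str, manual_hold_release: bool, now: datetime
) -> tuple[str, Optional[datetime]]:
    """Pick the first state (and scheduled_run_at, if any) for a new row."""
    if service == "Gen-Decoder" and not manual_hold_release:
        # Fresh Gen-Decoder docs are held for 24h.
        return "waiting_cooldown", now + GEN_DECODER_COOLDOWN
    # Other services, or a released manual hold, skip the cooldown.
    return "queued", None


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
held = initial_state("Gen-Decoder", False, now)
released = initial_state("Gen-Decoder", True, now)
```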
One unified table of work that's blocked on something. Two row sources are interleaved:
H# in purple): orders added via + Manual Add Job that are waiting for the operator to click Release. No timer; they sit indefinitely. See the Manual Add Job section below.
#): timed holds: fresh Gen-Decoder 24h cooldowns, retry delays, sync-retry holds (light orange background) and OCR rerun rows (light blue background).
Per-row actions on cooldown rows:
Per-row actions on manual hold rows:
manualHoldRelease=true flag). Listener picks it up. Gen-Decoder skips the 24h cooldown (lab confirmed upload, no need to wait). Pre-checks Firestore current status: if a doc is somehow already Data Ready, the release refuses and keeps the hold in place so you can investigate.
Bulk actions on manual holds: tick the checkbox in any hold row's ID cell to enable the toolbar above the table. The select-all checkbox in the header selects every visible hold (cooldown rows are not selectable). Toolbar offers Release all, Cancel all, Clear selection. Best-effort: any hold that fails (e.g. Firestore is already Data Ready) stays in the queue and the others are still processed; failures are summarised in an alert.
doc_id and patient_id (case-insensitive). 300ms debounce. Enter = search now. Esc = clear.
Opens when you click a row. Shows:
run.log.<N>) this works precisely; with old single run.log files all attempts share the latest log.
Audit trail of every command issued from the dashboard (kill / cancel / restart / etc), with timestamp and result. Useful when something seems weird and you want to know what was clicked recently.
Every email send attempt. Status = ok / failed / skipped (toggle off / no recipients). Lets you confirm an email was actually sent vs got dropped.
Editable without restart unless marked. Click Save to apply; if any restart-required key changed, you get prompted to restart now.
SANDBOX_ENABLED=1 to use values > 1 safely; allocator forces it back to 1 if sandbox is off.
1 = isolated (safe for concurrent workers). 0 = legacy direct-to-disk (only safe with MAX_WORKERS=1). See the Sandbox & concurrency help tab.
Worker creates a new waiting_cooldown row at attempt+1 with delay = RETRY_DELAY_SECONDS. Cooldown promoter picks it up after the delay, worker runs again. Repeat up to MAX_RETRIES.
After MAX_RETRIES attempts all fail, the last row goes failed_permanent, Firestore Failed gets written, permanent-failure email goes out.
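The retry rule above can be written as a small decision function. This is a sketch under stated assumptions: `NextStep` and `on_failure` are hypothetical names, and the values passed in for MAX_RETRIES and RETRY_DELAY_SECONDS are placeholders, since this page does not state their defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NextStep:
    state_for_failed_row: str          # state written on the row that just failed
    new_row_attempt: Optional[int]     # attempt number of the follow-up row, if any
    delay_seconds: Optional[int]       # cooldown before the follow-up row runs


def on_failure(attempt: int, max_retries: int, retry_delay_seconds: int) -> NextStep:
    """Decide what happens after a failed attempt."""
    if attempt < max_retries:
        # Retries remain: schedule a waiting_cooldown row at attempt+1.
        return NextStep("failed_retrying", attempt + 1, retry_delay_seconds)
    # Retries exhausted: terminal failure, no follow-up row.
    return NextStep("failed_permanent", None, None)
```

For example, with three retries allowed, attempt 1 and 2 each schedule a follow-up row, and attempt 3 ends in failed_permanent.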
This is the Gen-Decoder convention with the bash team: "I checked OSS and the input files aren't there yet". Worker creates a new waiting_cooldown row at the SAME attempt number (no retry consumed) with delay = TEMP_FAIL_RETRY_DELAY_SECONDS (default 1h). The doc stays in this loop until either:
succeeded
failed_permanent
failed_permanent
There is no upper bound on temp_failed loops: a doc whose data never arrives will retry forever. If this is a problem, cancel from dashboard or fix the Firestore state.
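A minimal sketch of the temp_failed loop described above: the follow-up row keeps the same attempt number, so no retry is consumed. The 1h default is from this page; the function name, the dict layout, and the use of epoch seconds for scheduled_run_at are assumptions.

```python
TEMP_FAIL_RETRY_DELAY_SECONDS = 3600  # default 1h, per this page


def next_row_after_temp_failed(
    attempt: int,
    now_epoch: int,
    delay_seconds: int = TEMP_FAIL_RETRY_DELAY_SECONDS,
) -> dict:
    """Build the follow-up row for a temp_failed attempt."""
    return {
        "state": "waiting_cooldown",
        "attempt": attempt,  # unchanged: temp_failed does not consume a retry
        "scheduled_run_at": now_epoch + delay_seconds,
    }


row = next_row_after_temp_failed(attempt=2, now_epoch=1_700_000_000)
```

Because the attempt number never advances, nothing in this path ever reaches MAX_RETRIES, which is why the loop has no upper bound.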
Wait, this sounds wrong — a successful pipeline doesn't write Failed. The case in question is: pipeline failed (used up retries → would normally go failed_permanent) BUT the Firestore Failed write itself errors. Worker calls check_connectivity to disambiguate:
failed_sync with [FIRESTORE_SYNC_FAILED] in error_message; worker also inserts a fresh waiting_cooldown row at attempt=1 with FAILED_SYNC_RETRY_DELAY_SECONDS (default 24h). Permanent-failure email goes out. After 24h the cooldown promoter retries the whole pipeline; you can also Skip on the cooldown queue once you've fixed the Firestore-side issue.
If check_connectivity itself fails, the allocator concludes the network is down (vs a per-doc issue) and goes into outage mode:
_firestore_outage_event.
Restart=always respawns.
check_connectivity again. If it passes, normal startup. If not, repeat the sleep+exit cycle.
In-flight rows are not touched in outage mode; they stay in running in SQLite. Recovery on the next successful start handles them like any other crashed-while-running row.
SIGKILL, OOM, server reboot, etc. SQLite rows in running are stranded. On next start, recovery scans them and:
attempt < MAX_RETRIES: original row → failed_retrying, new queued row inserted at attempt+1.
attempt >= MAX_RETRIES: row marked failed_permanent + Firestore Failed written + permanent-failure email.
The SDK silently re-emits ADDED for every doc in the result set every few minutes. Dedup in the dispatcher (see State machine tab) prevents duplicate work: a row in active / succeeded / failed_retrying / temp_failed / failed_sync blocks new inserts. Only failed_permanent passes through dedup, and that's only when the operator resets Firestore to Data Ready (the manual-retry path).
Listener filter status == "Data Ready" stops matching the doc; SDK emits REMOVED. Dispatcher handles REMOVED by withdrawing the SQLite row:
queued / waiting_cooldown / received: marked failed_permanent with "Withdrawn from Firestore". Firestore is NOT touched (admin already chose the new status).
running: NOT touched. Letting the bash finish is safer than yanking it mid-write. The new Firestore status will only matter if the run fails (allocator would then write Failed, overwriting admin's choice; narrow race window).
The kill command sleeps 1.5s before acting (grace period; clicks are human-paced). After the sleep it re-reads SQLite. If state moved to terminal during the sleep, kill is abandoned with a log message. Otherwise: write Firestore Failed → write SQLite failed_permanent (guarded by WHERE state IN active_states, so if the worker raced past, the UPDATE is a no-op and we log a warning) → SIGTERM the subprocess.
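The guarded UPDATE in the kill path above can be sketched with the standard `sqlite3` module. The table name job_runs and the active states are from this page; the column names `id` and `state`, the function `mark_killed`, and the in-memory demo table are assumptions for illustration.

```python
import sqlite3

ACTIVE_STATES = ("received", "waiting_cooldown", "queued", "running")


def mark_killed(conn: sqlite3.Connection, run_id: int) -> bool:
    """Flip a row to failed_permanent only if it is still active.

    Returns False when the UPDATE was a no-op, i.e. the worker already
    moved the row to a terminal state.
    """
    placeholders = ",".join("?" for _ in ACTIVE_STATES)
    cur = conn.execute(
        "UPDATE job_runs SET state = 'failed_permanent' "
        f"WHERE id = ? AND state IN ({placeholders})",
        (run_id, *ACTIVE_STATES),
    )
    conn.commit()
    return cur.rowcount == 1


# Demo on an in-memory database with a hypothetical two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_runs (id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO job_runs (id, state) VALUES (?, ?)",
    [(1, "running"), (2, "succeeded")],
)
conn.commit()
```

Checking `rowcount` is what lets the caller tell a real state change apart from the raced-past case and log a warning for the latter.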
Once per day at server-local 03:00, scheduler runs cleanup_old_rows: deletes terminal rows older than their retention (succeeded 30d / failed_retrying 14d / temp_failed 14d / failed_permanent 90d), prunes ocr_cache (30d) and notifications_log (60d). Active rows are NEVER cleaned up regardless of age.
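The retention rule can be expressed as a lookup plus an age check. The day counts are the ones listed above; `is_expired` and the choice to measure age from a `finished_at` timestamp are assumptions, and states without a listed retention (including all active states) are simply never expired in this sketch.

```python
from datetime import datetime, timedelta

# Retention per terminal state, in days, as listed above.
RETENTION_DAYS = {
    "succeeded": 30,
    "failed_retrying": 14,
    "temp_failed": 14,
    "failed_permanent": 90,
}


def is_expired(state: str, finished_at: datetime, now: datetime) -> bool:
    """Return True when a row is older than the retention for its state."""
    days = RETENTION_DAYS.get(state)
    if days is None:
        # Active states (and any state with no listed retention) are kept.
        return False
    return now - finished_at > timedelta(days=days)


now = datetime(2024, 6, 1, 3, 0)
```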
Use case: pipeline ran to completion, but afterwards the team realized there was a bug in the pipeline code or a bad config. Want to re-run with the fix without manually re-creating the order in Firestore.
Workflow (in the drawer for the succeeded doc):
Why two clicks instead of one: the /api/jobs/.../retry endpoint requires Firestore status=Failed (it's the existing manual-retry path for genuinely-failed jobs). Mark-failed-for-retry is the prep step that gets the system into that state. Combining them into one button would hide the destructive nature of overwriting succeeded → failed_permanent; the two-step click is a deliberate friction.
Caveat: between step 1 and step 2, the doc shows status=Failed in Firestore and any downstream consumer (patient app / clinical UI) will display it as failed. Keep the gap short.
Not for: jobs that genuinely failed (Retry alone handles those, no Mark-failed-for-retry needed). Not for: Omni-Health re-runs that just need OCR re-edited (use Regenerate report with edited OCR instead — it doesn't touch Firestore status).
Pipeline code from teammates can write intermediate files inside their own repo directories — sometimes without including a patient ID in the filename. With one worker that's fine. With two or more workers running at the same time, two concurrent jobs of the same service would overwrite each other's intermediates, producing wrong or corrupted reports.
The sandbox solves this by giving every job a private
filesystem view: a per-run scratch directory that the pipeline
writes into, with the real patient base copied in at the start and
produced outputs copied back out at the end. The rest of the
filesystem is mounted read-only, so any rogue write outside
the planned area fails fast with EROFS instead of
silently colliding with another worker.
/var/tmp/lomi-scratch/run-<run_id>/ with three empty subdirs: base/, user-data/, repo-writes/.
scratch/base/ and scratch/user-data/.
bwrap (bubblewrap) with: root filesystem read-only, scratch read-write, /tmp as a fresh per-run tmpfs, ~/.cache rw-bound to the host's real cache directory (so R sesame / BiocFileCache / HuggingFace / pip caches stay visible AND writable; SQLite-backed caches like BiocFileCache need to update access metadata even on read), and the per-attempt run.log single-file-bound to the real on-disk log so the dashboard can still tail it live.
LOMI_BASE_DIR=<scratch>/base and LOMI_USER_DIR=<scratch>/user-data. It reads inputs and writes outputs to these paths.
scratch/ back to the real patient base and the real user-data dir.
rm -rf'd.
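The mount layout described above maps onto bubblewrap options roughly as follows. This sketch only builds the argument list and does not run anything; `build_bwrap_argv`, its parameters, and the example paths are hypothetical, and the allocator's real invocation may use additional options not mentioned on this page.

```python
def build_bwrap_argv(
    scratch: str, run_log: str, cache_dir: str, command: list[str]
) -> list[str]:
    """Assemble a bwrap command line for one sandboxed run."""
    return [
        "bwrap",
        "--ro-bind", "/", "/",            # root filesystem read-only
        "--tmpfs", "/tmp",                # fresh per-run /tmp
        "--bind", scratch, scratch,       # scratch dir read-write
        "--bind", cache_dir, cache_dir,   # host cache stays visible and writable
        "--bind", run_log, run_log,       # single-file bind for the live log
        "--setenv", "LOMI_BASE_DIR", f"{scratch}/base",
        "--setenv", "LOMI_USER_DIR", f"{scratch}/user-data",
        *command,                         # first non-option argument starts the command
    ]


argv = build_bwrap_argv(
    scratch="/var/tmp/lomi-scratch/run-42",   # example run id
    run_log="/tmp/example/run.log.1",         # hypothetical log path
    cache_dir="/home/operator/.cache",        # hypothetical home dir
    command=["bash", "run_example.sh"],       # hypothetical runner script
)
```

The read-only root bind comes first so that the later read-write binds are layered on top of it.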
The three pipeline runner scripts now read LOMI_BASE_DIR
/ LOMI_USER_DIR / LOMI_MINICONDA_BASE
with sensible defaults:
BASE_DIR="${LOMI_BASE_DIR:-/data/ql-patient-base/Patients}"
USER_DIR="${LOMI_USER_DIR:-/data/ql-user-data/Patients}"
CONDA_BASE="${LOMI_MINICONDA_BASE:-$HOME/miniconda3}"
Inside a sandbox, the allocator sets the first two to point at the
scratch dir; outside the sandbox they fall back to the real paths
(matching the pre-sandbox behaviour exactly). The conda one is
independent of the sandbox — it just removes the hardcoded
/home/lingzhen/miniconda3 so any operator in the
auto-job group can run the system on their own home dir.
The sandbox can be disabled at runtime from the Config modal:
find the SANDBOX_ENABLED tunable, set it to 0,
save, then restart the allocator (the Sysadmin tab has a
ready-to-copy restart command). When disabled, every job runs on
the legacy direct-to-disk path — byte-for-byte identical to
how the system ran before the sandbox was added. In that mode the
allocator also caps MAX_WORKERS at 1 automatically
(data corruption with multiple un-sandboxed workers is worse than
slow throughput).
Alternative: set the environment variable
LOMI_SANDBOX_ENABLED=0 in the systemd unit file. The
dashboard toggle takes precedence on every restart, so the env var
only matters as a startup default; if both are set, the dashboard
value wins. Use the dashboard toggle for "I want to flip this now",
use the env var for "this host can never run bwrap".
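The precedence rule and the worker cap can be sketched together. The "dashboard value wins" rule and the cap at 1 are from this page; the function names, the treatment of any env value other than "0" as enabled, and the default of enabled when neither source is set are assumptions.

```python
from typing import Optional


def resolve_sandbox_enabled(
    dashboard_value: Optional[int], env_value: Optional[str]
) -> bool:
    """Combine the dashboard tunable with LOMI_SANDBOX_ENABLED."""
    if dashboard_value is not None:
        # The dashboard toggle takes precedence whenever it is set.
        return dashboard_value == 1
    if env_value is not None:
        # The env var only acts as a startup default.
        return env_value.strip() != "0"
    return True  # assumption: sandbox is on when nothing is configured


def effective_max_workers(requested: int, sandbox_enabled: bool) -> int:
    """Cap workers at 1 when jobs run un-sandboxed."""
    return requested if sandbox_enabled else 1
```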
Most pipeline writes go into the patient base, which is already
handled by the data-courier flow. But some teammates' code writes
into the pipeline repo itself — for example
epi-insight-auto-pipeline-v2/output/debug_report.html
is written without a patient ID qualifier. Those writes can't be
allowed into the read-only root, so they get redirected into the
scratch dir via an explicit whitelist in
config.SANDBOX_REPO_WRITE_WHITELIST.
When you add a new pipeline (or notice an existing pipeline starts
failing with EROFS or Read-only file system
in the run.log), here's the procedure:
OSError: [Errno 30] Read-only file system: '/data/auto-job/pipelines/<repo>/<path>'
open(...) failed: Read-only file system
/data/auto-job/pipelines/<repo>/) and the write is genuinely necessary (don't whitelist a write that should have gone to patient base; fix the pipeline code for that), add a tuple to SANDBOX_REPO_WRITE_WHITELIST in config.py:
SANDBOX_REPO_WRITE_WHITELIST: list[tuple[str, str]] = [
...
("<repo_name>", "<path_relative_to_repo>"),
]
For a directory entry, pass the directory path ("output"). For a single file entry, pass the file path ("style.css"). The allocator handles both correctly.
__pycache__: the right fix is usually an environment variable, not a whitelist entry. For __pycache__ we set PYTHONDONTWRITEBYTECODE=1 by default; analogous patterns exist for R, Java, etc. Prefer env-var solutions when possible; they're cleaner than per-file binds.
sudo yum install bubblewrap. On Ubuntu / Debian: sudo apt install bubblewrap. Restart allocator.
SANDBOX_ENABLED=0 in config, restart, and verify the log appears in the legacy path; if yes, sandbox is the culprit.
kill -9) and then didn't restart cleanly. They'll be swept on next restart, or you can safely rm -rf them by hand if disk pressure is tight.
The first sandbox implementation used unshare +
per-repo overlayfs mounts. It worked in PoC on a
clean ext4 filesystem, but failed in production with
EOVERFLOW during overlay copy_up of root-owned
directories on this kernel (Linux 5.10 + Alibaba Cloud Linux 3
+ unprivileged user namespace). After eight unsuccessful workaround
attempts — each pinned to a different hypothesis (32-bit
stat, xino, ext4 64bit, owner mapping, sub-uid, etc.) — we
switched to bubblewrap, which side-steps overlayfs entirely. The
PoC suite for bwrap passed 27/27 isolation and transparency
checks on the same kernel.
These need shell access on the server (with sudo). They can't run from this dashboard. Hover before clicking Copy if you want to read; the button turns green when copied.
This re-runs the Omni-Health pipeline using the OCR data from the previous successful run. Use this only after editing the OCR on Quantum Life Medical Report Manager.
Doc: —
Manually add jobs (e.g. orders that exist in the database but whose lab data won't arrive automatically). On submission these go into the Process Queue and a notification is sent to the corresponding personnel; then wait for the lab data upload. Click Release to run the job after the lab data is uploaded.
This feature batch-reruns jobs: it first runs Mark failed for retry on
the selected jobs, then Retry. It is intended for re-running after a
pipeline bug fix or update. Check that the pipeline scripts have been
updated to the required version first.
If you are authorized and want to continue, select jobs below and
click Continue. Do not close or refresh the page while the batch
is running.
| Doc | Service | State | Patient PID | Status | |
|---|---|---|---|---|---|
| loading… | |||||
Edit numeric tunables. Changes are saved to SQLite and read by the allocator
on its next periodic tick (or after restart for items marked restart-required).
Static config (paths, URLs) must be edited in config.py.
| Key | Default | Current | Restart? | Last updated |
|---|---|---|---|---|
| loading… | ||||