Supporting Reference 4: Logging and Failure Handling
This page defines where failures are recorded and how to interpret logs during operations and incident response.
R4.1 Logging design rule
Failures must be explicit, traceable, and recoverable. The system favors visible stage errors and DB state over silent partial success.
R4.2 Primary log surfaces
logs/latest_run.log - orchestrator and stage execution stream from main_set.py.
- Startup runtime marker in latest_run.log - [INFO] Ollama startup check: model=... processor=GPU/CPU context=... vram=....
logs/db_uploader.log - uploader events and publish-side errors from db_uploader.py.
data/crash_runtime.log - callback/runtime crash traces (for example add-set event failures).
data/prefill_qc_last.json - duplicate/suspicious prefill QC report generated before review editor opens.
data/run_log.txt - run summary and high-level stage notes.
crash_startup.log - startup crash trace (when available, near script/EXE root).
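The startup runtime marker can be checked programmatically. The following is a minimal sketch, assuming the marker carries whitespace-separated key=value tokens after the "Ollama startup check:" text; the function name and the sample values are hypothetical, and the real line may include fields or formatting not shown on this page.

```python
import re

# Assumption: fields follow the marker text as whitespace-separated key=value tokens.
_MARKER = re.compile(r"Ollama startup check:\s*(?P<fields>.*)")


def parse_startup_marker(line):
    """Return the key=value fields of a startup marker line, or None if absent."""
    match = _MARKER.search(line)
    if match is None:
        return None
    fields = {}
    for token in match.group("fields").split():
        key, sep, value = token.partition("=")
        if sep:  # keep only tokens that actually contain '='
            fields[key] = value
    return fields


# Hypothetical sample line; the field values are invented for illustration.
sample = "[INFO] Ollama startup check: model=example-model processor=CPU context=4096 vram=0"
info = parse_startup_marker(sample)
if info is not None and info.get("processor") == "CPU":
    print("Ollama started in CPU mode; check GPU bootstrapping/env settings")
```

Returning None for lines without the marker lets a caller scan the whole of latest_run.log and keep only the first match.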
R4.3 Typical failure domains
- Ollama not reachable or required model missing during prefill stage.
- Ollama starts in CPU mode unexpectedly (GPU bootstrapping/env mismatch).
- Primary caption model failure requiring row-level retries (and optional fallback model retries when configured).
- Stage 6 child timeout/quarantine events preventing endless loops on repeated native crash rows.
- Generated metadata duplication/suspicious text detected by prefill QC scan.
- Quality scoring dependency/runtime failures (including interpreter/DLL mismatch).
- Score completeness regressions (for example null brisque_score or clip_aesthetic_score).
- Resize/write issues causing QC_Status='ResizeFailed'.
- FTP/MySQL connectivity or authentication failures during publish.
- Filename collision or reservation inconsistencies.
- Text quality artifacts that require sentence cleanup (hyphen/grammar drift).
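Two of the domains above (score completeness regressions and resize failures) are visible directly in queue state. The sketch below shows one way to list affected rows. It assumes review_queue exposes File_Name, QC_Status, brisque_score, and clip_aesthetic_score as columns; the storage engine is not stated on this page, so an in-memory sqlite3 database with invented rows stands in for it, and the 'Pending' and 'OK' status values are hypothetical.

```python
import sqlite3


def find_incomplete_rows(conn):
    """Return (File_Name, QC_Status) for rows with a missing score or a failed resize."""
    query = """
        SELECT File_Name, QC_Status
        FROM review_queue
        WHERE brisque_score IS NULL
           OR clip_aesthetic_score IS NULL
           OR QC_Status = 'ResizeFailed'
        ORDER BY File_Name
    """
    return conn.execute(query).fetchall()


# In-memory stand-in with invented rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE review_queue ("
    "File_Name TEXT, Review_Status TEXT, QC_Status TEXT, "
    "brisque_score REAL, clip_aesthetic_score REAL)"
)
conn.executemany(
    "INSERT INTO review_queue VALUES (?, ?, ?, ?, ?)",
    [
        ("a.jpg", "Pending", "OK", 31.2, 5.4),
        ("b.jpg", "Pending", "OK", None, 5.1),
        ("c.jpg", "Pending", "ResizeFailed", 28.0, 4.9),
    ],
)
print(find_incomplete_rows(conn))
```

With the invented rows, only b.jpg (null brisque_score) and c.jpg (ResizeFailed) are returned.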
R4.4 State-to-log correlation
Log analysis should be correlated with queue state in review_queue.
- Use Review_Status to scope lifecycle phase of failures.
- Use QC_Status to identify quality/resize-related issues quickly.
- Use [RETRY]/[RETRY-OK] lines to verify optional fallback model scope and success when fallback is enabled.
- Use timeout/quarantine lines (for example idle timeout warnings and PrefillNativeCrash) to identify rows isolated from continued retries.
- For publish problems, cross-check row decisions against uploader failures by File_Name.
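A quick tally of retry and quarantine markers can show whether fallback retries succeeded. The sketch below counts lines by substring only; the surrounding text of each log line is not documented here, so the sample lines are invented and the function name is hypothetical.

```python
from collections import Counter

# Marker strings taken from this page; everything else on a line is ignored.
MARKERS = ("[RETRY]", "[RETRY-OK]", "PrefillNativeCrash")


def count_markers(lines):
    """Count how many lines contain each marker string."""
    counts = Counter()
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Invented sample lines, for illustration only.
sample_lines = [
    "[RETRY] row=12 using fallback model",
    "[RETRY-OK] row=12",
    "[RETRY] row=15 using fallback model",
    "[WARN] row=15 quarantined: PrefillNativeCrash",
]
counts = count_markers(sample_lines)
unresolved = counts["[RETRY]"] - counts["[RETRY-OK]"]
print(f"retries={counts['[RETRY]']} ok={counts['[RETRY-OK]']} unresolved={unresolved}")
```

A positive unresolved count suggests retries that never reported success; those rows are worth checking against quarantine lines and queue state.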
R4.5 Triage workflow
- Identify the first fatal/error line in logs/latest_run.log.
- If the publish path is involved, inspect the first failing row in logs/db_uploader.log.
- Query queue rows to estimate impact scope.
- Apply targeted fix and re-run through standard workflow.
- If partial state is uncertain, follow rollback-aware runbook path.
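The first triage step can be sketched as a small scan for the earliest error line. This assumes severity appears as a bracketed tag in the same style as the documented [INFO] marker; the [ERROR], [FATAL], and [CRITICAL] tags are assumptions, not confirmed log levels, and the sample lines are invented.

```python
import re

# Assumption: error severities use bracketed tags like the documented [INFO] marker.
_SEVERITY = re.compile(r"\[(ERROR|FATAL|CRITICAL)\]")


def first_error(lines):
    """Return (line_number, text) of the first error-level line, or None."""
    for number, line in enumerate(lines, start=1):
        if _SEVERITY.search(line):
            return number, line.rstrip("\n")
    return None


# Invented sample lines, for illustration only.
sample_log = [
    "[INFO] stage 1 started\n",
    "[ERROR] prefill failed: connection refused\n",
    "[ERROR] stage aborted\n",
]
print(first_error(sample_log))
```

Because the function accepts any iterable of lines, an open file handle for logs/latest_run.log or logs/db_uploader.log can be passed in directly.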
R4.6 Retention and review habits
- Keep core operational logs (latest_run.log, db_uploader.log) for recent batches during active tuning periods.
- Capture the startup processor line for run evidence when diagnosing CPU/GPU runtime differences.
- Treat dry-run outputs as disposable after validation; clean temporary compare/test logs to avoid noise growth.
- Capture first error line and row/file identifiers when reporting incidents.
- Avoid overwriting troubleshooting context before root cause is confirmed.
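One way to avoid overwriting troubleshooting context is to copy the log surfaces into a timestamped folder before re-running. The following is a minimal sketch, not part of the documented tooling: the function name, the incident_ folder naming, and the destination layout are all hypothetical, and crash_startup.log is left out because its location varies.

```python
import shutil
import time
from pathlib import Path

# Relative paths taken from the log surfaces listed on this page.
LOG_PATHS = [
    "logs/latest_run.log",
    "logs/db_uploader.log",
    "data/crash_runtime.log",
    "data/prefill_qc_last.json",
    "data/run_log.txt",
]


def snapshot_logs(root, dest_parent, paths=LOG_PATHS):
    """Copy existing log files under root into a new timestamped folder."""
    root = Path(root)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    dest = Path(dest_parent) / f"incident_{stamp}"
    dest.mkdir(parents=True, exist_ok=False)  # refuse to reuse an existing snapshot
    copied = []
    for rel in paths:
        src = root / rel
        if src.is_file():  # skip surfaces that were not produced in this run
            target = dest / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)  # copy2 also preserves modification times
            copied.append(rel)
    return dest, copied
```

A call such as snapshot_logs(".", "incident_snapshots") from the project root would return the snapshot folder and the list of files that were actually present.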