Supporting Reference 4: Logging and Failure Handling
This page defines where failures are recorded and how to interpret logs during operations and incident response.
R4.1 Logging design rule
Failures must be explicit, traceable, and recoverable. The system favors visible stage errors and DB state over silent partial success.
R4.2 Primary log surfaces
- logs/latest_run.log - orchestrator and stage execution stream from main_set.py.
- logs/db_uploader.log - uploader events and publish-side errors from db_uploader.py.
- data/crash_runtime.log - callback/runtime crash traces (for example, add-set event failures).
- data/run_log.txt - run summary and high-level stage notes.
- crash_startup.log - startup crash trace (when available; written near the script/EXE root).
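As a quick orientation aid, the log surfaces listed above can be surveyed in one pass. The following is a minimal sketch, not part of the project: the function name survey_log_surfaces and the choice of base directory are assumptions, and only the relative paths come from this page.

```python
from datetime import datetime
from pathlib import Path

# Relative paths as listed in R4.2. The base directory they are resolved
# against is an assumption (crash_startup.log is only "near" the root).
LOG_SURFACES = [
    "logs/latest_run.log",
    "logs/db_uploader.log",
    "data/crash_runtime.log",
    "data/run_log.txt",
    "crash_startup.log",
]


def survey_log_surfaces(root):
    """Return (relative_path, exists, last_modified) for each log surface."""
    root = Path(root)
    report = []
    for rel in LOG_SURFACES:
        path = root / rel
        if path.is_file():
            modified = datetime.fromtimestamp(path.stat().st_mtime)
            report.append((rel, True, modified))
        else:
            # Missing files are reported, not treated as errors: crash logs
            # may legitimately be absent when nothing has crashed.
            report.append((rel, False, None))
    return report


if __name__ == "__main__":
    for rel, exists, modified in survey_log_surfaces("."):
        status = modified.isoformat(timespec="seconds") if exists else "missing"
        print(f"{rel}: {status}")
```

Run from the project root, this prints one line per surface with either its last-modified time or "missing", which helps confirm which logs a given run actually touched.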
R4.3 Typical failure domains
- Ollama not reachable or required model missing during prefill stage.
- Quality scoring dependency/runtime failures (including interpreter/DLL mismatch).
- Resize/write issues causing QC_Status='ResizeFailed'.
- FTP/MySQL connectivity or authentication failures during publish.
- Filename collision or reservation inconsistencies.
R4.4 State-to-log correlation
Log analysis should be correlated with queue state in review_queue.
- Use Review_Status to scope lifecycle phase of failures.
- Use QC_Status to identify quality/resize-related issues quickly.
- For publish problems, cross-check row decisions against uploader failures by File_Name.
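The correlation steps above might be sketched as two small queries against review_queue. This is an illustration under stated assumptions: the helper names summarize_queue and rows_for_files are hypothetical, the connection is any DB-API connection, and the "?" placeholder style shown is sqlite3's (MySQL drivers typically use "%s"). Only the table name and the columns Review_Status, QC_Status, and File_Name come from this page.

```python
import sqlite3  # stand-in driver for the sketch; the real store may differ


def summarize_queue(conn):
    """Count review_queue rows per (Review_Status, QC_Status) pair."""
    cur = conn.cursor()
    cur.execute(
        "SELECT Review_Status, QC_Status, COUNT(*) "
        "FROM review_queue "
        "GROUP BY Review_Status, QC_Status "
        "ORDER BY Review_Status, QC_Status"
    )
    return cur.fetchall()


def rows_for_files(conn, file_names):
    """Fetch queue rows for File_Name values seen in uploader failures."""
    file_names = list(file_names)
    if not file_names:
        return []
    # One placeholder per name keeps the query parameterized.
    # "?" is the sqlite3 style; adjust for the driver actually in use.
    placeholders = ", ".join("?" for _ in file_names)
    cur = conn.cursor()
    cur.execute(
        "SELECT File_Name, Review_Status, QC_Status FROM review_queue "
        f"WHERE File_Name IN ({placeholders}) ORDER BY File_Name",
        file_names,
    )
    return cur.fetchall()
```

The summary gives a fast read on how many rows sit in each lifecycle phase and QC state, while the second helper narrows attention to the specific files named in logs/db_uploader.log.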
R4.5 Triage workflow
- Identify first fatal/error in logs/latest_run.log.
- If publish path involved, inspect first failing row in logs/db_uploader.log.
- Query queue rows to estimate impact scope.
- Apply targeted fix and re-run through standard workflow.
- If partial state is uncertain, follow rollback-aware runbook path.
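The first triage step, locating the first fatal/error entry, could be automated along these lines. This is a hedged sketch: the severity markers FATAL, CRITICAL, and ERROR are assumptions about the log format, which this page does not specify, and the function names are hypothetical.

```python
import re

# Assumed severity markers; adjust to whatever the logs actually emit.
ERROR_PATTERN = re.compile(r"\b(FATAL|CRITICAL|ERROR)\b")


def first_error_line(lines):
    """Return (line_number, text) for the first matching line, else None."""
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            return number, line.rstrip("\n")
    return None


def first_error_in_file(path):
    """Scan a log file lazily and stop at the first error-like line."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return first_error_line(handle)
```

Calling first_error_in_file("logs/latest_run.log") and then first_error_in_file("logs/db_uploader.log") mirrors the order of the workflow above; the returned line number and text are also the details worth quoting in an incident report.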
R4.6 Retention and review habits
- Keep preflight reports and run logs for at least the most recent batches during active tuning periods.
- Capture first error line and row/file identifiers when reporting incidents.
- Avoid overwriting troubleshooting context before root cause is confirmed.