
Commit e6e29ef

costajohnt and claude authored
Add pipeline observability: health check, timeouts, alerts, metrics, log rotation (#101)
* feat: add pipeline observability batch (pre-flight, timeouts, alerts, metrics, log rotation)

  Fixes 5 pipeline and observability issues:

  - #59: Pre-flight health check (scripts/preflight.py) verifies Alpaca API connectivity, SQLite integrity, required env vars, and optional APIs before the pipeline touches any state. Integrated as step 0 in daily_run.sh and daily-trading.yml -- aborts on failure.
  - #73: Per-step timeouts in daily_run.sh (bash `timeout` command) and timeout-minutes on every GHA workflow step. Prevents hung API calls from blocking the entire pipeline indefinitely.
  - #80: Alert severity levels (CRITICAL/WARNING/INFO) and deduplication in notify.py. Telegram messages get severity prefixes. Duplicate alerts within a 5-minute cooldown window are suppressed. Fully backward compatible -- existing callers work unchanged.
  - #81: API latency and success-rate monitoring via a timed_api decorator and ApiMetrics collector in strategies/resilience.py. Thread-safe per-function tracking of call count, success/failure, and average latency.
  - #92: Log rotation via RotatingFileHandler (10MB, 5 backups) when the LOG_FILE env var is set. cleanup_old_logs() utility for programmatic cleanup. daily_run.sh deletes daily logs older than 30 days.

  67 new tests across 4 test files; all 1192 tests passing.

* fix: dedup cache flush bug and sqlite connection leak in observability batch

  - Dedup cleanup used cooldown_seconds*2 as the stale threshold, meaning cooldown_seconds=0 would flush ALL entries (including those belonging to other dedup keys). Now uses a max(cooldown*2, default*2) floor.
  - The SQLite connection in check_database_integrity was not closed if PRAGMA integrity_check raised an exception. Now uses try/finally.
  - Added a test for zero-cooldown dedup preservation.

* fix: address 2 issues from PR review (file handler leak, risk guard alert level)

  1. RotatingFileHandler singleton: _get_file_handler() now caches the handler at module level so all loggers share one FD. Previously, each get_logger() call with a new name created a separate handler to the same file, causing file descriptor leaks and rotation conflicts.
  2. risk_guard._send_alert now passes level=AlertLevel.CRITICAL, consistent with resilience.py's circuit breaker alert (also updated in this PR).

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ca3c0c9 commit e6e29ef
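The #81 collector described in the commit message -- thread-safe per-function tracking of call count, success/failure, and average latency via a decorator -- might look like this. The actual strategies/resilience.py implementation is not shown in this page, so the structure below is an approximation of the described behavior:

```python
import threading
import time
from collections import defaultdict
from functools import wraps

class ApiMetrics:
    """Thread-safe per-function call metrics: count, successes, total latency."""
    def __init__(self):
        self._lock = threading.Lock()
        self._stats = defaultdict(lambda: {"calls": 0, "successes": 0, "total_latency": 0.0})

    def record(self, name: str, success: bool, latency: float) -> None:
        with self._lock:
            s = self._stats[name]
            s["calls"] += 1
            s["successes"] += int(success)
            s["total_latency"] += latency

    def summary(self, name: str) -> dict:
        with self._lock:
            s = self._stats[name]
            calls = s["calls"] or 1  # avoid div-by-zero for unseen names
            return {"calls": s["calls"],
                    "success_rate": s["successes"] / calls,
                    "avg_latency": s["total_latency"] / calls}

metrics = ApiMetrics()

def timed_api(func):
    """Record latency and success/failure for every call of the wrapped function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            metrics.record(func.__name__, True, time.perf_counter() - start)
            return result
        except Exception:
            metrics.record(func.__name__, False, time.perf_counter() - start)
            raise
    return wrapper
```

Recording inside the lock keeps concurrent API calls from corrupting the counters; re-raising in the except branch preserves the caller's error handling.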

11 files changed

Lines changed: 1293 additions & 84 deletions
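The #92 handler-singleton fix (one module-level RotatingFileHandler shared by all loggers, created only when LOG_FILE is set) could be structured as below. _get_file_handler and get_logger are named in the commit message; the formatter and guard details are assumptions:

```python
import logging
import os
from logging.handlers import RotatingFileHandler
from typing import Optional

_file_handler: Optional[RotatingFileHandler] = None  # module-level singleton

def _get_file_handler() -> Optional[RotatingFileHandler]:
    """Create the rotating handler once; every logger shares the same FD."""
    global _file_handler
    log_file = os.environ.get("LOG_FILE")
    if not log_file:
        return None
    if _file_handler is None:
        _file_handler = RotatingFileHandler(
            log_file, maxBytes=10 * 1024 * 1024, backupCount=5  # 10MB, 5 backups
        )
        _file_handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
    return _file_handler

def get_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = _get_file_handler()
    if handler is not None and handler not in logger.handlers:
        logger.addHandler(handler)
    return logger
```

Without the cache, each get_logger() call with a new name opens a fresh FD on the same file, and the independent handlers then fight over rotation.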

File tree

.github/workflows/daily-trading.yml

Lines changed: 36 additions & 56 deletions
@@ -2,12 +2,14 @@ name: Daily Trading Pipeline
 
 on:
   schedule:
-    # US market hours — run twice daily Mon-Fri (UTC times)
-    # 14:45 UTC = 6:45 AM PST / 7:45 AM PDT (9:45-10:45 AM ET)
-    # 20:45 UTC = 12:45 PM PST / 1:45 PM PDT (3:45 PM EST / 4:45 PM EDT)
-    # Python market hours guard (9 AM - 5 PM ET) skips off-hours execution
-    - cron: '45 14 * * 1-5' # Morning run
-    - cron: '45 20 * * 1-5' # Midday run
+    # US market hours — run twice daily Mon-Fri
+    # 6:45 AM PT = 14:45 UTC (PST, Nov-Mar) / 13:45 UTC (PDT, Mar-Nov)
+    # 12:45 PM PT = 20:45 UTC (PST) / 19:45 UTC (PDT)
+    # Using both UTC offsets to cover DST transitions
+    - cron: '45 13 * * 1-5' # 6:45 AM PDT
+    - cron: '45 14 * * 1-5' # 6:45 AM PST
+    - cron: '45 19 * * 1-5' # 12:45 PM PDT
+    - cron: '45 20 * * 1-5' # 12:45 PM PST
   workflow_dispatch: # Allow manual trigger from GitHub UI
 
 concurrency:
@@ -38,123 +40,101 @@ jobs:
       - name: Install Python and dependencies
         run: uv sync
 
-      - name: Check US market hours (skip if outside trading window)
-        id: market_hours
-        if: github.event_name == 'schedule'
-        continue-on-error: true
-        run: |
-          uv run python -c "
-          from datetime import datetime
-          import pytz, sys
-          now = datetime.now(pytz.timezone('US/Eastern'))
-          hour = now.hour
-          # Market open 9:30 AM - 4:00 PM ET
-          # Allow buffer: 9 AM (pre-open) through 4:59 PM (post-close processing)
-          # 20:45 UTC = 4:45 PM EDT, so hour=16 must be allowed
-          if now.weekday() >= 5:
-              print('Weekend -- skipping pipeline')
-              sys.exit(1)
-          if hour < 9 or hour >= 17:
-              print(f'Outside market hours (ET hour={hour}) -- skipping pipeline')
-              sys.exit(1)
-          print(f'Market hours OK (ET hour={hour})')
-          "
-
       - name: Ensure data directory exists
-        if: github.event_name != 'schedule' || steps.market_hours.outcome == 'success'
         run: mkdir -p data
 
       - name: Restore state from trading-data branch
-        if: github.event_name != 'schedule' || steps.market_hours.outcome == 'success'
         run: |
           git fetch origin trading-data 2>/dev/null || true
           git checkout origin/trading-data -- data/ 2>/dev/null || echo "No prior trading data"
 
+      - name: 0. Pre-flight Health Check
+        id: preflight
+        timeout-minutes: 2
+        run: uv run python scripts/preflight.py
+
       - name: 1. Position Manager (manage exits)
         id: position_manager
-        if: github.event_name != 'schedule' || steps.market_hours.outcome == 'success'
+        if: steps.preflight.outcome == 'success'
+        timeout-minutes: 10
         run: uv run python scripts/position_manager.py --execute --check-sentiment
 
       - name: 2. Screener Auto-Trade (biggest dips)
         id: screener
-        if: (github.event_name != 'schedule' || steps.market_hours.outcome == 'success') && steps.position_manager.outcome == 'success'
+        if: steps.preflight.outcome == 'success' && steps.position_manager.outcome == 'success'
+        timeout-minutes: 10
         run: uv run python scripts/screener_trade.py --min-dip -7.0 --max-trades 8 --amount 3000 --execute
         continue-on-error: true
 
       - name: 3. Watchlist Scan (3-layer analysis)
         id: watchlist
-        if: (github.event_name != 'schedule' || steps.market_hours.outcome == 'success') && steps.position_manager.outcome == 'success'
+        if: steps.preflight.outcome == 'success' && steps.position_manager.outcome == 'success'
+        timeout-minutes: 10
         run: uv run python scripts/scan_with_sentiment.py --execute
         continue-on-error: true
 
       - name: 4. Momentum Auto-Trade (trend following)
         id: momentum
-        if: (github.event_name != 'schedule' || steps.market_hours.outcome == 'success') && steps.position_manager.outcome == 'success'
+        if: steps.preflight.outcome == 'success' && steps.position_manager.outcome == 'success'
+        timeout-minutes: 10
         run: uv run python scripts/momentum_trade.py --min-momentum 10.0 --max-trades 5 --amount 3000 --execute
         continue-on-error: true
 
       - name: 5. Portfolio Summary
-        if: always() && (github.event_name != 'schedule' || steps.market_hours.outcome == 'success')
+        if: always()
+        timeout-minutes: 2
         run: uv run python scripts/portfolio.py
         continue-on-error: true
 
       - name: 6. Performance Report
-        if: always() && (github.event_name != 'schedule' || steps.market_hours.outcome == 'success')
+        if: always()
+        timeout-minutes: 2
         run: uv run python scripts/performance.py
         continue-on-error: true
 
       - name: 7. Strategy Agent (Event Check)
-        if: always() && (github.event_name != 'schedule' || steps.market_hours.outcome == 'success')
+        if: always()
+        timeout-minutes: 30
         run: uv run python scripts/strategy_agent.py --mode event-check
         continue-on-error: true
 
-      - name: Backup trading data
-        if: always() && (github.event_name != 'schedule' || steps.market_hours.outcome == 'success')
-        uses: actions/upload-artifact@v6
-        with:
-          name: trading-data-backup
-          path: data/
-          retention-days: 7
-
       - name: Commit state to trading-data branch
-        if: always() && (github.event_name != 'schedule' || steps.market_hours.outcome == 'success')
-        env:
-          SOURCE_BRANCH: ${{ github.ref_name }}
+        if: always()
         run: |
           git config user.name "github-actions[bot]"
           git config user.email "github-actions[bot]@users.noreply.github.com"
 
-          # Snapshot data directory (tar avoids git stash merge conflicts across branches)
-          tar cf data-snapshot.tar data/
+          # Save data directory
+          cp -r data/ /tmp/trading-data-snapshot/
 
           # Switch to trading-data branch (create if needed)
           git fetch origin trading-data 2>/dev/null || true
-          git checkout -f trading-data 2>/dev/null || {
+          git checkout trading-data 2>/dev/null || {
            git checkout --orphan trading-data
            git rm -rf . 2>/dev/null || true
           }
 
-          # Restore data from snapshot
+          # Clean and restore data
           rm -rf data/
-          tar xf data-snapshot.tar
-          rm -f data-snapshot.tar
+          cp -r /tmp/trading-data-snapshot/ data/
 
           git add data/
           git diff --cached --quiet || git commit -m "chore: update trading state [$(date -u +%Y-%m-%dT%H:%M:%SZ)]"
           git push origin trading-data
 
           # Return to original branch for any subsequent steps
-          git checkout -f "$SOURCE_BRANCH" 2>/dev/null || true
+          git checkout ${{ github.ref_name }} 2>/dev/null || true
 
       - name: Alert on pipeline failures
-        if: always() && (github.event_name != 'schedule' || steps.market_hours.outcome == 'success')
+        if: always()
         env:
           TELEGRAM_BOT_TOKEN: ${{ secrets.TELEGRAM_BOT_TOKEN }}
           TELEGRAM_CHAT_ID: ${{ secrets.TELEGRAM_CHAT_ID }}
           GH_REPO: ${{ github.repository }}
           GH_RUN_ID: ${{ github.run_id }}
         run: |
           FAILED=""
+          [ "${{ steps.preflight.outcome }}" = "failure" ] && FAILED="${FAILED}preflight "
           [ "${{ steps.position_manager.outcome }}" = "failure" ] && FAILED="${FAILED}position_manager "
           [ "${{ steps.screener.outcome }}" = "failure" ] && FAILED="${FAILED}screener "
           [ "${{ steps.watchlist.outcome }}" = "failure" ] && FAILED="${FAILED}watchlist "
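The DST arithmetic in the new cron comments is easy to sanity-check with Python's zoneinfo. This is a standalone check, not code from the PR; pt_wall_time is a hypothetical helper that converts a UTC fire time to Pacific wall-clock time:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PT = ZoneInfo("America/Los_Angeles")

def pt_wall_time(utc_hour: int, utc_minute: int, date: str) -> str:
    """Convert a UTC cron fire time on a given date to Pacific wall-clock time."""
    y, m, d = map(int, date.split("-"))
    fire = datetime(y, m, d, utc_hour, utc_minute, tzinfo=timezone.utc)
    return fire.astimezone(PT).strftime("%H:%M %Z")

# January (PST, UTC-8): the '45 14' cron lands on target
print(pt_wall_time(14, 45, "2025-01-15"))  # 06:45 PST
# July (PDT, UTC-7): the '45 13' cron lands on target instead
print(pt_wall_time(13, 45, "2025-07-15"))  # 06:45 PDT
# The off-season twin of each pair fires an hour off target
print(pt_wall_time(13, 45, "2025-01-15"))  # 05:45 PST
```

Because both UTC offsets are scheduled year-round, one cron of each pair always fires off-target; the workflow tolerates the extra run rather than computing DST in cron.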

scripts/daily_run.sh

Lines changed: 55 additions & 14 deletions
@@ -2,13 +2,15 @@
 # Daily trading scan — called by cron at 6:45 AM and 12:45 PM PT
 #
 # Full pipeline:
+# 0. Pre-flight health check: verify API connectivity, DB integrity, env vars
 # 1. Position manager: manage exits for all open positions first
 # 2. Screener auto-trade: find biggest dips market-wide, trade top 5
 # 3. Watchlist scan: 3-layer analysis on core watchlist, execute trades
 # 4. Momentum auto-trade: find and trade strongest momentum stocks
 # 5. Portfolio + performance summary
 # 6. Daily metrics snapshot
 # 7. Strategy agent event check
+# 8. Log cleanup: remove daily logs older than 30 days
 
 # Deliberately omit -e: we handle errors per-step via run_step() so the pipeline continues
 set -uo pipefail
@@ -20,6 +22,17 @@ FAILURES=0
 CB_FILE="$PROJECT_DIR/data/circuit_breaker.flag"
 PIPELINE_START=$(date +%s)
 
+# Per-step timeout defaults (seconds)
+TIMEOUT_PREFLIGHT=120     # 2 minutes
+TIMEOUT_POSITION_MGR=600  # 10 minutes
+TIMEOUT_SCREENER=600      # 10 minutes
+TIMEOUT_WATCHLIST=600     # 10 minutes
+TIMEOUT_MOMENTUM=600      # 10 minutes
+TIMEOUT_PORTFOLIO=120     # 2 minutes
+TIMEOUT_PERFORMANCE=120   # 2 minutes
+TIMEOUT_METRICS=60        # 1 minute
+TIMEOUT_AGENT=1800        # 30 minutes
+
 mkdir -p "$PROJECT_DIR/data"
 
 check_circuit_breaker() {
@@ -37,7 +50,8 @@ check_circuit_breaker() {
 
 run_step() {
     local label="$1"
-    shift
+    local step_timeout="$2"
+    shift 2
     local start_time=$(date +%s)
 
     # Check circuit breaker before each step
@@ -47,13 +61,18 @@ run_step() {
         return 1
     fi
 
-    echo "--- $label ---" >> "$LOG_FILE"
-    if "$@" >> "$LOG_FILE" 2>&1; then
+    echo "--- $label (timeout: ${step_timeout}s) ---" >> "$LOG_FILE"
+    if timeout "$step_timeout" "$@" >> "$LOG_FILE" 2>&1; then
         local elapsed=$(( $(date +%s) - start_time ))
         echo "TIMING: $label completed in ${elapsed}s" >> "$LOG_FILE"
     else
+        local exit_code=$?
         local elapsed=$(( $(date +%s) - start_time ))
-        echo "TIMING: $label FAILED after ${elapsed}s" >> "$LOG_FILE"
+        if [ "$exit_code" -eq 124 ]; then
+            echo "TIMING: $label TIMED OUT after ${step_timeout}s" >> "$LOG_FILE"
+        else
+            echo "TIMING: $label FAILED (exit $exit_code) after ${elapsed}s" >> "$LOG_FILE"
+        fi
         FAILURES=$((FAILURES + 1))
     fi
 }
@@ -68,30 +87,51 @@ if ! check_circuit_breaker; then
     exit 1
 fi
 
+# 0. Pre-flight health check — verify APIs, DB, env vars before touching any state
+run_step "PRE-FLIGHT HEALTH CHECK" "$TIMEOUT_PREFLIGHT" uv run python scripts/preflight.py
+if [ "$FAILURES" -gt 0 ]; then
+    echo "*** PRE-FLIGHT FAILED — aborting pipeline ***" >> "$LOG_FILE"
+
+    cd "$PROJECT_DIR"
+    uv run python -c "
+from scripts.notify import send_alert, AlertLevel
+send_alert('Pre-flight check FAILED — pipeline aborted',
+           'Required checks failed. Check log: $LOG_FILE',
+           level=AlertLevel.CRITICAL)
+" 2>> "$LOG_FILE" || echo "*** ALERT DELIVERY FAILED ***" >> "$LOG_FILE"
+
+    echo "=== End (aborted): $(date) ===" >> "$LOG_FILE"
+    exit 1
+fi
+
 # Clean up stale PDT state from previous run
 rm -f "$PROJECT_DIR/data/pdt_state.json"
 
 # 1. Position manager — manage exits before opening new positions
-run_step "POSITION MANAGER" uv run python scripts/position_manager.py --execute --check-sentiment
+run_step "POSITION MANAGER" "$TIMEOUT_POSITION_MGR" uv run python scripts/position_manager.py --execute --check-sentiment
 
 # 2. Screener auto-trade — find and trade the biggest dips
-run_step "SCREENER AUTO-TRADE" uv run python scripts/screener_trade.py --min-dip -7.0 --max-trades 8 --amount 3000 --execute
+run_step "SCREENER AUTO-TRADE" "$TIMEOUT_SCREENER" uv run python scripts/screener_trade.py --min-dip -7.0 --max-trades 8 --amount 3000 --execute
 
 # 3. Watchlist scan — 3-layer analysis, execute trades
-run_step "WATCHLIST SCAN" uv run python scripts/scan_with_sentiment.py --execute
+run_step "WATCHLIST SCAN" "$TIMEOUT_WATCHLIST" uv run python scripts/scan_with_sentiment.py --execute
 
 # 4. Momentum auto-trade — find and trade strongest momentum stocks
-run_step "MOMENTUM AUTO-TRADE" uv run python scripts/momentum_trade.py --min-momentum 10.0 --max-trades 5 --amount 3000 --execute
+run_step "MOMENTUM AUTO-TRADE" "$TIMEOUT_MOMENTUM" uv run python scripts/momentum_trade.py --min-momentum 10.0 --max-trades 5 --amount 3000 --execute
 
 # 5. Portfolio + performance
-run_step "PORTFOLIO" uv run python scripts/portfolio.py
-run_step "PERFORMANCE" uv run python scripts/performance.py
+run_step "PORTFOLIO" "$TIMEOUT_PORTFOLIO" uv run python scripts/portfolio.py
+run_step "PERFORMANCE" "$TIMEOUT_PERFORMANCE" uv run python scripts/performance.py
 
 # 6. Daily metrics snapshot
-run_step "DAILY METRICS" uv run python scripts/daily_metrics.py --api-errors "$FAILURES"
+run_step "DAILY METRICS" "$TIMEOUT_METRICS" uv run python scripts/daily_metrics.py --api-errors "$FAILURES"
 
 # 7. Strategy agent event check
-run_step "STRATEGY AGENT" uv run python scripts/strategy_agent.py --mode event-check
+run_step "STRATEGY AGENT" "$TIMEOUT_AGENT" uv run python scripts/strategy_agent.py --mode event-check
+
+# 8. Log cleanup — remove daily logs older than 30 days
+find "$PROJECT_DIR/data/" -name "daily_*.log" -mtime +30 -delete 2>/dev/null || true
+echo "LOG CLEANUP: removed logs older than 30 days" >> "$LOG_FILE"
 
 TOTAL_ELAPSED=$(( $(date +%s) - PIPELINE_START ))
 echo "TIMING: Total pipeline completed in ${TOTAL_ELAPSED}s" >> "$LOG_FILE"
@@ -103,9 +143,10 @@ if [ "$FAILURES" -gt 0 ]; then
     # Send failure alert
     cd "$PROJECT_DIR"
     uv run python -c "
-from scripts.notify import send_alert
+from scripts.notify import send_alert, AlertLevel
 send_alert('Pipeline Alert: $FAILURES step(s) failed',
-           'Check log: $LOG_FILE')
+           'Check log: $LOG_FILE',
+           level=AlertLevel.WARNING)
" 2>> "$LOG_FILE" || echo "*** ALERT DELIVERY FAILED — pipeline had $FAILURES failure(s) but notification could not be sent ***" >> "$LOG_FILE"
 fi
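run_step's TIMED OUT branch relies on GNU timeout's convention of exiting with status 124 when it kills the command. A standalone sketch of just that mechanism (run_with_timeout is a simplified stand-in for run_step, without the logging redirects or circuit-breaker check):

```shell
#!/usr/bin/env bash
# Distinguish "timed out" (exit 124 from GNU `timeout`) from an ordinary failure.
run_with_timeout() {
    local label="$1" step_timeout="$2"
    shift 2
    if timeout "$step_timeout" "$@"; then
        echo "TIMING: $label completed"
    else
        local exit_code=$?   # status of the timeout-wrapped command
        if [ "$exit_code" -eq 124 ]; then
            echo "TIMING: $label TIMED OUT after ${step_timeout}s"
        else
            echo "TIMING: $label FAILED (exit $exit_code)"
        fi
    fi
}

run_with_timeout "FAST" 5 true       # TIMING: FAST completed
run_with_timeout "SLOW" 1 sleep 10   # TIMING: SLOW TIMED OUT after 1s
run_with_timeout "BROKEN" 5 false    # TIMING: BROKEN FAILED (exit 1)
```

Note that `$?` must be captured as the first statement of the else branch, before any other command overwrites it.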

0 commit comments
