Commit 815e7cc (parent 71aac22), author github-actions:
docs: add high-level architecture README and technical FEATURES guide
(2 files changed: 298 additions & 97 deletions)

FEATURES.md (204 additions, 0 deletions):

# EchoLocate Technical Features

This document describes EchoLocate from an implementation perspective: what each feature does, how it works under the hood, and why it matters for real-time accessibility.

## 1. Runtime Architecture

EchoLocate is a browser-only application deployed as static files.

Core runtime files:

1. `index.html` — semantic UI structure, controls, accessibility landmarks
2. `style.css` — adaptive layout, dark/light theming, responsive behavior, confidence visuals
3. `app.js` — speech recognition, audio analysis, speaker lane logic, persistence, export
4. `sw.js` — local fragment rendering API for HTMX (`/api/add-card`, `/api/add-chat-msg`)

Design principles:

- No backend is required for standard operation.
- Transcript and analysis remain in-browser.

## 2. Caption Pipeline

Speech recognition path:

1. Browser mic capture via `getUserMedia`
2. `SpeechRecognition` / `webkitSpeechRecognition` receives the audio stream
3. Interim and final transcript events are produced
4. The final transcript is combined with speaker profile metadata
5. The card payload is posted to a local route intercepted by the Service Worker
6. The returned HTML fragment is inserted into the lane/chat containers

Reliability behavior:

- A watchdog timer restarts recognition when no results are received for a configured interval.
- Warm restart logic handles unexpected `onend` events while the app is still running.

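The watchdog behavior above can be sketched as follows. This is a minimal illustration, not EchoLocate's actual code: the interval value, `createWatchdog` name, and wiring are assumptions.

```javascript
// Illustrative watchdog: restart recognition if no results arrive within
// RESTART_MS. The value is an assumption, not EchoLocate's actual setting.
const RESTART_MS = 8000;

function createWatchdog(restartFn, intervalMs = RESTART_MS) {
  let timer = null;
  return {
    // Call on every result event to push the restart deadline forward.
    kick() {
      clearTimeout(timer);
      timer = setTimeout(restartFn, intervalMs);
    },
    // Cancel the pending restart (e.g. when the user stops captioning).
    stop() {
      clearTimeout(timer);
    },
  };
}

// Browser-only wiring sketch (SpeechRecognition is Chromium-prefixed):
// const rec = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
// const dog = createWatchdog(() => { rec.stop(); rec.start(); });
// rec.onresult = (e) => { dog.kick(); /* handle interim/final transcripts */ };
// rec.onend = () => rec.start(); // warm restart while the app is still running
```

The key property is that each incoming result resets the timer, so the restart only fires after a genuine silence in recognition events.
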
## 3. Audio Feature Extraction

The audio analysis path uses the Web Audio API plus Meyda.

Per-frame extracted features include:

1. MFCC (13 coefficients)
2. Spectral flatness
3. Spectral slope
4. Spectral centroid
5. Spectral rolloff
6. Zero crossing rate (ZCR)
7. RMS energy

Feature vectors are collected during utterance windows and aggregated to represent voice texture rather than a single pitch scalar.

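One plausible aggregation is mean pooling of the per-frame vectors collected during an utterance window. The function name and the flat frame layout (13 MFCCs followed by the scalar features) are illustrative assumptions; the exact layout in `app.js` may differ.

```javascript
// Illustrative aggregation: average per-frame feature vectors collected
// during an utterance window into one utterance-level "texture" vector.
function aggregateFrames(frames) {
  if (frames.length === 0) return null; // no audio captured in this window
  const dim = frames[0].length;
  const sum = new Array(dim).fill(0);
  for (const frame of frames) {
    for (let i = 0; i < dim; i++) sum[i] += frame[i];
  }
  return sum.map((s) => s / frames.length); // element-wise mean
}
```

Mean pooling keeps the vector dimensionality fixed regardless of utterance length, which is what makes per-lane comparison straightforward.
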
## 4. Voice Fingerprint + Lane Matching

### 4.1 Vector fingerprint

For each utterance, EchoLocate constructs a voice fingerprint vector from timbral and spectral features.

### 4.2 Similarity scoring

Each active lane profile is compared with the incoming fingerprint using cosine similarity:

$$
\text{similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}
$$

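The formula above maps directly to code. A minimal sketch (the guard for zero-magnitude vectors is an added assumption to avoid `NaN`):

```javascript
// Cosine similarity between two fingerprint vectors, matching the
// formula above: dot(A, B) / (|A| * |B|).
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  return denom === 0 ? 0 : dot / denom; // 0 for zero-magnitude input
}
```

The result is 1 for identical directions, 0 for orthogonal vectors, and is insensitive to overall loudness since magnitude cancels out.
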
### 4.3 Assignment behavior

1. If the similarity clears the threshold, assign the utterance to the best-matching lane
2. If not, create a new guest lane (up to the maximum speaker limit)
3. Lane profiles update incrementally with each new utterance

Why this matters:

- Timbral vectors are more robust than pitch-only matching when a speaker changes intonation.

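The assignment steps above can be sketched as follows. The threshold, lane limit, field names, and the running-mean profile update are all assumptions for illustration; the similarity function is injected.

```javascript
// Illustrative lane assignment. Threshold and lane cap are assumed values.
const SIMILARITY_THRESHOLD = 0.85;
const MAX_LANES = 6;

function assignLane(lanes, fingerprint, similarityFn) {
  // 1. Find the best-matching existing lane.
  let best = null, bestScore = -Infinity;
  for (const lane of lanes) {
    const score = similarityFn(lane.profile, fingerprint);
    if (score > bestScore) { bestScore = score; best = lane; }
  }
  if (best && bestScore >= SIMILARITY_THRESHOLD) {
    // 3. Incremental profile update: running mean over utterances.
    best.count += 1;
    best.profile = best.profile.map(
      (v, i) => v + (fingerprint[i] - v) / best.count
    );
    return best;
  }
  // 2. Below threshold: open a new guest lane if capacity allows.
  if (lanes.length < MAX_LANES) {
    const lane = { id: lanes.length + 1, profile: [...fingerprint], count: 1 };
    lanes.push(lane);
    return lane;
  }
  return best; // at capacity: fall back to the closest existing lane
}
```

The running mean lets each lane's profile drift toward the speaker's typical voice texture without storing every past utterance.
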
## 5. Anti-Flicker Stability

To reduce lane hopping:

1. Hysteresis lock: keep the recent lane preference for a minimum duration unless a meaningfully better candidate appears
2. Temporal smoothing: evaluate recent match history (median/majority over the last N decisions)

Result:

- Better sentence continuity and fewer rapid lane switches.

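The temporal-smoothing half of this can be sketched as a majority vote over a sliding window of recent lane decisions; the window size and function shape are assumptions.

```javascript
// Illustrative temporal smoothing: majority vote over the last N lane
// decisions, so a single outlier frame cannot flip the active lane.
const HISTORY_N = 5;

function smoothDecision(history, candidate, n = HISTORY_N) {
  history.push(candidate);
  if (history.length > n) history.shift(); // keep only the last N decisions
  const counts = new Map();
  for (const id of history) counts.set(id, (counts.get(id) || 0) + 1);
  let winner = candidate, max = 0;
  for (const [id, c] of counts) {
    if (c > max) { max = c; winner = id; } // ties resolve to earlier-seen lane
  }
  return winner;
}
```

Hysteresis would layer on top of this: the returned winner is only allowed to change after it has held the majority for a minimum duration.
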
## 6. Language Handling

Language features include:

1. A selectable recognition language list, including a `None (Auto)` mode
2. Optional text-based language detection fallback using `franc-min`
3. Visual feedback when the detected text language and the selected recognition language diverge

Purpose:

- Make multilingual conversation behavior visible and debuggable for users.

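The divergence check might look like the following. `SpeechRecognition.lang` takes BCP-47 tags (e.g. `en-US`) while `franc-min` returns ISO 639-3 codes (e.g. `eng`, or `und` for undetermined), so some mapping is needed; the small lookup table here is an assumed subset for illustration only.

```javascript
// Illustrative mismatch check between the selected recognition language
// (BCP-47) and a detected ISO 639-3 code as returned by franc-min.
// The mapping table is a small assumed subset, not EchoLocate's actual one.
const BCP47_TO_ISO3 = { en: "eng", es: "spa", fr: "fra", de: "deu" };

function languageMismatch(selectedBcp47, detectedIso3) {
  // Auto mode or an undetermined detection cannot diverge.
  if (!selectedBcp47 || detectedIso3 === "und") return false;
  const primary = selectedBcp47.toLowerCase().split("-")[0]; // "en-US" -> "en"
  const expected = BCP47_TO_ISO3[primary];
  return expected !== undefined && expected !== detectedIso3;
}
```

When this returns true, the UI can flag the divergence rather than silently producing low-quality transcripts in the wrong language.
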
## 7. Accessibility-Centered UI Features

### 7.1 Lane and chat views

- Lanes view: parallel speaker columns for at-a-glance differentiation
- Chat view: a single stream useful on mobile and constrained screens

### 7.2 Active lane indicator

- The current lane receives an energy ring / active-state styling to show where focus is landing.

### 7.3 Confidence visibility

- A confidence meter (0-100%) is attached to transcript cards/messages
- Low-confidence text is visually marked

### 7.4 Human-in-the-loop correction

- Merge controls allow users to merge two speaker lanes when automatic grouping splits one person into multiple lanes.

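A lane merge plausibly needs to combine both the cards and the voice profiles so future utterances match the merged lane; this sketch uses a count-weighted mean of the two profiles. Field names are assumptions.

```javascript
// Illustrative merge of two speaker lanes: combine profiles as a
// count-weighted mean so the larger lane dominates, then move the cards.
function mergeLanes(target, source) {
  const total = target.count + source.count;
  target.profile = target.profile.map(
    (v, i) => (v * target.count + source.profile[i] * source.count) / total
  );
  target.count = total;
  target.cards = target.cards.concat(source.cards);
  return target; // source lane can now be removed from the lane list
}
```
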
## 8. Service Worker Local Rendering API

`sw.js` behaves as a local fragment server:

1. `/api/add-card` (POST): returns a lane card fragment
2. `/api/add-chat-msg` (POST): returns a chat message fragment
3. `/api/clear` (POST): local clear acknowledgment

Security behavior:

- Inputs are escaped/sanitized before HTML output
- Attribute-safe escaping is applied to user-provided content

Operational advantages:

- No remote templating required
- HTMX interactions stay local
- Offline resilience with cache fallback for same-origin assets

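The escaping and fragment rendering might look like the following sketch; the fragment markup, class names, and payload fields are assumptions, not the actual `sw.js` templates.

```javascript
// Illustrative HTML/attribute-safe escaping applied before any
// user-provided content reaches a rendered fragment.
function escapeHtml(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Illustrative lane card fragment (markup shape is an assumption).
function renderCardFragment({ speaker, text, confidence }) {
  return (
    `<article class="card" data-speaker="${escapeHtml(speaker)}">` +
    `<p>${escapeHtml(text)}</p>` +
    `<span class="conf">${Number(confidence) || 0}%</span>` +
    `</article>`
  );
}

// In sw.js, a fetch handler would intercept the local route, e.g.:
// self.addEventListener("fetch", (e) => {
//   if (new URL(e.request.url).pathname === "/api/add-card") {
//     e.respondWith((async () => new Response(
//       renderCardFragment(await e.request.json()),
//       { headers: { "Content-Type": "text/html" } }
//     ))());
//   }
// });
```

Escaping both element content and attribute values means transcript text can never inject markup into the page, even though the "server" is a Service Worker on the same origin.
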
## 9. Persistence Model

Storage:

- Session cards are stored in `localStorage` (`echolocate_v1`)
- Startup restore rebuilds the lanes and chat view from stored cards
- The clear operation wipes stored conversation state

Tradeoff:

- Fast and private, but scoped to the browser/device profile.

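Save/restore/clear around the `echolocate_v1` key can be sketched as below. The card record shape is an assumption, and the storage object is injected so the same functions work with `window.localStorage` in the browser or a stub in tests.

```javascript
// Illustrative persistence helpers around the echolocate_v1 key.
const STORAGE_KEY = "echolocate_v1";

function saveSession(storage, cards) {
  storage.setItem(STORAGE_KEY, JSON.stringify(cards));
}

function restoreSession(storage) {
  const raw = storage.getItem(STORAGE_KEY);
  if (!raw) return [];
  try {
    return JSON.parse(raw);
  } catch {
    return []; // corrupted state: start clean rather than crash on load
  }
}

function clearSession(storage) {
  storage.removeItem(STORAGE_KEY);
}
```

Tolerating corrupted JSON on restore matters here: a failed parse at startup should degrade to an empty session, not block the captioning UI.
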
## 10. Export Model (VTT)

Exported transcript format:

- WebVTT with speaker metadata tags, e.g. `<v Speaker 1>...</v>`
- Time windows are normalized relative to the first utterance

Benefit:

- Better interoperability with subtitle tools and downstream review workflows.

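A minimal sketch of the export, assuming cue records with `startMs`, `endMs`, `speaker`, and `text` fields (the actual record shape in `app.js` is an assumption). Timestamps are shifted so the first utterance starts at `00:00:00.000`, matching the normalization described above.

```javascript
// Format milliseconds as a WebVTT timestamp: HH:MM:SS.mmm
function toTimestamp(ms) {
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const frac = ms % 1000;
  const pad = (n, w) => String(n).padStart(w, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(frac, 3)}`;
}

// Build a WebVTT document with <v> speaker voice spans, normalized
// relative to the first cue's start time.
function exportVtt(cues) {
  if (cues.length === 0) return "WEBVTT\n";
  const t0 = cues[0].startMs;
  const body = cues.map((c) =>
    `${toTimestamp(c.startMs - t0)} --> ${toTimestamp(c.endMs - t0)}\n` +
    `<v ${c.speaker}>${c.text}</v>`
  );
  return "WEBVTT\n\n" + body.join("\n\n") + "\n";
}
```
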
## 11. Privacy and Offline Operation

Privacy posture:

1. Audio remains local in the browser
2. Transcript content is not sent to external cloud APIs by default
3. Processing and rendering happen on device

Offline posture:

1. Key dependencies are vendored in `vendor/`
2. A local server (`server.py`) supports localhost operation
3. The Service Worker provides local route handling and cache fallback

## 12. Dependency and Platform Constraints

Required browser capabilities:

1. Web Speech API (best support: Chromium-based desktop browsers)
2. Web Audio API
3. Service Worker support

Known constraint:

- Browsers without Web Speech API support cannot provide live transcription in this architecture.

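A capability check mirroring the list above might look like this sketch; the environment object is injected (it would be `window` in the browser) so the logic is testable, and the function name is an assumption.

```javascript
// Illustrative capability detection for the three required APIs.
// Pass `window` in the browser; a stub object works in tests.
function checkCapabilities(env) {
  return {
    speech:
      "SpeechRecognition" in env || "webkitSpeechRecognition" in env,
    audio: "AudioContext" in env || "webkitAudioContext" in env,
    serviceWorker: !!(env.navigator && "serviceWorker" in env.navigator),
  };
}
```

An app would run this at startup and show a clear "live transcription unavailable in this browser" message instead of failing silently when `speech` is false.
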
## 13. Why This Stack Is Useful for Accessibility

EchoLocate is optimized for practical meeting use where reliability and transparency matter:

1. Voice texture matching improves speaker grouping stability
2. Watchdog recovery reduces silent transcript dropouts
3. Confidence and mismatch indicators expose uncertainty instead of hiding it
4. Merge controls let users correct AI mistakes quickly
5. Fully local execution supports privacy-sensitive environments

---

For implementation details, see:

- [README.md](README.md)
- [INSTALL.txt](INSTALL.txt)
- [AGENTS.md](AGENTS.md)
