
Commit 569f472

feat(api): add Groq Llama 3.3 70B as cross-provider fallback
- replaced Google Flash/Flash-Lite fallbacks with Groq (independent quota, 14,400 RPD)
- moved JSON validation inside the retry loop so bad responses trigger the next provider
- per-provider timeouts [35s, 20s] fit within Vercel's 60s maxDuration
- removed all Cerebras references from docs and .env.example
- updated about page, README, and all Starlight docs with new provider chain
- added Google/Groq rate limit doc links where RPM/RPD/TPM tables appear
- added scripts/test-providers.mjs for dry-run provider testing
1 parent 89a6d62 commit 569f472

File tree

10 files changed: +234 −56 lines


.env.example

Lines changed: 2 additions & 2 deletions

```diff
@@ -2,9 +2,9 @@
 # Get yours at https://aistudio.google.com/apikey
 GEMINI_API_KEY=
 
-# Optional fallback providers (for self-hosting)
+# Groq API key (recommended fallback, Llama 3.3 70B)
+# Get yours at https://console.groq.com/keys
 # GROQ_API_KEY=
-# CEREBRAS_API_KEY=
 
 # Firebase configuration (required)
 # Get these from Firebase Console → Project Settings → Your Apps
```

README.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -81,7 +81,7 @@ Each profile is based on research into the platform's documented parsing and mat
 | **PDF Parsing** | pdfjs-dist (Web Worker) | Mozilla-maintained, fully client-side. |
 | **DOCX Parsing** | mammoth | Client-side Word to text extraction. |
 | **NLP** | Custom TF-IDF + tokenizer + skills taxonomy | Lightweight, browser-native, supports 8+ industries. |
-| **LLM** | Gemma 3 27B (primary), Gemini 2.5 Flash (fallback) | 14,400 RPD free tier via Google Generative Language API. Groq + Cerebras available for self-host. |
+| **LLM** | Gemma 3 27B (primary), Llama 3.3 70B via Groq (fallback) | Cross-provider fallback: Google (14,400 RPD) + Groq (14,400 RPD) on independent free tiers. |
 | **Auth** | Firebase Authentication | Google + email/password sign-in. Free Spark plan. |
 | **Storage** | Cloud Firestore | Scan history per user. Free Spark plan. |
 | **Hosting** | Vercel | Free hobby tier. Edge functions for API. |
```

docs/src/content/docs/api/rate-limits.md

Lines changed: 10 additions & 8 deletions

```diff
@@ -50,14 +50,16 @@ When you receive a `429` response:
 
 When self-hosting, rate limits are configurable. The actual bottleneck becomes your LLM provider's free tier:
 
-| Provider | Model | RPM | RPD |
-| -------- | --------------- | --- | ------ |
-| Gemma | 3 27B (primary) | 30 | 14,400 |
-| Gemini | 2.5 Flash | 5 | 20 |
-| Gemini | 2.5 Flash Lite | 10 | 20 |
-| Groq | Llama 3.3 70B | 30 | 14,400 |
-| Cerebras | Llama 3.3 70B | 30 | 1,000 |
+| Provider | Model | RPM | RPD | TPM |
+| -------- | ------------- | ---- | ------ | --- |
+| Google | Gemma 3 27B | 30 | 14,400 | 15K |
+| Groq | Llama 3.3 70B | 1000 | 14,400 | 12K |
+
+For the latest limits, see the official documentation:
+
+- [Google AI rate limits](https://ai.google.dev/gemini-api/docs/rate-limits)
+- [Groq rate limits](https://console.groq.com/docs/rate-limits)
 
 :::tip
-The hosted version uses Gemma 3 27B as the primary model (14,400 RPD), giving roughly 14,000+ scans per day on the free tier. Groq and Cerebras are available as optional fallbacks for self-hosted instances.
+The hosted version uses Gemma 3 27B as the primary model with Llama 3.3 70B via Groq as fallback. Both run on independent free tiers with 14,400 RPD each, giving roughly 28,000+ potential scans per day.
 :::
```

docs/src/content/docs/getting-started/introduction.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -56,7 +56,7 @@ Built with performance and privacy in mind:
 - **SvelteKit 5** with Svelte 5 runes for the frontend
 - **pdfjs-dist** (Web Worker) for client-side PDF parsing
 - **mammoth** for client-side DOCX parsing
-- **Gemma 3 27B** (primary) with Gemini fallbacks for AI-powered analysis
+- **Gemma 3 27B** (primary) with **Llama 3.3 70B** via Groq as fallback for AI-powered analysis
 - **Firebase** for authentication and scan history
 - **Vercel** for hosting (free tier)
 
```
docs/src/content/docs/self-hosting/configuration.md

Lines changed: 22 additions & 23 deletions

````diff
@@ -7,39 +7,37 @@ description: Environment variables and configuration options for self-hosted ins
 
 All configuration is done through environment variables in the `.env` file.
 
-| Variable | Required | Description |
-| ------------------ | -------- | ---------------------------------------------------- |
-| `GEMINI_API_KEY` | Yes | Google AI API key (used for Gemma 3 + Gemini models) |
-| `GROQ_API_KEY` | Optional | Groq API key (optional fallback) |
-| `CEREBRAS_API_KEY` | Optional | Cerebras API key (optional fallback) |
+| Variable | Required | Description |
+| ---------------- | ----------- | ---------------------------------------------- |
+| `GEMINI_API_KEY` | Yes | Google AI API key (powers Gemma 3 27B primary) |
+| `GROQ_API_KEY` | Recommended | Groq API key (Llama 3.3 70B fallback) |
 
 :::caution
 Never commit your `.env` file to version control. It's already in `.gitignore`, but double-check before pushing.
 :::
 
 ## Provider Priority
 
-The LLM fallback chain follows this order:
+The LLM fallback chain uses cross-provider redundancy so quota limits on one provider don't cascade:
 
-1. **Gemma 3 27B** (primary, 14,400 RPD via `GEMINI_API_KEY`)
-2. **Gemini 2.5 Flash** (fallback, 20 RPD via `GEMINI_API_KEY`)
-3. **Gemini 2.5 Flash Lite** (fallback, 20 RPD via `GEMINI_API_KEY`)
-4. **Groq Llama 3.3 70B** (if `GROQ_API_KEY` is set)
-5. **Cerebras Llama 3.3 70B** (if `CEREBRAS_API_KEY` is set)
+1. **Gemma 3 27B** via Google (primary, `GEMINI_API_KEY`)
+2. **Llama 3.3 70B** via Groq (fallback, `GROQ_API_KEY`)
 
-If a provider fails (timeout, rate limit, malformed response), the system automatically tries the next one. All Google models (Gemma + Gemini) use the same API key.
+If a provider fails (timeout, rate limit, malformed response), the system automatically tries the next one. Because each provider uses a separate API key, their quotas are completely independent.
 
 ## Free Tier Limits
 
-| Provider | Model | RPM | RPD | Cost |
-| -------- | --------------- | --- | ------ | ---------------------- |
-| Gemma | 3 27B (primary) | 30 | 14,400 | Free (blocks at limit) |
-| Gemini | 2.5 Flash | 5 | 20 | Free (blocks at limit) |
-| Gemini | 2.5 Flash Lite | 10 | 20 | Free (blocks at limit) |
-| Groq | Llama 3.3 70B | 30 | 14,400 | Free |
-| Cerebras | Llama 3.3 70B | 30 | 1,000 | Free |
+| Provider | Model | RPM | RPD | TPM | Cost |
+| -------- | ------------- | ---- | ------ | --- | ---- |
+| Google | Gemma 3 27B | 30 | 14,400 | 15K | Free |
+| Groq | Llama 3.3 70B | 1000 | 14,400 | 12K | Free |
 
-**Key detail about Google AI:** The free tier will **block** requests at the limit, never auto-charge. You cannot accidentally incur costs.
+Both providers block at their limits and never auto-charge. You cannot accidentally incur costs.
+
+For the latest limits, see the official documentation:
+
+- [Google AI rate limits](https://ai.google.dev/gemini-api/docs/rate-limits)
+- [Groq rate limits](https://console.groq.com/docs/rate-limits)
 
 ## Rate Limiting
 
@@ -56,10 +54,11 @@ Adjust these values based on your expected traffic and API key limits.
 
 ## Timeouts
 
-The default timeout for LLM requests is 60 seconds:
+Each provider has its own timeout to ensure the total worst-case fits within Vercel's 60s function limit:
 
 ```typescript
-const PROVIDER_TIMEOUT_MS = 60_000;
+// Gemma: 35s, Groq: 20s → worst case total: 55s
+const PROVIDER_TIMEOUTS_MS = [35_000, 20_000];
 ```
 
-Increase this if you're experiencing timeouts with longer resumes.
+The Vercel function `maxDuration` is set to 60 seconds. If both providers time out, the system falls back to rule-based scoring on the client side.
````

docs/src/content/docs/self-hosting/deployment.md

Lines changed: 1 addition & 2 deletions

```diff
@@ -24,8 +24,7 @@ In the Vercel dashboard:
 1. Go to your project > **Settings** > **Environment Variables**
 2. Add your API keys:
    - `GEMINI_API_KEY` (required)
-   - `GROQ_API_KEY` (optional fallback)
-   - `CEREBRAS_API_KEY` (optional fallback)
+   - `GROQ_API_KEY` (recommended fallback)
 3. Add your Firebase config (all `PUBLIC_FIREBASE_*` variables from `.env.example`)
 
 :::tip
```

docs/src/content/docs/self-hosting/setup.md

Lines changed: 2 additions & 8 deletions

```diff
@@ -33,20 +33,14 @@ cp .env.example .env
 2. Click "Create API Key"
 3. Add to `.env`: `GEMINI_API_KEY=your_key_here`
 
-### Groq (Optional Fallback)
+### Groq (Recommended Fallback)
 
 1. Go to [Groq Console](https://console.groq.com/keys)
 2. Create a new API key
 3. Add to `.env`: `GROQ_API_KEY=your_key_here`
 
-### Cerebras (Optional Fallback)
-
-1. Go to [Cerebras Cloud](https://cloud.cerebras.ai/)
-2. Generate an API key
-3. Add to `.env`: `CEREBRAS_API_KEY=your_key_here`
-
 :::tip
-You only need the **Google AI (Gemini) API key** to run the app. It powers Gemma 3 27B (14,400 RPD) as the primary model plus Gemini models as fallbacks. Groq and Cerebras are optional for additional availability.
+You need the **Google AI API key** to run the app (Gemma 3 27B primary, 14,400 RPD). Adding a **Groq API key** is strongly recommended as it provides a completely independent fallback (Llama 3.3 70B, 14,400 RPD) so users never see failures during peak traffic.
 :::
 
 ## Run Locally
```

scripts/test-providers.mjs

Lines changed: 155 additions & 0 deletions (new file)

````js
/**
 * dry run: tests each LLM provider matching the fallback chain in +server.ts
 * reads keys from .env, never logs them.
 *
 * usage: node scripts/test-providers.mjs
 */

import { readFileSync } from 'fs';

const envFile = readFileSync('.env', 'utf-8');
const envVars = Object.fromEntries(
  envFile
    .split('\n')
    .filter((l) => l && !l.startsWith('#'))
    .map((l) => {
      const eq = l.indexOf('=');
      return eq > 0 ? [l.slice(0, eq).trim(), l.slice(eq + 1).trim()] : null;
    })
    .filter(Boolean)
);

const GEMINI_KEY = envVars.GEMINI_API_KEY;
const GROQ_KEY = envVars.GROQ_API_KEY;

function extractJSON(raw) {
  const trimmed = raw.trim();
  try { return JSON.parse(trimmed); } catch {}
  const cleaned = trimmed.replace(/```json\n?|\n?```/g, '').trim();
  try { return JSON.parse(cleaned); } catch {}
  const s = cleaned.indexOf('{'), e = cleaned.lastIndexOf('}');
  if (s !== -1 && e > s) { try { return JSON.parse(cleaned.slice(s, e + 1)); } catch {} }
  return null;
}

const SMALL_PROMPT = 'Return ONLY valid JSON: {"test": true, "score": 85}';

// ~6K token resume prompt matching real usage
const BIG_RESUME = (
  'Experienced software engineer with expertise in distributed systems, cloud computing, and full-stack development. ' +
  'Built scalable microservices handling 10M+ requests per day using Go, Kubernetes, and AWS. Led team of 5 engineers. '
).repeat(60);
const BIG_PROMPT = `You are an ATS scoring engine. Analyze this resume against 6 ATS platforms (Workday, Taleo, iCIMS, Greenhouse, Lever, SuccessFactors). Return ONLY valid JSON with a "results" array containing objects with "system", "overallScore", and "passesFilter" fields. Resume: ${BIG_RESUME}`;

const PROVIDERS = [
  {
    name: 'gemma-3-27b (Google)',
    key: GEMINI_KEY,
    build: (prompt) => ({
      url: `https://generativelanguage.googleapis.com/v1beta/models/gemma-3-27b-it:generateContent?key=${GEMINI_KEY}`,
      opts: {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          contents: [{ parts: [{ text: prompt }] }],
          generationConfig: { temperature: 0.3, topP: 0.85, maxOutputTokens: 16384 }
        })
      }
    }),
    extract: (d) => d.candidates?.[0]?.content?.parts?.[0]?.text ?? ''
  },
  {
    name: 'llama-3.3-70b (Groq)',
    key: GROQ_KEY,
    build: (prompt) => ({
      url: 'https://api.groq.com/openai/v1/chat/completions',
      opts: {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${GROQ_KEY}` },
        body: JSON.stringify({
          model: 'llama-3.3-70b-versatile',
          messages: [{ role: 'user', content: prompt }],
          temperature: 0.3, top_p: 0.85, max_tokens: 16384,
          response_format: { type: 'json_object' }
        })
      }
    }),
    extract: (d) => d.choices?.[0]?.message?.content ?? ''
  },
  {
    name: 'gemini-2.5-flash (Google)',
    key: GEMINI_KEY,
    build: (prompt) => ({
      url: `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=${GEMINI_KEY}`,
      opts: {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          contents: [{ parts: [{ text: prompt }] }],
          generationConfig: { temperature: 0.3, topP: 0.85, maxOutputTokens: 16384, responseMimeType: 'application/json' }
        })
      }
    }),
    extract: (d) => d.candidates?.[0]?.content?.parts?.[0]?.text ?? ''
  }
];

async function callProvider(provider, prompt, timeoutMs = 30000) {
  if (!provider.key) return { status: 'SKIP', ms: 0, detail: 'no key' };

  const { url, opts } = provider.build(prompt);
  const t = performance.now();
  try {
    const ctrl = new AbortController();
    const timer = setTimeout(() => ctrl.abort(), timeoutMs);
    const res = await fetch(url, { ...opts, signal: ctrl.signal });
    clearTimeout(timer);
    const ms = Math.round(performance.now() - t);

    if (!res.ok) {
      const err = await res.text().catch(() => '');
      return { status: 'HTTP_ERR', ms, httpStatus: res.status, detail: err.slice(0, 150) };
    }

    const data = await res.json();
    const text = provider.extract(data);
    if (!text) return { status: 'EMPTY', ms };

    const parsed = extractJSON(text);
    if (!parsed || typeof parsed !== 'object') return { status: 'BAD_JSON', ms, detail: text.slice(0, 150) };

    return { status: 'OK', ms, keys: Object.keys(parsed).slice(0, 5) };
  } catch (err) {
    const ms = Math.round(performance.now() - t);
    const isTimeout = err.name === 'AbortError';
    return { status: isTimeout ? 'TIMEOUT' : 'ERROR', ms, detail: err.message };
  }
}

function log(name, r) {
  const tag = r.status === 'OK' ? 'OK' : r.status === 'SKIP' ? 'SKIP' : 'FAIL';
  const info = r.status === 'OK' ? `keys: [${r.keys}]` : (r.detail || r.httpStatus || '');
  console.log(`  ${tag.padEnd(4)} ${name.padEnd(28)} ${String(r.ms).padStart(5)}ms ${info}`);
}

console.log('=== test 1: small prompt (connectivity) ===\n');
for (const p of PROVIDERS) log(p.name, await callProvider(p, SMALL_PROMPT));

console.log('\n=== test 2: large prompt (~6K tokens, realistic resume) ===\n');
console.log(`  prompt size: ${BIG_PROMPT.length} chars (~${Math.round(BIG_PROMPT.length / 4)} tokens)\n`);
for (const p of PROVIDERS) log(p.name, await callProvider(p, BIG_PROMPT, 45000));

console.log('\n=== test 3: fallback chain simulation ===\n');
let resolved = false;
for (const p of PROVIDERS) {
  const r = await callProvider(p, BIG_PROMPT, 45000);
  if (r.status === 'OK') {
    console.log(`  resolved: ${p.name} (${r.ms}ms)`);
    resolved = true;
    break;
  }
  console.log(`  ${p.name}: ${r.status} (${r.ms}ms) → next`);
}
if (!resolved) console.log('  ALL FAILED → 503');

console.log('\n=== done ===');
````

src/routes/about/+page.svelte

Lines changed: 2 additions & 1 deletion

```diff
@@ -219,7 +219,8 @@
 <div class="tech-card">
 <h4>AI</h4>
 <ul>
-<li>Google Gemini 2.5 Flash-Lite (primary)</li>
+<li>Gemma 3 27B via Google (primary)</li>
+<li>Llama 3.3 70B via Groq (fallback)</li>
 <li>Rule-based fallback engine</li>
 <li>TF-IDF keyword matching</li>
 <li>Skills taxonomy (500+ terms)</li>
```
