
Commit bebe395 (1 parent: 89a6d62)

feat(api): added Groq Llama 3.3 70B as cross-provider fallback

- replaced Google Flash/Flash-Lite fallbacks with Groq (independent quota, 14,400 RPD)
- moved JSON validation inside retry loop so bad responses trigger next provider
- per-provider timeouts [35s, 20s] fit within Vercel's 60s maxDuration
- removed all Cerebras references from docs and .env.example
- updated about page, README, and all Starlight docs with new provider chain
- added Google/Groq rate limit doc links where RPM/RPD/TPM tables appear
- added scripts/test-providers.mjs for dry-run provider testing

11 files changed (+236, −57 lines)

.env.example — 2 additions, 2 deletions

```diff
@@ -2,9 +2,9 @@
 # Get yours at https://aistudio.google.com/apikey
 GEMINI_API_KEY=

-# Optional fallback providers (for self-hosting)
+# Groq API key (recommended fallback, Llama 3.3 70B)
+# Get yours at https://console.groq.com/keys
 # GROQ_API_KEY=
-# CEREBRAS_API_KEY=

 # Firebase configuration (required)
 # Get these from Firebase Console → Project Settings → Your Apps
```

README.md — 1 addition, 1 deletion

```diff
@@ -81,7 +81,7 @@ Each profile is based on research into the platform's documented parsing and mat
 | **PDF Parsing** | pdfjs-dist (Web Worker) | Mozilla-maintained, fully client-side. |
 | **DOCX Parsing** | mammoth | Client-side Word to text extraction. |
 | **NLP** | Custom TF-IDF + tokenizer + skills taxonomy | Lightweight, browser-native, supports 8+ industries. |
-| **LLM** | Gemma 3 27B (primary), Gemini 2.5 Flash (fallback) | 14,400 RPD free tier via Google Generative Language API. Groq + Cerebras available for self-host. |
+| **LLM** | Gemma 3 27B (primary), Llama 3.3 70B via Groq (fallback) | Cross-provider fallback: Google (14,400 RPD) + Groq (14,400 RPD) on independent free tiers. |
 | **Auth** | Firebase Authentication | Google + email/password sign-in. Free Spark plan. |
 | **Storage** | Cloud Firestore | Scan history per user. Free Spark plan. |
 | **Hosting** | Vercel | Free hobby tier. Edge functions for API. |
```

docs/src/content/docs/api/rate-limits.md — 10 additions, 8 deletions

```diff
@@ -50,14 +50,16 @@ When you receive a `429` response:

 When self-hosting, rate limits are configurable. The actual bottleneck becomes your LLM provider's free tier:

-| Provider | Model           | RPM | RPD    |
-| -------- | --------------- | --- | ------ |
-| Gemma    | 3 27B (primary) | 30  | 14,400 |
-| Gemini   | 2.5 Flash       | 5   | 20     |
-| Gemini   | 2.5 Flash Lite  | 10  | 20     |
-| Groq     | Llama 3.3 70B   | 30  | 14,400 |
-| Cerebras | Llama 3.3 70B   | 30  | 1,000  |
+| Provider | Model         | RPM  | RPD    | TPM |
+| -------- | ------------- | ---- | ------ | --- |
+| Google   | Gemma 3 27B   | 30   | 14,400 | 15K |
+| Groq     | Llama 3.3 70B | 1000 | 14,400 | 12K |
+
+For the latest limits, see the official documentation:
+
+- [Google AI rate limits](https://ai.google.dev/gemini-api/docs/rate-limits)
+- [Groq rate limits](https://console.groq.com/docs/rate-limits)

 :::tip
-The hosted version uses Gemma 3 27B as the primary model (14,400 RPD), giving roughly 14,000+ scans per day on the free tier. Groq and Cerebras are available as optional fallbacks for self-hosted instances.
+The hosted version uses Gemma 3 27B as the primary model with Llama 3.3 70B via Groq as fallback. Both run on independent free tiers. The binding constraint is TPM (tokens per minute), not RPD. Each scan uses ~8,000 tokens total (prompt + response), giving a realistic combined throughput of roughly 4,500 scans per day under sustained load.
 :::
```
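
The "roughly 4,500 scans per day" figure in the new tip can be sanity-checked with a quick estimate. This is a sketch of my own, not project code: the TPM/RPD numbers come from the table above and the ~8,000-tokens-per-scan figure from the tip; the binding limit per provider is whichever of RPD or sustained TPM runs out first.

```javascript
// Back-of-the-envelope daily throughput under sustained load.
const TOKENS_PER_SCAN = 8_000; // prompt + response, per the tip above

const providers = [
  { name: 'Google (Gemma 3 27B)', rpd: 14_400, tpm: 15_000 },
  { name: 'Groq (Llama 3.3 70B)', rpd: 14_400, tpm: 12_000 }
];

// Per provider: TPM sustained over 1,440 minutes, capped by RPD.
const scansPerDay = (p) =>
  Math.min(p.rpd, Math.floor((p.tpm / TOKENS_PER_SCAN) * 1_440));

const total = providers.reduce((sum, p) => sum + scansPerDay(p), 0);
console.log(total); // 2,700 + 2,160 = 4860 — TPM binds long before RPD
```

So the combined ceiling is about 4,860 scans/day, which the docs round down to "roughly 4,500" to leave headroom.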

docs/src/content/docs/getting-started/introduction.md — 1 addition, 1 deletion

```diff
@@ -56,7 +56,7 @@ Built with performance and privacy in mind:
 - **SvelteKit 5** with Svelte 5 runes for the frontend
 - **pdfjs-dist** (Web Worker) for client-side PDF parsing
 - **mammoth** for client-side DOCX parsing
-- **Gemma 3 27B** (primary) with Gemini fallbacks for AI-powered analysis
+- **Gemma 3 27B** (primary) with **Llama 3.3 70B** via Groq as fallback for AI-powered analysis
 - **Firebase** for authentication and scan history
 - **Vercel** for hosting (free tier)
```

docs/src/content/docs/self-hosting/configuration.md — 22 additions, 23 deletions

````diff
@@ -7,39 +7,37 @@ description: Environment variables and configuration options for self-hosted ins

 All configuration is done through environment variables in the `.env` file.

-| Variable           | Required | Description                                          |
-| ------------------ | -------- | ---------------------------------------------------- |
-| `GEMINI_API_KEY`   | Yes      | Google AI API key (used for Gemma 3 + Gemini models) |
-| `GROQ_API_KEY`     | Optional | Groq API key (optional fallback)                     |
-| `CEREBRAS_API_KEY` | Optional | Cerebras API key (optional fallback)                 |
+| Variable         | Required    | Description                                    |
+| ---------------- | ----------- | ---------------------------------------------- |
+| `GEMINI_API_KEY` | Yes         | Google AI API key (powers Gemma 3 27B primary) |
+| `GROQ_API_KEY`   | Recommended | Groq API key (Llama 3.3 70B fallback)          |

 :::caution
 Never commit your `.env` file to version control. It's already in `.gitignore`, but double-check before pushing.
 :::

 ## Provider Priority

-The LLM fallback chain follows this order:
+The LLM fallback chain uses cross-provider redundancy so quota limits on one provider don't cascade:

-1. **Gemma 3 27B** (primary, 14,400 RPD via `GEMINI_API_KEY`)
-2. **Gemini 2.5 Flash** (fallback, 20 RPD via `GEMINI_API_KEY`)
-3. **Gemini 2.5 Flash Lite** (fallback, 20 RPD via `GEMINI_API_KEY`)
-4. **Groq Llama 3.3 70B** (if `GROQ_API_KEY` is set)
-5. **Cerebras Llama 3.3 70B** (if `CEREBRAS_API_KEY` is set)
+1. **Gemma 3 27B** via Google (primary, `GEMINI_API_KEY`)
+2. **Llama 3.3 70B** via Groq (fallback, `GROQ_API_KEY`)

-If a provider fails (timeout, rate limit, malformed response), the system automatically tries the next one. All Google models (Gemma + Gemini) use the same API key.
+If a provider fails (timeout, rate limit, malformed response), the system automatically tries the next one. Because each provider uses a separate API key, their quotas are completely independent.

 ## Free Tier Limits

-| Provider | Model           | RPM | RPD    | Cost                   |
-| -------- | --------------- | --- | ------ | ---------------------- |
-| Gemma    | 3 27B (primary) | 30  | 14,400 | Free (blocks at limit) |
-| Gemini   | 2.5 Flash       | 5   | 20     | Free (blocks at limit) |
-| Gemini   | 2.5 Flash Lite  | 10  | 20     | Free (blocks at limit) |
-| Groq     | Llama 3.3 70B   | 30  | 14,400 | Free                   |
-| Cerebras | Llama 3.3 70B   | 30  | 1,000  | Free                   |
+| Provider | Model         | RPM  | RPD    | TPM | Cost |
+| -------- | ------------- | ---- | ------ | --- | ---- |
+| Google   | Gemma 3 27B   | 30   | 14,400 | 15K | Free |
+| Groq     | Llama 3.3 70B | 1000 | 14,400 | 12K | Free |

-**Key detail about Google AI:** The free tier will **block** requests at the limit, never auto-charge. You cannot accidentally incur costs.
+Both providers block at their limits and never auto-charge. You cannot accidentally incur costs.
+
+For the latest limits, see the official documentation:
+
+- [Google AI rate limits](https://ai.google.dev/gemini-api/docs/rate-limits)
+- [Groq rate limits](https://console.groq.com/docs/rate-limits)

 ## Rate Limiting

@@ -56,10 +54,11 @@ Adjust these values based on your expected traffic and API key limits.

 ## Timeouts

-The default timeout for LLM requests is 60 seconds:
+Each provider has its own timeout to ensure the total worst-case fits within Vercel's 60s function limit:

 ```typescript
-const PROVIDER_TIMEOUT_MS = 60_000;
+// Gemma: 35s, Groq: 20s → worst case total: 55s
+const PROVIDER_TIMEOUTS_MS = [35_000, 20_000];
 ```

-Increase this if you're experiencing timeouts with longer resumes.
+The Vercel function `maxDuration` is set to 60 seconds. If both providers timeout, the system falls back to rule-based scoring on the client side.
````
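
The commit message highlights that JSON validation now lives *inside* the retry loop, so a syntactically broken response advances the chain just like a timeout or a 429 does. A minimal sketch of that pattern under stated assumptions: the provider calls are injected as plain async functions, and the names (`runFallbackChain`, the provider signatures) are illustrative, not the actual `+server.ts` implementation.

```javascript
// Sketch: try providers in order with per-provider timeouts. A response that
// fails JSON validation is treated as a failure and triggers the next provider.
const PROVIDER_TIMEOUTS_MS = [35_000, 20_000]; // Gemma: 35s, Groq: 20s → 55s worst case

async function runFallbackChain(providers, prompt) {
  for (let i = 0; i < providers.length; i++) {
    const ctrl = new AbortController();
    const timer = setTimeout(() => ctrl.abort(), PROVIDER_TIMEOUTS_MS[i]);
    try {
      const raw = await providers[i](prompt, ctrl.signal);
      const parsed = JSON.parse(raw); // validation INSIDE the loop: throws on bad JSON
      if (parsed && typeof parsed === 'object') return parsed;
      // non-object JSON (e.g. a bare string) falls through to the next provider
    } catch {
      // timeout, rate limit, network error, or malformed JSON → try next provider
    } finally {
      clearTimeout(timer);
    }
  }
  return null; // all providers failed; caller falls back to rule-based scoring
}
```

Keeping validation inside the loop is what makes the Groq fallback useful for quality failures, not just availability failures.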

docs/src/content/docs/self-hosting/deployment.md — 1 addition, 2 deletions

```diff
@@ -24,8 +24,7 @@ In the Vercel dashboard:
 1. Go to your project > **Settings** > **Environment Variables**
 2. Add your API keys:
    - `GEMINI_API_KEY` (required)
-   - `GROQ_API_KEY` (optional fallback)
-   - `CEREBRAS_API_KEY` (optional fallback)
+   - `GROQ_API_KEY` (recommended fallback)
 3. Add your Firebase config (all `PUBLIC_FIREBASE_*` variables from `.env.example`)

 :::tip
```

docs/src/content/docs/self-hosting/setup.md — 2 additions, 8 deletions

```diff
@@ -33,20 +33,14 @@ cp .env.example .env
 2. Click "Create API Key"
 3. Add to `.env`: `GEMINI_API_KEY=your_key_here`

-### Groq (Optional Fallback)
+### Groq (Recommended Fallback)

 1. Go to [Groq Console](https://console.groq.com/keys)
 2. Create a new API key
 3. Add to `.env`: `GROQ_API_KEY=your_key_here`

-### Cerebras (Optional Fallback)
-
-1. Go to [Cerebras Cloud](https://cloud.cerebras.ai/)
-2. Generate an API key
-3. Add to `.env`: `CEREBRAS_API_KEY=your_key_here`
-
 :::tip
-You only need the **Google AI (Gemini) API key** to run the app. It powers Gemma 3 27B (14,400 RPD) as the primary model plus Gemini models as fallbacks. Groq and Cerebras are optional for additional availability.
+You need the **Google AI API key** to run the app (Gemma 3 27B primary, 14,400 RPD). Adding a **Groq API key** is strongly recommended as it provides a completely independent fallback (Llama 3.3 70B, 14,400 RPD) so users never see failures during peak traffic.
 :::

 ## Run Locally
```

eslint.config.js — 2 additions, 1 deletion

```diff
@@ -41,7 +41,8 @@ export default ts.config(
       'playwright-report/',
       'test-results/',
       'docs/',
-      'static/docs/'
+      'static/docs/',
+      'scripts/'
     ]
   }
 );
```

scripts/test-providers.mjs — new file, 155 additions

````js
/**
 * dry run: tests each LLM provider matching the fallback chain in +server.ts
 * reads keys from .env, never logs them.
 *
 * usage: node scripts/test-providers.mjs
 */

import { readFileSync } from 'fs';

const envFile = readFileSync('.env', 'utf-8');
const envVars = Object.fromEntries(
  envFile
    .split('\n')
    .filter((l) => l && !l.startsWith('#'))
    .map((l) => {
      const eq = l.indexOf('=');
      return eq > 0 ? [l.slice(0, eq).trim(), l.slice(eq + 1).trim()] : null;
    })
    .filter(Boolean)
);

const GEMINI_KEY = envVars.GEMINI_API_KEY;
const GROQ_KEY = envVars.GROQ_API_KEY;

function extractJSON(raw) {
  const trimmed = raw.trim();
  try { return JSON.parse(trimmed); } catch {}
  const cleaned = trimmed.replace(/```json\n?|\n?```/g, '').trim();
  try { return JSON.parse(cleaned); } catch {}
  const s = cleaned.indexOf('{'), e = cleaned.lastIndexOf('}');
  if (s !== -1 && e > s) { try { return JSON.parse(cleaned.slice(s, e + 1)); } catch {} }
  return null;
}

const SMALL_PROMPT = 'Return ONLY valid JSON: {"test": true, "score": 85}';

// ~6K token resume prompt matching real usage
const BIG_RESUME = (
  'Experienced software engineer with expertise in distributed systems, cloud computing, and full-stack development. ' +
  'Built scalable microservices handling 10M+ requests per day using Go, Kubernetes, and AWS. Led team of 5 engineers. '
).repeat(60);
const BIG_PROMPT = `You are an ATS scoring engine. Analyze this resume against 6 ATS platforms (Workday, Taleo, iCIMS, Greenhouse, Lever, SuccessFactors). Return ONLY valid JSON with a "results" array containing objects with "system", "overallScore", and "passesFilter" fields. Resume: ${BIG_RESUME}`;

const PROVIDERS = [
  {
    name: 'gemma-3-27b (Google)',
    key: GEMINI_KEY,
    build: (prompt) => ({
      url: `https://generativelanguage.googleapis.com/v1beta/models/gemma-3-27b-it:generateContent?key=${GEMINI_KEY}`,
      opts: {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          contents: [{ parts: [{ text: prompt }] }],
          generationConfig: { temperature: 0.3, topP: 0.85, maxOutputTokens: 16384 }
        })
      }
    }),
    extract: (d) => d.candidates?.[0]?.content?.parts?.[0]?.text ?? ''
  },
  {
    name: 'llama-3.3-70b (Groq)',
    key: GROQ_KEY,
    build: (prompt) => ({
      url: 'https://api.groq.com/openai/v1/chat/completions',
      opts: {
        method: 'POST',
        headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${GROQ_KEY}` },
        body: JSON.stringify({
          model: 'llama-3.3-70b-versatile',
          messages: [{ role: 'user', content: prompt }],
          temperature: 0.3, top_p: 0.85, max_tokens: 16384,
          response_format: { type: 'json_object' }
        })
      }
    }),
    extract: (d) => d.choices?.[0]?.message?.content ?? ''
  },
  {
    name: 'gemini-2.5-flash (Google)',
    key: GEMINI_KEY,
    build: (prompt) => ({
      url: `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=${GEMINI_KEY}`,
      opts: {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          contents: [{ parts: [{ text: prompt }] }],
          generationConfig: { temperature: 0.3, topP: 0.85, maxOutputTokens: 16384, responseMimeType: 'application/json' }
        })
      }
    }),
    extract: (d) => d.candidates?.[0]?.content?.parts?.[0]?.text ?? ''
  }
];

async function callProvider(provider, prompt, timeoutMs = 30000) {
  if (!provider.key) return { status: 'SKIP', ms: 0, detail: 'no key' };

  const { url, opts } = provider.build(prompt);
  const t = performance.now();
  try {
    const ctrl = new AbortController();
    const timer = setTimeout(() => ctrl.abort(), timeoutMs);
    const res = await fetch(url, { ...opts, signal: ctrl.signal });
    clearTimeout(timer);
    const ms = Math.round(performance.now() - t);

    if (!res.ok) {
      const err = await res.text().catch(() => '');
      return { status: 'HTTP_ERR', ms, httpStatus: res.status, detail: err.slice(0, 150) };
    }

    const data = await res.json();
    const text = provider.extract(data);
    if (!text) return { status: 'EMPTY', ms };

    const parsed = extractJSON(text);
    if (!parsed || typeof parsed !== 'object') return { status: 'BAD_JSON', ms, detail: text.slice(0, 150) };

    return { status: 'OK', ms, keys: Object.keys(parsed).slice(0, 5) };
  } catch (err) {
    const ms = Math.round(performance.now() - t);
    const isTimeout = err.name === 'AbortError';
    return { status: isTimeout ? 'TIMEOUT' : 'ERROR', ms, detail: err.message };
  }
}

function log(name, r) {
  const tag = r.status === 'OK' ? 'OK' : r.status === 'SKIP' ? 'SKIP' : 'FAIL';
  const info = r.status === 'OK' ? `keys: [${r.keys}]` : (r.detail || r.httpStatus || '');
  console.log(`  ${tag.padEnd(4)} ${name.padEnd(28)} ${String(r.ms).padStart(5)}ms ${info}`);
}

console.log('=== test 1: small prompt (connectivity) ===\n');
for (const p of PROVIDERS) log(p.name, await callProvider(p, SMALL_PROMPT));

console.log('\n=== test 2: large prompt (~6K tokens, realistic resume) ===\n');
console.log(`  prompt size: ${BIG_PROMPT.length} chars (~${Math.round(BIG_PROMPT.length / 4)} tokens)\n`);
for (const p of PROVIDERS) log(p.name, await callProvider(p, BIG_PROMPT, 45000));

console.log('\n=== test 3: fallback chain simulation ===\n');
let resolved = false;
for (const p of PROVIDERS) {
  const r = await callProvider(p, BIG_PROMPT, 45000);
  if (r.status === 'OK') {
    console.log(`  resolved: ${p.name} (${r.ms}ms)`);
    resolved = true;
    break;
  }
  console.log(`  ${p.name}: ${r.status} (${r.ms}ms) → next`);
}
if (!resolved) console.log('  ALL FAILED → 503');

console.log('\n=== done ===');
````
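
The `extractJSON` helper in the script above is worth calling out: it is what decides whether a response counts as valid or triggers the next provider, and it tolerates the three shapes LLM responses commonly take. A standalone copy (taken verbatim from the script) with the cases it handles:

````javascript
// Verbatim copy of extractJSON from scripts/test-providers.mjs.
function extractJSON(raw) {
  const trimmed = raw.trim();
  try { return JSON.parse(trimmed); } catch {}                 // 1) bare JSON
  const cleaned = trimmed.replace(/```json\n?|\n?```/g, '').trim();
  try { return JSON.parse(cleaned); } catch {}                 // 2) markdown-fenced JSON
  const s = cleaned.indexOf('{'), e = cleaned.lastIndexOf('}');
  if (s !== -1 && e > s) { try { return JSON.parse(cleaned.slice(s, e + 1)); } catch {} } // 3) JSON buried in chatter
  return null;                                                 // 4) unrecoverable → next provider
}

console.log(extractJSON('{"score": 85}'));                          // bare
console.log(extractJSON('```json\n{"score": 85}\n```'));            // fenced
console.log(extractJSON('Sure! Here it is: {"score": 85} Enjoy.')); // chatty
console.log(extractJSON('no json here'));                           // null
````

Returning `null` rather than throwing keeps the caller's control flow simple: a falsy result means "advance the fallback chain".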

src/routes/about/+page.svelte — 2 additions, 1 deletion

```diff
@@ -219,7 +219,8 @@
 <div class="tech-card">
   <h4>AI</h4>
   <ul>
-    <li>Google Gemini 2.5 Flash-Lite (primary)</li>
+    <li>Gemma 3 27B via Google (primary)</li>
+    <li>Llama 3.3 70B via Groq (fallback)</li>
     <li>Rule-based fallback engine</li>
     <li>TF-IDF keyword matching</li>
     <li>Skills taxonomy (500+ terms)</li>
```
