🔍 Code Review — app.py & predict.py (Security, Performance & Quality Issues)

## Code Review Report
**Reviewed by:** Perplexity AI (requested by @AkshatRaj00)
**Files reviewed:** `app.py`, `predict.py`, `train.py`
**Date:** 2026-05-28

---

## 🔴 Critical Issues

### 1. API Key Hardcoding Risk (`app.py`)
```python
# README says this ❌ DANGEROUS
GEMINI_API_KEY = "your_key_here"
```
The `get_api_key()` function is correct, but the README still shows a hardcoded-key pattern. If a user copies it from the README and commits the file, the key leaks.

**Fix:** Remove the hardcoded example from README. Use only:
```python
api_key = st.secrets.get("GOOGLE_API_KEY") or os.getenv("GOOGLE_API_KEY")
```
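One way to keep the lookup order explicit, and to catch a README placeholder that was pasted in by mistake, is a small helper like the sketch below. The name `resolve_api_key` and the `PLACEHOLDERS` set are hypothetical and not taken from `app.py`; in the app, `st.secrets` would be passed as `secrets`, assuming it supports `.get()` the way the fix above uses it.

```python
import os
from typing import Mapping, Optional

# Placeholder values that should never be treated as real keys (illustrative).
PLACEHOLDERS = {"your_key_here"}


def resolve_api_key(
    secrets: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
    name: str = "GOOGLE_API_KEY",
) -> Optional[str]:
    """Return the first usable key from secrets, then from the environment."""
    for source in (secrets, environ):
        value = (source.get(name) or "").strip()
        if value and value not in PLACEHOLDERS:
            return value
    return None
```

Returning `None` instead of raising lets the caller decide whether to disable the AI features or stop the app.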

---

### 2. No Input Sanitization for CSV Upload (`app.py` — Train Tab)
```python
df_train = pd.read_csv(uploaded_file)  # ❌ No size or row limits
```
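A malicious or very large CSV could exhaust memory and crash the app on Streamlit Cloud. The limit has to be applied while reading, because checking `len(df_train)` after an unrestricted `pd.read_csv()` means the whole file has already been parsed into memory.

**Fix:**
```python
MAX_ROWS = 100_000
# Read one row past the limit so an oversized file is detected without parsing all of it
df_train = pd.read_csv(uploaded_file, nrows=MAX_ROWS + 1)
if len(df_train) > MAX_ROWS:
    st.error(f"Dataset too large. Maximum {MAX_ROWS} rows allowed.")
    st.stop()
```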
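A size check can also run before the file is parsed at all. The sketch below works on any `io.BytesIO` object, which Streamlit's uploaded-file object subclasses. The helper name `check_upload` and the 20 MB byte cap are assumptions made for illustration, and counting newlines only approximates the row count (a quoted field containing line breaks is counted more than once).

```python
import io
from typing import Optional

MAX_BYTES = 20 * 1024 * 1024  # assumed cap, not taken from the app
MAX_ROWS = 100_000


def check_upload(
    buffer: io.BytesIO,
    max_bytes: int = MAX_BYTES,
    max_rows: int = MAX_ROWS,
) -> Optional[str]:
    """Return an error message if the upload is too large, else None.

    The buffer is always rewound to position 0 before returning.
    """
    size = buffer.getbuffer().nbytes
    if size > max_bytes:
        return f"File too large: {size:,} bytes (limit {max_bytes:,})."

    buffer.seek(0)
    error = None
    # Line 0 is the header, so the line index equals the number of data rows seen.
    for index, _line in enumerate(buffer):
        if index > max_rows:
            error = f"Dataset too large. Maximum {max_rows:,} rows allowed."
            break
    buffer.seek(0)
    return error
```

In the Train tab, a non-`None` result would be passed to `st.error()` followed by `st.stop()`, before `pd.read_csv()` is called.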

---

### 3. Broad `except Exception` Silently Swallows Errors (`app.py`)
```python
try:
    ...
except Exception:
    return ""  # ❌ Silent failure — no logging, no user feedback
```
In `get_ai_insight()` and `init_gemini()`, silent catches make debugging nearly impossible in production.

**Fix:**
```python
except Exception as e:
    st.warning(f"AI insight unavailable: {type(e).__name__}")
    return ""
```
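The warning above tells the user something failed but still leaves no trace for the developer. A minimal sketch of the logging half, using only the standard library, might look like this. The wrapper `call_with_fallback` and the logger name `"app"` are hypothetical and not part of the reviewed code.

```python
import logging

logger = logging.getLogger("app")  # hypothetical logger name


def call_with_fallback(func, *args, fallback="", **kwargs):
    """Call func; on failure, log the full traceback and return fallback."""
    try:
        return func(*args, **kwargs)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        name = getattr(func, "__name__", repr(func))
        logger.exception("Call to %s failed", name)
        return fallback
```

With this in place, `get_ai_insight()` could keep returning `""` to the UI while the traceback still reaches the logs.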

---

## 🟡 Medium Issues

### 4. `safe_predict()` Confidence Formula is Hardcoded Heuristic
```python
confidence = min(95.0, max(55.0, round((1 - min(days, 120) / 140) * 100, 1)))  # ❌ Magic numbers
```
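This formula depends only on `days`, not on the model, so the number it reports is arbitrary. Confidence should come from the model itself, for example `predict_proba` for a classifier or the spread of per-tree predictions for a Random Forest regressor.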
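Evaluating the expression at a few values makes the problem concrete: the result is a fixed function of `days`, pinned to the 55 to 95 range regardless of what the model predicts. The function name `heuristic_confidence` is a label chosen here for illustration; the expression itself is copied from the snippet above.

```python
def heuristic_confidence(days: float) -> float:
    """The confidence expression from safe_predict(), isolated for inspection."""
    return min(95.0, max(55.0, round((1 - min(days, 120) / 140) * 100, 1)))


samples = {days: heuristic_confidence(days) for days in (0, 30, 120, 365)}
print(samples)  # {0: 95.0, 30: 78.6, 120: 55.0, 365: 55.0}
```

Any two inputs with the same `days` value get the same confidence, however uncertain the underlying prediction is.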

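**Fix:** Use the spread of the Random Forest's per-tree predictions, clamped to 0-100 (this assumes `rf_model` is a fitted regressor and `X` is a single-row feature array):
```python
import numpy as np
# One prediction per tree
predictions = np.array([tree.predict(X)[0] for tree in rf_model.estimators_])
# Relative spread (coefficient of variation), guarded against a zero mean
cv = predictions.std() / max(abs(predictions.mean()), 1e-9)
confidence = float(np.clip(100 * (1 - cv), 0.0, 100.0))
```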

---

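### 5. `st.html()` for Global CSS Is Version-Dependent
```python
st.html("""
    <style>...</style>
""")
```
`st.html()` is not deprecated, but it is a relatively recent API and its handling of `<style>`-only content may differ between Streamlit releases. `st.markdown(..., unsafe_allow_html=True)` is the long-standing way to inject global CSS. Whichever is used, confirm that the styles are applied under the Streamlit version the app pins.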

**Fix:**
```python
st.markdown("<style>...</style>", unsafe_allow_html=True)
```

---

### 6. Session State History Has No Size Limit
```python
st.session_state.history.insert(0, {...})  # ❌ Grows indefinitely
```
After many predictions, this bloats session state memory.

**Fix:**
```python
MAX_HISTORY = 100
st.session_state.history.insert(0, entry)
st.session_state.history = st.session_state.history[:MAX_HISTORY]
```
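An alternative that avoids re-slicing the list on every prediction is a `collections.deque` with `maxlen`, which discards the oldest entry automatically. This is a sketch only: the helper names `new_history` and `record` are hypothetical, and it assumes the rest of the app only iterates over the history rather than relying on list-specific methods.

```python
from collections import deque

MAX_HISTORY = 100


def new_history(maxlen: int = MAX_HISTORY) -> deque:
    """Create an empty, size-bounded history container."""
    return deque(maxlen=maxlen)


def record(history: deque, entry: dict) -> None:
    """Insert the newest entry first; the oldest falls off the far end."""
    history.appendleft(entry)
```

The deque would be stored once in `st.session_state.history` and then updated through `record()`.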

---

### 7. `@st.cache_data` on `load_sample_data()` — TTL Missing
```python
@st.cache_data(show_spinner=False)  # ❌ Caches forever
def load_sample_data():
```
If `sample_data.csv` is updated, the cached version persists until redeployment.

**Fix:**
```python
@st.cache_data(show_spinner=False, ttl=600)  # 10 min cache
def load_sample_data():
```
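A TTL still serves stale data for up to ten minutes. Another option is to make the file's modification time part of the cache key, so an edit invalidates the entry immediately. The sketch below demonstrates the idea with `functools.lru_cache` as a stand-in for `st.cache_data`; the function names are hypothetical, and the same pattern would apply by passing the modification time as an extra argument to the cached Streamlit function.

```python
import functools
import os


@functools.lru_cache(maxsize=8)
def _load_text(path: str, mtime_ns: int) -> str:
    """Cached on (path, mtime_ns): a changed file produces a new cache key."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def load_sample_text(path: str) -> str:
    """Read the file, reusing the cached copy while it is unmodified."""
    return _load_text(path, os.stat(path).st_mtime_ns)
```

The cost is one `os.stat()` call per rerun, which is negligible next to re-reading the CSV.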

---

## 🟢 Minor / Good Practices Missing

### 8. No `requirements.txt` Version Pinning
The current `requirements.txt` likely has unpinned versions, which can cause build failures when upstream packages release breaking changes.

**Fix:** Pin all versions:
```
streamlit==1.35.0
google-generativeai==0.7.2
scikit-learn==1.5.0
pandas==2.2.2
numpy==1.26.4
```
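Since the review only guesses that versions are unpinned, a quick check can confirm it. The sketch below flags every requirement line that does not use an exact `==` pin. The function name `unpinned` is hypothetical, and the parsing is deliberately simple: it ignores comments, blank lines, and option lines such as `-r other.txt`, and does not handle URLs or environment markers.

```python
import re

# A package name, optional extras in brackets, then an exact "==" pin.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=\s]+")


def unpinned(requirements_text: str) -> list:
    """Return the requirement lines that lack an exact version pin."""
    flagged = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):
            continue  # blank line or pip option
        if not PINNED.match(line):
            flagged.append(line)
    return flagged
```

Running it over the contents of `requirements.txt` and failing the build on a non-empty result would keep new unpinned entries from slipping in.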

---

### 9. Missing `__pycache__` in `.gitignore`
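The repo currently has a `__pycache__/` folder committed (visible in the file tree). Compiled bytecode should never be committed. Because the folder is already tracked, adding it to `.gitignore` alone will not remove it; it also needs `git rm -r --cached __pycache__` followed by a commit.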

**Fix — add to `.gitignore`:**
```
__pycache__/
*.pyc
*.pyo
.env
# Optional: large binary model files
*.pkl
```

---

### 10. No Loading State for Model Status on First Visit
If the model is not loaded, the UI shows an error but does not clearly direct the user to the Train tab first.

**Fix:** Add a visual onboarding banner:
```python
if not MODEL_LOADED:
    st.info("👋 Welcome! No model found. Head to the **Train** tab to upload your CSV and train the model first.")
```

---

## ✅ What's Done Well

- Clean tab-based layout (`Predict / Train / Analytics`) — very professional
- `validate_payload()` is a great defensive pattern
- `confidence_badge()` with color coding is a nice UX touch
- Sidebar model status indicator is clean
- Gemini integration with secrets-based API key is the right approach
- `reload_models()` after training — good pattern

---

## Summary Table

| # | Issue | Severity | File |
|---|---|---|---|
| 1 | API key hardcoding in README | 🔴 Critical | README.md |
| 2 | No CSV upload size limit | 🔴 Critical | app.py |
| 3 | Silent exception swallowing | 🔴 Critical | app.py |
| 4 | Heuristic confidence formula | 🟡 Medium | app.py |
| 5 | `st.html()` for CSS injection | 🟡 Medium | app.py |
| 6 | Unbounded session history | 🟡 Medium | app.py |
| 7 | Cache TTL missing | 🟡 Medium | app.py |
| 8 | Unpinned requirements | 🟢 Minor | requirements.txt |
| 9 | `__pycache__` committed | 🟢 Minor | .gitignore |
| 10 | No onboarding for new users | 🟢 Minor | app.py |

---

> 💡 **Overall:** Solid production-style app with clean architecture. Fixing the 3 critical issues (API key, CSV limits, silent errors) would make this deployment-safe. The confidence formula upgrade would significantly improve model trustworthiness.

**Reviewed for:** @AkshatRaj00
**Labels suggested:** `bug`, `enhancement`, `security`
