
🔍 Code Review: app.py & predict.py (Security, Performance & Quality Issues) #7


@AkshatRaj00

Code Review Report

Reviewed by: Perplexity AI (requested by @AkshatRaj00)
Files reviewed: app.py, predict.py, train.py
Date: 2026-05-28


🔴 Critical Issues

1. API Key Hardcoding Risk (app.py)

# README says this ❌ DANGEROUS
GEMINI_API_KEY = "your_key_here"

The get_api_key() function is correct, but the README still shows a hardcoding pattern. If a user copies it from the README and commits the file, the key leaks.

Fix: Remove the hardcoded example from README. Use only:

api_key = st.secrets.get("GOOGLE_API_KEY") or os.getenv("GOOGLE_API_KEY")
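A minimal sketch of the lookup order in this fix, written against plain mappings so it runs without Streamlit. The helper name resolve_api_key and its parameters are hypothetical and not part of app.py; in the app, secrets would presumably be st.secrets and env would be os.environ.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    secrets: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
    name: str = "GOOGLE_API_KEY",
) -> Optional[str]:
    """Return the key from secrets first, then the environment, else None."""
    # Blank strings count as missing so an empty placeholder is never used.
    for source in (secrets, env):
        value = source.get(name, "")
        if value and value.strip():
            return value.strip()
    return None


# Stand-in mappings with made-up values, not real keys.
key = resolve_api_key({}, {"GOOGLE_API_KEY": "from-env"})
print(key)
```

Returning None instead of raising lets the caller decide whether a missing key disables the AI feature or stops the app.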

2. No Input Sanitization for CSV Upload (app.py, Train Tab)

df_train = pd.read_csv(uploaded_file)  # ❌ No size or row limits

A malicious/huge CSV could crash the app or cause memory overflow on Streamlit Cloud.

Fix:

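MAX_ROWS = 100_000
# Read at most MAX_ROWS + 1 rows so an oversized file is never fully loaded
df_train = pd.read_csv(uploaded_file, nrows=MAX_ROWS + 1)
if len(df_train) > MAX_ROWS:
    st.error(f"Dataset too large. Maximum {MAX_ROWS:,} rows allowed.")
    st.stop()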
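A row limit helps most when it is enforced before the whole file has been parsed. The sketch below counts rows lazily with a cap, using only the standard library. count_rows_capped is a hypothetical helper, and the StringIO stream stands in for the uploaded file.

```python
import csv
import io
from itertools import islice


def count_rows_capped(text_stream, max_rows: int) -> int:
    """Count data rows (header excluded), stopping at max_rows + 1."""
    reader = csv.reader(text_stream)
    next(reader, None)  # skip the header row if present
    # islice stops early, so a huge file is never read to the end.
    return sum(1 for _ in islice(reader, max_rows + 1))


MAX_ROWS = 3  # tiny limit for the demo; the review suggests 100_000
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n")
too_large = count_rows_capped(sample, MAX_ROWS) > MAX_ROWS
print(too_large)
```

In the app, the uploaded file would need to be rewound with seek(0) before it is handed to pd.read_csv. Streamlit's server.maxUploadSize setting (in megabytes) can additionally cap the raw upload size.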

3. Broad except Exception Swallows All Errors (app.py)

try:
    ...
except Exception:
    return ""  # ❌ Silent failure: no logging, no user feedback

In get_ai_insight() and init_gemini(), silent catches make debugging nearly impossible in production.

Fix:

except Exception as e:
    st.warning(f"AI insight unavailable: {type(e).__name__}")
    return ""
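Showing the error type to the user covers feedback; writing a log entry as well keeps the traceback for debugging. A minimal sketch with the standard logging module follows. call_with_fallback, broken_insight, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("app.ai_insight")  # hypothetical logger name


def call_with_fallback(fn, fallback=""):
    """Run fn(); on failure, log the traceback and return (fallback, error name)."""
    try:
        return fn(), None
    except Exception as exc:
        # logger.exception records the full traceback at ERROR level.
        logger.exception("AI insight failed")
        return fallback, type(exc).__name__


def broken_insight():
    # Stand-in for a failing Gemini call.
    raise TimeoutError("upstream model timed out")


text, error_name = call_with_fallback(broken_insight)
print(repr(text), error_name)
```

The returned error name can feed the st.warning message, while the log keeps the details that should not be shown to end users.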

🟡 Medium Issues

4. safe_predict() Confidence Formula Is a Hardcoded Heuristic

confidence = min(95.0, max(55.0, round((1 - min(days, 120) / 140) * 100, 1)))  # ❌ Magic numbers

This formula is arbitrary; confidence should come from the actual model (e.g., predict_proba for a Random Forest classifier, or the standard deviation of the per-tree predictions for a regressor).

Fix: Use RF's tree variance for real confidence:

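import numpy as np
# Per-tree predictions for one sample X of shape (1, n_features)
predictions = np.array([tree.predict(X)[0] for tree in rf_model.estimators_])
spread = predictions.std() / max(abs(predictions.mean()), 1e-9)  # coefficient of variation
confidence = float(np.clip(100 * (1 - spread), 0.0, 100.0))  # clamp to [0, 100]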
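A spread-based score can go negative or divide by zero when used raw, so a clamped version is safer. The sketch below is plain Python and takes the per-tree predictions as a list. confidence_from_spread is a hypothetical name, and mapping the coefficient of variation to a percentage is still a heuristic, not a calibrated probability.

```python
from statistics import mean, pstdev
from typing import Sequence


def confidence_from_spread(predictions: Sequence[float]) -> float:
    """Map agreement between per-tree predictions to a 0-100 score."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    center = mean(predictions)
    if center == 0:
        return 0.0  # relative spread is undefined, so report no confidence
    # Coefficient of variation: population std relative to the mean.
    cv = pstdev(predictions) / abs(center)
    return round(min(100.0, max(0.0, 100.0 * (1.0 - cv))), 1)


print(confidence_from_spread([10.0, 10.0, 10.0]))  # all trees agree
print(confidence_from_spread([8.0, 12.0]))         # moderate disagreement
```

Trees that agree exactly score 100, and the score falls as their predictions diverge relative to the mean.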

5. st.html() Used for Global CSS: Questionable Pattern

st.html("""
    <style>...</style>
""")

st.html() is meant for HTML content, not global styles. Use st.markdown(unsafe_allow_html=True) for CSS injection.

Fix:

st.markdown("<style>...</style>", unsafe_allow_html=True)

6. Session State History Has No Size Limit

st.session_state.history.insert(0, {...})  # ❌ Grows indefinitely

After many predictions, this bloats session state memory.

Fix:

MAX_HISTORY = 100
st.session_state.history.insert(0, entry)
st.session_state.history = st.session_state.history[:MAX_HISTORY]
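An alternative to slicing after every insert is collections.deque with maxlen, which discards the oldest entry automatically. A small sketch follows; the entry dictionaries are made up for illustration.

```python
from collections import deque

MAX_HISTORY = 3  # small for the demo; the review suggests 100

# appendleft keeps the newest entry at index 0, like insert(0, entry).
# Once the deque is full, each appendleft drops the entry at the far end.
history = deque(maxlen=MAX_HISTORY)
for run_id in range(5):
    history.appendleft({"run": run_id})

print(list(history))
```

Whether a deque fits depends on how the rest of app.py reads the history, since a deque does not support slicing the way a list does.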

7. @st.cache_data on load_sample_data(): TTL Missing

@st.cache_data(show_spinner=False)  # ❌ Caches forever
def load_sample_data():

If sample_data.csv is updated, the cached version persists until redeployment.

Fix:

@st.cache_data(show_spinner=False, ttl=600)  # 10 min cache
def load_sample_data():
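A TTL bounds staleness by time. Another option is to make the file's modification time part of the cache key, so that an edit invalidates the entry immediately. The sketch below uses functools.lru_cache as a stand-in for st.cache_data so that it runs without Streamlit; _read_cached and load_text are hypothetical names.

```python
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=8)
def _read_cached(path: str, mtime_ns: int) -> str:
    # mtime_ns is unused in the body; it only makes the cache key change.
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def load_text(path: str) -> str:
    """Return file contents, re-reading only when the file has changed."""
    return _read_cached(path, os.stat(path).st_mtime_ns)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample_data.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write("a,b\n1,2\n")
    first = load_text(csv_path)

    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write("a,b\n3,4\n")
    # Bump the timestamp explicitly so the demo does not depend on clock resolution.
    info = os.stat(csv_path)
    os.utime(csv_path, ns=(info.st_atime_ns, info.st_mtime_ns + 1_000_000_000))
    second = load_text(csv_path)

print(first != second)
```

The same idea should carry over to a cached Streamlit function by passing the modification time in as an argument, though that is an assumption about how load_sample_data() could be restructured.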

🟢 Minor / Good Practices Missing

8. No requirements.txt Version Pinning

The current requirements.txt likely has unpinned versions, which can cause build failures when upstream packages release breaking changes.

Fix: Pin all versions:

streamlit==1.35.0
google-generativeai==0.7.2
scikit-learn==1.5.0
pandas==2.2.2
numpy==1.26.4

9. Missing __pycache__ in .gitignore

The repo currently has a __pycache__/ folder committed (visible in file tree). This should never be committed.

Fix: untrack the committed folder with git rm -r --cached __pycache__, then add to .gitignore:

__pycache__/
*.pyc
*.pyo
.env
# Optional: large binary files
*.pkl

10. No Loading State for Model Status on First Visit

If the model is not loaded, the UI shows an error but does not clearly guide the user to the Train tab first.

Fix: Add a visual onboarding banner:

if not MODEL_LOADED:
    st.info("👋 Welcome! No model found. Head to the **Train** tab to upload your CSV and train the model first.")

✅ What's Done Well

  • Clean tab-based layout (Predict / Train / Analytics), very professional
  • validate_payload() is a great defensive pattern
  • confidence_badge() with color coding is a nice UX touch
  • Sidebar model status indicator is clean
  • Gemini integration with secrets-based API key is the right approach
  • reload_models() after training is a good pattern

Summary Table

| #  | Issue                         | Severity    | File             |
|----|-------------------------------|-------------|------------------|
| 1  | API key hardcoding in README  | 🔴 Critical | README.md        |
| 2  | No CSV upload size limit      | 🔴 Critical | app.py           |
| 3  | Silent exception swallowing   | 🔴 Critical | app.py           |
| 4  | Heuristic confidence formula  | 🟡 Medium   | app.py           |
| 5  | st.html() for CSS injection   | 🟡 Medium   | app.py           |
| 6  | Unbounded session history     | 🟡 Medium   | app.py           |
| 7  | Cache TTL missing             | 🟡 Medium   | app.py           |
| 8  | Unpinned requirements         | 🟢 Minor    | requirements.txt |
| 9  | __pycache__ committed         | 🟢 Minor    | .gitignore       |
| 10 | No onboarding for new users   | 🟢 Minor    | app.py           |

💡 Overall: Solid production-style app with clean architecture. Fixing the 3 critical issues (API key, CSV limits, silent errors) would make it much safer to deploy. Upgrading the confidence formula would make the reported confidence reflect the model's actual uncertainty.

Reviewed for: @AkshatRaj00
Labels suggested: bug, enhancement, security
