- Project Overview
- Why This Repo Exists
- Key Features
- Architecture Diagram
- Installation & Setup
- How to Run the Notebooks
- Core Modules & Functions Explained
- Extending & Customising the Project
- Testing & Development Workflow
- Troubleshooting & FAQ
- License
- Acknowledgements & Further Reading
Web Brochure Builder is a Python-based Jupyter notebook suite that turns any public company website into a concise, markdown-formatted brochure.
It does this by:
- Scraping the landing page (and a curated set of secondary pages).
- Feeding the raw text to a Large Language Model (LLM), OpenAI GPT-4o-mini by default.
- Prompting the LLM to generate a short, engaging marketing brochure in English.
- Translating that brochure into any supported language, injecting humor and cultural nuance while preserving markdown structure.
All of this runs locally (or in a hosted notebook) with a single API key and a few Python dependencies.
Creating a polished marketing brochure normally involves:
- Manual reading of dozens of web pages.
- Copy-pasting, editing, and re-formatting content.
- Hiring a copywriter or relying on generic translation tools.
Web Brochure Builder automates the research and drafting steps, allowing founders, marketers, investors, or developers to get a ready-to-publish brochure in seconds.
The multilingual extension adds a witty, culturally-aware translation layer, making it useful for global outreach.
Feature | Description | Benefits |
---|---|---|
URL Validation | Uses `validators.url()` to ensure the input is a well-formed URL before any network request. | Prevents wasted API calls and cryptic errors. |
Dynamic Link Selection | Sends the full list of discovered links to GPT with a system prompt that returns the most brochure-relevant links in JSON. | Keeps the final narrative focused on About, Careers, Docs, etc., rather than on irrelevant pages. |
Recursive Scraping | For each selected link, extracts clean title & body text (removes scripts, styles, images). | Supplies more context for a fuller brochure. |
LLM-Powered Brochure Generation | A system prompt guides GPT-4o-mini to produce a short, markdown-styled brochure. | Leverages state-of-the-art language generation without hand-crafting the copy. |
Streaming Output | Optional streaming functions (`stream_gpt`, `stream_claude`, `stream_gemini`) update the UI in real time. | Improves UX: you can watch the story unfold. |
Multilingual Humor Translation | Uses `pycountry` for language validation and a second system prompt to translate with wit. | Turns a plain translation into a memorable, culturally-aware piece. |
Model Flexibility | Supports OpenAI, Anthropic Claude, and Google Gemini (via dedicated streaming wrappers). | Lets you switch between the supported LLM providers. |
Gradio UI | A lightweight web UI (`Project Brochure Studio.ipynb`) lets non-technical users pick a model and generate a brochure instantly. | No notebook knowledge required. |
MIT License | Free for commercial and private use. | Encourages community contributions. |
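
The URL validation row above can be pictured with a standard-library stand-in. This is only a sketch: the project itself relies on `validators.url()`, and the helper name `looks_like_url` and its rules (an `http`/`https` scheme plus a dotted host name) are assumptions made for illustration. It is deliberately looser than the real validator.

```python
from urllib.parse import urlparse


def looks_like_url(candidate: str) -> bool:
    """Rough pre-check: require an http(s) scheme and a dotted host name.

    Hypothetical stand-in for validators.url(); it does not replicate
    that library's rules.
    """
    try:
        parts = urlparse(candidate.strip())
        host = parts.hostname or ""
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. a broken IPv6 host.
        return False
    return parts.scheme in {"http", "https"} and "." in host.strip(".")


if __name__ == "__main__":
    for url in ("https://huggingface.co/", "huggingface.co", "ftp://example.com"):
        print(url, "->", looks_like_url(url))
```

Running the check before any request means a missing scheme is reported immediately instead of surfacing later as a network error.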
    ┌─────────────────────┐
    │  User Input (URL)   │
    └──────────┬──────────┘
               │
               ▼
    ┌─────────────────────┐   ┌─────────────────────┐
    │  validators.url()   │   │  pycountry (lang)   │
    └──────────┬──────────┘   └──────────┬──────────┘
               │                         │
               ▼                         ▼
    ┌─────────────────────┐   ┌─────────────────────┐
    │   requests.get()    │   │  OpenAI API (gpt)   │
    │    (HTML → soup)    │   │    (system/user)    │
    └──────────┬──────────┘   └──────────┬──────────┘
               │                         │
               ▼                         ▼
    ┌─────────────────────┐   ┌─────────────────────┐
    │    extract links    │──▶│ link-selection LLM  │
    │   (BeautifulSoup)   │   │    (JSON output)    │
    └──────────┬──────────┘   └──────────┬──────────┘
               │                         │
               ▼                         ▼
    ┌─────────────────────┐   ┌─────────────────────┐
    │  scrape each link   │──▶│    brochure LLM     │
    │    (title, text)    │   │  (markdown output)  │
    └──────────┬──────────┘   └──────────┬──────────┘
               │                         │
               ▼                         ▼
    ┌─────────────────────┐   ┌─────────────────────┐
    │   combine content   │──▶│   translation LLM   │
    │   (master string)   │   │  (humorous, lang)   │
    └──────────┬──────────┘   └──────────┬──────────┘
               │                         │
               ▼                         ▼
    ┌─────────────────────┐   ┌─────────────────────┐
    │  display markdown   │   │ optional streaming  │
    │  (IPython.display)  │   │     (real-time)     │
    └─────────────────────┘   └─────────────────────┘
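
Read top to bottom, the left-hand column of the diagram is a chain of stages. A minimal sketch of that chain follows, with each stage passed in as a plain function so that it runs without network access or API keys. The name `run_pipeline`, its parameters, and the stub stages are hypothetical; in the notebooks these roles are roughly played by functions such as `get_links`, `get_all_details`, and `get_brochure`.

```python
from typing import Callable, Iterable


def run_pipeline(
    url: str,
    validate: Callable[[str], bool],
    scrape: Callable[[str], str],
    select_links: Callable[[str], Iterable[str]],
    write_brochure: Callable[[str], str],
) -> str:
    """Chain the stages: validate, scrape, select links, scrape each, write."""
    if not validate(url):
        raise ValueError(f"Invalid website: {url}")

    # Landing page first, then every page the link-selection stage picked.
    sections = [scrape(url)]
    sections.extend(scrape(link) for link in select_links(url))

    # The concatenated text is the "master string" handed to the brochure stage.
    master = "\n\n".join(sections)
    return write_brochure(master)


if __name__ == "__main__":
    # Stub stages standing in for real scraping and LLM calls.
    brochure = run_pipeline(
        "https://example.com",
        validate=lambda u: u.startswith("https://"),
        scrape=lambda u: f"[text of {u}]",
        select_links=lambda u: [u + "/about", u + "/careers"],
        write_brochure=lambda text: "# Brochure\n\n" + text,
    )
    print(brochure)
```

Passing the stages in as arguments is a design choice for the sketch only: it keeps the control flow visible and makes each stage easy to replace with a stub in tests.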
git clone https://github.com/AsutoshaNanda/Web-Brochure-Builder.git
cd Web-Brochure-Builder
python -m venv .venv
# Linux / macOS
source .venv/bin/activate
# Windows
.venv\Scripts\activate
The repo ships a `requirements.txt`; if it is missing, use the manual install command instead:
pip install -r requirements.txt
# or manually:
pip install openai anthropic google-generativeai python-dotenv validators beautifulsoup4 pycountry gradio
Create a `.env` file in the repo root:
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxx
GOOGLE_API_KEY=AIzaSyxxxxxxxxxxxxxxx
Tip: Only the OpenAI key is required for the default pipeline. Claude and Gemini keys are optional.
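
Since only the OpenAI key is mandatory, it can help to check up front which back-ends are usable. The sketch below uses the three variable names from the `.env` example. The helper `available_backends` and the mapping of variables to UI labels are assumptions, and it expects the variables to already be in the environment (for instance after calling `load_dotenv()`).

```python
import os
from typing import Mapping, Optional

# Variable names from the .env example; the labels mirror the UI model choices.
KEY_FOR_BACKEND = {
    "GPT": "OPENAI_API_KEY",
    "Claude": "ANTHROPIC_API_KEY",
    "Gemini": "GOOGLE_API_KEY",
}


def available_backends(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the back-ends whose API key is set and non-blank."""
    env = os.environ if env is None else env
    return [
        backend
        for backend, var in KEY_FOR_BACKEND.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    usable = available_backends()
    if "GPT" not in usable:
        print("OPENAI_API_KEY is missing; the default pipeline needs it.")
    print("Usable back-ends:", usable or "none")
```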
jupyter notebook
Open either notebook (see next section).
`Project Brochure Studio.ipynb` is a Gradio-powered UI that lets you:
- Enter a company name and URL, and pick a model (GPT, Claude, Gemini).
- Press Generate, and watch a streaming markdown brochure appear.
- Share the live UI via Gradio's `share=True` link (useful for quick demos).
Typical workflow
- Fill the three fields.
- Click Generate.
- Copy the markdown output or export it to a file.
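
The streaming behaviour can be sketched without calling any provider. Gradio re-renders an output each time a generator function yields, so a streaming function typically yields the text accumulated so far rather than only the latest fragment. In the sketch below, `fake_chunks` stands in for a provider's streamed response and `stream_markdown` is a hypothetical name, not one of the notebook's functions.

```python
from typing import Iterable, Iterator

FENCE = "`" * 3  # a markdown code fence (three backticks)


def stream_markdown(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the accumulated text after each chunk, with code fences removed."""
    result = ""
    for chunk in chunks:
        result += chunk
        # Clean the accumulated text, not the chunk, so a fence that is
        # split across two chunks is still caught.
        yield result.replace(FENCE, "")


def fake_chunks() -> Iterator[str]:
    """Stand-in for an LLM streaming response (no API call is made)."""
    yield from [FENCE[:2], "`\n# Acme", " Corp\n", "We build rockets.\n", FENCE]


if __name__ == "__main__":
    for partial in stream_markdown(fake_chunks()):
        print(repr(partial))
```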
The other notebook is a full-pipeline notebook that:
- Validates the URL and language code.
- Scrapes the landing page, extracts title & body.
- Finds additional relevant pages via an LLM-driven link-selection step.
- Aggregates content from all pages into a single master string.
- Generates an English brochure with GPT-4o-mini.
- Translates the brochure into a target language, injecting humor while preserving markdown structure.
Running the pipeline
# Example (run each cell sequentially)
from IPython.display import Markdown, display

myurl = "https://huggingface.co/"
company_name = "Hugging Face"
target_language = "JPN"  # any ISO 639-2/3 code accepted by pycountry

# 1) Scrape & select links → 2) Build master content → 3) Generate brochure → 4) Translate
display(Markdown(get_brochure(company_name, myurl)))  # English version
display(Markdown(get_brochure_target_language(myurl, company_name, target_language)))  # Japanese version
Result: Two beautifully formatted markdown sections, ready for copy-paste into a website, PDF, or slide deck.
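
The scraping step of the pipeline above (page title plus body text, with scripts and styles dropped) can be sketched with the standard library alone. The notebooks use `requests` and BeautifulSoup for this; the `TextExtractor` class below is a hypothetical stand-in built on `html.parser`, and it works on an HTML string so that it runs offline.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the <title> and visible text, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self._chunks: list[str] = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        else:
            self._chunks.append(text)

    @property
    def text(self) -> str:
        return "\n".join(self._chunks)


def extract(html: str) -> tuple[str, str]:
    """Return (title, text) for an HTML document given as a string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, parser.text


if __name__ == "__main__":
    page = "<html><head><title>Acme</title></head><body><h1>Hello</h1></body></html>"
    print(extract(page))
```

Tags such as `<img>` and `<input>` carry no text of their own, so this sketch simply ignores them instead of removing them explicitly.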
Module / Function | Purpose | Important Notes |
---|---|---|
`scape_webpage(myurl)` | Basic scraper for a single page (title + body). | Strips `<script>`, `<style>`, `<img>`, `<input>` tags. |
`stream_gpt(prompt)` / `stream_claude(prompt)` / `stream_gemini(prompt)` | Generator that yields incremental LLM responses for streaming UI. | Removes stray backticks to keep markdown clean. |
`stream_brochure(company_name, myurl, model)` | Wrapper that selects the appropriate streaming function based on the chosen model. | `model` argument: `"GPT"`, `"Claude"` or `"Gemini"`. |
`get_links_user_prompt(myurl)` | Constructs a user prompt containing all discovered links for the link-selection LLM. | Encourages the model to output JSON. |
`get_links(myurl)` | Calls OpenAI with the link-selection system prompt and returns parsed JSON. | Uses `response_format={'type':'json_object'}` for reliable parsing. |
`scrape_web(myurl)` | Full-page scraper used for secondary links (about, careers, etc.). | Returns a tuple `(title, text)`. |
`get_all_details(myurl)` | Orchestrates scraping of landing page and all selected secondary pages, concatenating their contents. | Produces the master string fed to the brochure LLM. |
`get_brochure(company_name, myurl)` | Calls OpenAI with the brochure system prompt and the master string. | Returns markdown brochure. |
`system_lang_prompt` | System prompt for the translation stage, explicitly asking for humor, cultural relevance, and markdown preservation. | Template placeholders (`{target_language}`) are filled at runtime. |
`get_brochure_in_target_language(...)` | Builds the final user prompt for translation, including the English brochure as context. | Ensures the LLM sees the full source text. |
`get_brochure_target_language(...)` | Calls OpenAI for the translation step and returns the humorous multilingual brochure. | Uses the same `gpt-4o-mini` model by default. |
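
To make the link-selection step concrete, here is a sketch of parsing such a JSON reply and resolving relative links against the base URL. The reply shape (`{"links": [{"type": ..., "url": ...}]}`) is an assumption for illustration, since the actual schema is set by the notebook's system prompt, and `parse_selected_links` is a hypothetical helper.

```python
import json
from urllib.parse import urljoin


def parse_selected_links(reply: str, base_url: str) -> list[dict]:
    """Parse the model's JSON reply and make every URL absolute.

    Assumed reply shape: {"links": [{"type": "about page", "url": "/about"}]}
    """
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return []  # Treat an unparsable reply as "no extra pages".
    if not isinstance(payload, dict):
        return []

    links = []
    for item in payload.get("links", []):
        # Skip entries that are malformed or carry no URL.
        if not isinstance(item, dict) or not item.get("url"):
            continue
        links.append(
            {
                "type": item.get("type", "unknown"),
                "url": urljoin(base_url, item["url"]),
            }
        )
    return links


if __name__ == "__main__":
    reply = '{"links": [{"type": "about page", "url": "/about"}]}'
    print(parse_selected_links(reply, "https://example.com/"))
```

Falling back to an empty list keeps the pipeline going with the landing page alone when the model returns something unexpected.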
- Add New LLMs
  Create a wrapper similar to `stream_gpt` that respects the model's streaming API, then update the dropdown in the Gradio UI.
- Cache Scraped Pages
  Implement a simple SQLite or JSON cache keyed by URL to avoid re-scraping during iterative development.
- CLI Wrapper
  Wrap the core functions in a `click`-based command line interface for non-notebook usage.
- Custom Prompts
  Replace `system_prompt` or `system_lang_prompt` with your own tone (e.g., formal, technical, or brand-specific).
- Support More Output Formats
  Add PDF export via `markdown2` → `weasyprint` or HTML conversion for web embedding.
- Unit Tests
  The repository includes a `tests/` folder skeleton. Add tests for `scape_webpage`, `get_links`, and language validation using `pytest`. Mock network calls with `responses`.
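
As an illustration of the caching idea in the list above, the following sketch stores one JSON file per URL. Everything here is hypothetical: `cached_scrape`, the cache layout, and the `scrape` callable, which stands in for whichever scraping function you wrap.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_scrape(url: str, scrape: Callable[[str], str], cache_dir: Path) -> str:
    """Return scraped text for `url`, reusing a JSON file on disk when present."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it is safe to use as a file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"

    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["text"]

    text = scrape(url)
    path.write_text(json.dumps({"url": url, "text": text}), encoding="utf-8")
    return text


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fake_scrape = lambda u: f"[text of {u}]"  # stand-in, no network access
        print(cached_scrape("https://example.com", fake_scrape, Path(tmp)))
        print(cached_scrape("https://example.com", fake_scrape, Path(tmp)))  # served from disk
```

The cache never expires in this sketch, so delete the directory when a site's content has changed.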
# Install testing extras
pip install pytest responses
# Run the test suite
pytest -v
Typical development steps:
- Create a branch: `git checkout -b feature/your-idea`
- Implement & run notebooks to verify interactive behavior.
- Add/Update tests for any new function.
- Run linting (`flake8` or `black`) to keep code style consistent.
- Submit a PR with a concise description and screenshots/GIFs of the UI.
Problem | Likely Cause | Fix |
---|---|---|
Invalid Website message | URL missing scheme (`https://`) or fails `validators.url()`. | Ensure you include `https://` and that the domain resolves. |
LLM returns empty brochure | Prompt length exceeded model token limit or API quota exhausted. | Reduce the size of the master content (`user_prompt = user_prompt[:20000]` already caps it) or check your API usage. |
Translation contains raw code fences (```) | Some models keep original backticks. | The streaming functions already call replace('```','') ; for static calls, add a post-process step. |
API key not recognized | `.env` not loaded or key has whitespace. | Run `load_dotenv()` at the top of the notebook, and double-check the key format. |
pycountry fails to recognise language code | Using a non-ISO code (e.g., `ENGLISH` instead of `ENG`). | Provide a 2-letter (`en`) or 3-letter (`eng`) ISO code. |
Gradio UI not reachable after `share=True` | Network restrictions or firewall. | Run locally without `share=True`, or open the provided local URL (`http://127.0.0.1:7860`). |
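
For static (non-streaming) calls, the post-processing step mentioned in the table could look like the sketch below. `strip_code_fences` is a hypothetical helper; unlike a blanket string replacement, it removes only whole fence lines (including a language tag such as `markdown`) and leaves the text between them untouched.

```python
import re

# A fence line: optional indentation, three or more backticks, optional language tag.
_FENCE_LINE = re.compile(r"^[ \t]*`{3,}[\w-]*[ \t]*$\n?", flags=re.MULTILINE)


def strip_code_fences(markdown: str) -> str:
    """Remove fence lines so the brochure renders as markdown, not as a code block."""
    return _FENCE_LINE.sub("", markdown).strip()


if __name__ == "__main__":
    fence = "`" * 3
    raw = f"{fence}markdown\n# Acme\n\nWe build rockets.\n{fence}"
    print(strip_code_fences(raw))
```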
This project is released under the MIT License. See the `LICENSE` file for full terms.
- OpenAI GPT-4o-mini: the primary language model for generation and translation.
- Anthropic Claude-3-Sonnet & Google Gemini-2.5-Flash: alternative back-ends supported in the UI.
- BeautifulSoup: HTML parsing and clean-text extraction.
- Gradio: quick web UI scaffolding.
- pycountry: ISO language validation.
- GitHub Discussions & Issues: for community support, feature requests, and bug reports.
From a single URL to a multilingual, witty brochure, Web Brochure Builder is your silent partner in turning web noise into a concise, compelling story. Clone it, feed it a site, and let the AI spin the yarn.
Happy building!