Walkthrough

Adds a Selenium-based crawler script for Yanolja reviews, a static Aladin bestseller scraper script, and updates a notebook by executing cells (capturing pip install/build logs and a rendered DataFrame) and changing the kernel metadata to Python 3.10.18.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Dev as Developer
    participant Script as yanolja.py
    participant Driver as Selenium WebDriver (Chrome)
    participant Site as Yanolja Reviews Page
    participant Parser as BeautifulSoup
    participant NLP as Text Processing
    participant DF as pandas.DataFrame
    participant FS as File System
    Dev->>Script: Run script
    Script->>Driver: Initialize Chrome WebDriver
    Script->>Driver: get(URL)
    Driver->>Site: Navigate
    Note over Script,Driver: Scroll loop to load more reviews
    loop Scroll & load
        Script->>Driver: window.scrollTo(...)
        Script-->>Script: sleep (fixed wait)
    end
    Script->>Driver: Locate review elements (Selenium)
    Script->>Driver: Extract star SVG elements
    Script-->>Script: Count filled stars (rating)
    Script->>Driver: Get page_source
    Script->>Parser: Parse HTML
    Parser-->>Script: Review texts
    Script->>NLP: Clean, remove stopwords, extract Hangul, count freq
    NLP-->>Script: Top common words
    Script->>DF: Build DataFrame (reviews + ratings)
    Script->>DF: Append summary row(s) (avg rating, top words)
    DF->>FS: Save as yanolja.xlsx
    Script->>Driver: quit()
    Script-->>Dev: Exit
    note right of Script: Basic error handling for missing stars
```
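The text-processing step the diagram describes (extract Hangul, drop stopwords, count frequencies) can be sketched in plain Python. The review strings and the stopword set below are illustrative stand-ins, not values from the actual crawler:

```python
import re
from collections import Counter

# Illustrative inputs; the real script builds these from the parsed reviews
reviews = ["방이 아주 깨끗하고 좋아요", "위치가 좋아요 직원이 친절해요"]
stopwords = {"아주"}  # hypothetical stopword list

words = []
for review in reviews:
    # Keep only runs of Hangul characters, then drop stopwords
    for word in re.findall(r"[가-힣]+", review):
        if word not in stopwords:
            words.append(word)

top_words = Counter(words).most_common(3)
print(top_words)
```

`most_common` returns ties in first-insertion order, so the top entry here is the repeated word with its count.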
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
static-crawling/jisigin.py (1)
1-266: Critical: This file is Jupyter JSON saved with a .py extension. It won’t import or execute as Python. Choose one:
- Rename to .ipynb (keep as a notebook), or
- Replace contents with a valid Python script.
Option B (replace with a working script, robust selectors, headers, timeouts, pagination, Excel export):
```diff
-{
-  "cells": [
-  ... (notebook JSON omitted for brevity) ...
-}
+import re
+import time
+import random
+import urllib.parse
+from typing import List, Dict
+
+import pandas as pd
+import requests
+from bs4 import BeautifulSoup
+
+BASE_URL = "https://kin.naver.com"
+SEARCH_URL = f"{BASE_URL}/search/list.naver"
+HEADERS = {
+    "User-Agent": (
+        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
+        "AppleWebKit/537.36 (KHTML, like Gecko) "
+        "Chrome/127.0.0.0 Safari/537.36"
+    )
+}
+
+def fetch_html(url: str, session: requests.Session) -> str:
+    resp = session.get(url, headers=HEADERS, timeout=10)
+    resp.raise_for_status()
+    return resp.text
+
+def parse_item(tree) -> Dict[str, str]:
+    # Title/link
+    a = tree.select_one("dt > a[href]") or tree.select_one('a[class*="kin.txt"]')
+    title = a.get_text(strip=True) if a else ""
+    link = urllib.parse.urljoin(BASE_URL, a["href"]) if a and a.has_attr("href") else ""
+
+    # Date/category/hit are usually inside dd .txt_inline and siblings
+    date_el = tree.select_one(".txt_inline")
+    date_text = date_el.get_text(" ", strip=True) if date_el else ""
+
+    cat_el = tree.select_one('a[class*="kin.cat2"]') or tree.select_one("dd .txt_block a")
+    category = cat_el.get_text(strip=True) if cat_el else ""
+
+    hit_el = tree.select_one(".hit")
+    hit_text = hit_el.get_text(" ", strip=True) if hit_el else ""
+    hit_num_match = re.search(r"\d+", hit_text)
+    hit = hit_num_match.group(0) if hit_num_match else "0"
+
+    return {"title": title, "link": link, "date": date_text, "category": category, "hit": hit}
+
+def crawl_kin(query: str, pages: int = 3, delay_range=(0.5, 1.5)) -> pd.DataFrame:
+    params_base = {"query": query}
+    data: List[Dict[str, str]] = []
+    with requests.Session() as session:
+        for page in range(1, pages + 1):
+            params = dict(params_base, page=str(page))
+            url = f"{SEARCH_URL}?{urllib.parse.urlencode(params)}"
+            html = fetch_html(url, session)
+            soup = BeautifulSoup(html, "html.parser")
+            trees = soup.select(".basic1 > li > dl")
+            for tree in trees:
+                item = parse_item(tree)
+                if item["title"] and item["link"]:
+                    data.append(item)
+            time.sleep(random.uniform(*delay_range))
+    return pd.DataFrame(data, columns=["title", "link", "date", "category", "hit"])
+
+if __name__ == "__main__":
+    df = crawl_kin("삼성전자", pages=3)
+    df.to_excel("naver_kin_samsung.xlsx", index=False)
+    print(f"Saved {len(df)} rows to naver_kin_samsung.xlsx")
```

Note: Respect robots.txt/ToS, add longer delays if needed.
If you prefer to keep this as a teaching notebook, rename to .ipynb and finish the placeholders (response/soup/title/link/date/etc.) so it executes end-to-end.
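A quick way to confirm which case you are in is to try parsing the file as JSON and looking for a `cells` key. A small stdlib check along these lines (the inline sample stands in for the file's contents):

```python
import json

def is_notebook_json(text: str) -> bool:
    """Return True if the text parses as Jupyter notebook JSON."""
    try:
        data = json.loads(text)
    except ValueError:
        return False
    return isinstance(data, dict) and "cells" in data

# Inline sample; in practice you would read the .py file's text instead
sample = '{"cells": [], "metadata": {}, "nbformat": 4}'
print(is_notebook_json(sample))        # True
print(is_notebook_json("import re"))   # False: plain Python source
```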
🧹 Nitpick comments (5)
api-practice/newpaper.ipynb (2)
379-387: Harden article parsing and fix unused loop variable.
- Newspaper3k frequently fails on some URLs; current code will raise and stop the loop.
- `index` is unused (ruff B007).
Apply:
```diff
-for index, row in articles.iterrows():
+for _idx, row in articles.iterrows():
@@
-    article = Article(url)
-    article.download()
-    article.parse()
-    text = article.text  # article body
+    try:
+        article = Article(url)
+        article.download()
+        article.parse()
+        text = article.text
+    except Exception as e:
+        # Skip problematic articles but keep the pipeline running
+        text = ""
```
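The skip-and-continue pattern in the suggested diff can be exercised with stand-in functions (nothing below is newspaper3k's real API; `fake_download` is a hypothetical helper that fails on one URL):

```python
def fake_download(url: str) -> str:
    # Stand-in for Article.download()/parse(); fails on one URL
    if "broken" in url:
        raise RuntimeError("download failed")
    return f"body of {url}"

urls = ["https://example.com/a", "https://example.com/broken", "https://example.com/b"]
texts = []
for url in urls:
    try:
        text = fake_download(url)
    except Exception:
        # Skip problematic articles but keep the pipeline running
        text = ""
    texts.append(text)

print(texts)
```

One bad URL yields an empty string instead of aborting the whole loop.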
419-420: CSV encoding for Excel compatibility.

When opening in Excel (Korean locale), utf-8-sig avoids mojibake.
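The `utf-8-sig` codec simply prepends the UTF-8 byte-order mark that Excel uses to detect the encoding; a minimal stdlib check of that behavior:

```python
import codecs

text = "제목,본문\n"  # a hypothetical CSV header row
encoded = text.encode("utf-8-sig")

# utf-8-sig output is exactly the UTF-8 BOM plus plain UTF-8 bytes
assert encoded == codecs.BOM_UTF8 + text.encode("utf-8")
print(encoded[:3])  # b'\xef\xbb\xbf'
```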
```diff
-articles_df.to_csv('articles_data.csv', index=False)
+articles_df.to_csv('articles_data.csv', index=False, encoding='utf-8-sig')
```

static-crawling/jisigin.py (3)
25-35: Remove pip installs from the notebook/script.

Manage dependencies via requirements.txt/conda env, not inline.
```diff
-!pip install bs4
-!pip install requests
-!pip install pandas
-!pip install openpyxl
```
174-185: Guard against missing elements to avoid AttributeError.

Selectors may not match every item; add presence checks.
```diff
-for tree in trees:
-    title = tree.select_one("._nclicks\\:kin\\.txt").text
-    link = tree.select_one("._nclicks\\:kin\\.txt").attrs['href']
-    date = tree.select_one(".txt_inline").text
-    category = tree.select_one("._nclicks\\:kin\\.cat2").text
-    hit = tree.select_one(".hit").text.split()[1]
-
-    # print results
-    print(title, link, date, category, hit)
+for tree in trees:
+    a = tree.select_one("dt > a[href]") or tree.select_one('a[class*="kin.txt"]')
+    title = a.get_text(strip=True) if a else ""
+    link = a["href"] if a and a.has_attr("href") else ""
+    date_el = tree.select_one(".txt_inline")
+    date = date_el.get_text(" ", strip=True) if date_el else ""
+    cat_el = tree.select_one('a[class*="kin.cat2"]') or tree.select_one("dd .txt_block a")
+    category = cat_el.get_text(strip=True) if cat_el else ""
+    hit_el = tree.select_one(".hit")
+    hit = re.sub(r"[^0-9]", "", hit_el.get_text()) if hit_el else "0"
+    print(title, link, date, category, hit)
```
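The presence-check pattern above boils down to a fallback selector plus a None guard, and it can be demonstrated without bs4 by using stub objects in place of BeautifulSoup tags (`Tag` and `text_or` below are illustrative helpers, not part of any library):

```python
class Tag:
    # Minimal stand-in for a BeautifulSoup tag (illustrative only)
    def __init__(self, text):
        self._text = text

    def get_text(self, sep=" ", strip=False):
        return self._text.strip() if strip else self._text

def text_or(el, default=""):
    # Guarded extraction: never raises AttributeError on a missing element
    return el.get_text(strip=True) if el is not None else default

primary, fallback = None, Tag(" 네이버 지식iN ")
el = primary or fallback            # first matching selector wins
print(text_or(el))                  # 네이버 지식iN
print(text_or(None, default="0"))   # 0
```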
235-241: Save to Excel with engine and index handling.

```diff
-# Save to Excel using pandas
+df.to_excel("naver_kin_samsung.xlsx", index=False)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)
- api-practice/articles_data.csv is excluded by !**/*.csv
- dynamic-crawling/yanolja.xlsx is excluded by !**/*.xlsx
- static-crawling/jisigin.xlsx is excluded by !**/*.xlsx
- static-crawling/static-crawling_assignment.csv is excluded by !**/*.csv
📒 Files selected for processing (2)
- api-practice/newpaper.ipynb (3 hunks)
- static-crawling/jisigin.py (1 hunk)
🧰 Additional context used
🪛 Ruff (0.13.1)
static-crawling/jisigin.py
1-265: Found useless expression. Either assign it to a variable or remove it.
(B018)
49-49: Undefined name null
(F821)
77-77: Undefined name null
(F821)
101-101: Undefined name null
(F821)
114-114: Undefined name null
(F821)
126-126: Undefined name null
(F821)
138-138: Undefined name null
(F821)
170-170: Undefined name null
(F821)
200-200: Undefined name null
(F821)
236-236: Undefined name null
(F821)
api-practice/newpaper.ipynb
25-25: Loop control variable index not used within loop body
Rename unused index to _index
(B007)
Flagged hunk (abridged from roughly 190 lines of notebook JSON in the original): the cell's `"outputs": []` becomes three committed output streams. First, a stdout stream with the full `gdeltdoc` and `newspaper3k==0.2.8` pip install logs (Collecting/Downloading/Building-wheel progress, ending in `Successfully installed gdeltdoc-1.12.0` and `Successfully installed Pillow-11.3.0 PyYAML-6.0.2 click-8.3.0 cssselect-1.3.0 feedfinder2-0.0.4 feedparser-6.0.12 filelock-3.19.1 jieba3k-0.35.1 joblib-1.5.2 lxml-6.0.2 newspaper3k-0.2.8 nltk-3.9.1 regex-2025.9.18 requests-file-2.1.0 sgmllib3k-1.0.0 tinysegmenter-0.3 tldextract-5.3.0`). Second, a stderr stream with four `DEPRECATION: Building '<pkg>' using the legacy setup.py bdist_wheel mechanism` warnings (tinysegmenter, feedfinder2, jieba3k, sgmllib3k). Third, another stdout stream covering the `lxml_html_clean` install and pandas requirement checks.
🛠️ Refactor suggestion
🧩 Analysis chain
Please remove in-notebook pip installs and committed outputs.
- Installing packages inside the notebook and committing the long pip logs makes diffs noisy and harms reproducibility. Prefer requirements.txt/conda env (environment.yml) and keep notebooks output-free in VCS.
- lxml_html_clean is installed but unused.
Apply this minimal change (remove the install lines from the cell):
```diff
-!pip install gdeltdoc
-!pip install newspaper3k==0.2.8
-!pip install lxml_html_clean
-!pip install pandas
```

To verify outputs are cleared and no “pip install” remains:
🏁 Script executed:
```bash
#!/bin/bash
# Count total outputs in the notebook (should be 0)
jq '[.cells[].outputs|length]|add' api-practice/newpaper.ipynb
# Find any in-notebook installs (should return nothing)
rg -nP '^\s*!\s*pip\s+install' -- api-practice/newpaper.ipynb
```

Length of output: 1
Clear outputs in api-practice/newpaper.ipynb — remove committed pip logs.
- Verification: jq reports 4 cell outputs remain; ripgrep found no in-notebook `!pip install` lines (install commands removed but output logs are still committed). Clear outputs (e.g. `jupyter nbconvert --clear-output --inplace api-practice/newpaper.ipynb` or use nbstripout) and commit.
🤖 Prompt for AI Agents
In api-practice/newpaper.ipynb around lines 22 to 215 the notebook still
contains large pip install stdout/stderr outputs (pip logs) committed as cell
outputs; remove those outputs by clearing notebook outputs (e.g. run jupyter
nbconvert --clear-output --inplace api-practice/newpaper.ipynb or use
nbstripout) then verify with jq/ripgrep that cell outputs are empty and no '!pip
install' outputs remain, commit the cleaned notebook, and optionally add
nbstripout or a .gitattributes rule to prevent future committed outputs.
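If you go the nbstripout route, a `.gitattributes` entry along these lines registers the filter for all notebooks. This is a sketch, not a drop-in config; it assumes the `nbstripout` filter has been set up in git config (e.g. via `nbstripout --install`):

```gitattributes
*.ipynb filter=nbstripout
```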
Flagged hunk (abridged): the cell's `"outputs": []` becomes an `execute_result` (execution_count 4) holding a rendered five-row DataFrame with columns `title`, `url`, and `text`, committed in both `text/html` and `text/plain` form. The rows are NYTimes articles ("The C . E . O . s Who Just Wont Quit - The New...", "C . Gordon Bell , Creator of a Personal Comput...", "Wayve , an A . I . Start - Up for Autonomous D...", "Biden to Announce A . I . Center in Wisconsin ...", "How Rich Candidates Burned Cash on Running for...") with their nytimes.com URLs and truncated scraped body text.
Avoid committing scraped article text in outputs (potential licensing risk).
The rendered table includes “text” content from NYTimes. Commit history will store this third‑party content. Clear outputs and, if you must preview, display only title/url.
Apply this change to the display line to avoid rendering full text in outputs:
```diff
-articles_df.head()  # print the first 5 articles
+articles_df[['title','url']].head()
```

Then clear all notebook outputs before committing (see verification script in prior comment).
🤖 Prompt for AI Agents
In api-practice/newpaper.ipynb around lines 278–369 the notebook output is
rendering full scraped article "text" (NYTimes) which risks committing
third‑party content; update the display cell to only show title and url (remove
or omit the "text" column from the DataFrame/display call), then clear all
notebook outputs (run "Clear All Outputs" or nbconvert/nbstripout) and re-run
the verification script referenced in the prior comment to confirm no outputs
remain before committing.
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# 여러 페이지에서 정보 추출\n", | ||
| "data = []\n", | ||
| "for page_num in range(1, 4): # 1~3페이지 크롤링\n", | ||
| " url = f\"https://kin.naver.com/search/list.naver?query=%EC%82%BC%EC%84%B1%EC%A0%84%EC%9E%90&page={page_num}\"\n", | ||
| " response = requests.get(url)\n", | ||
| " html = response.text\n", | ||
| " soup = BeautifulSoup(html, 'html.parser')\n", | ||
| " trees = soup.select(\".basic1 > li > dl\")\n", | ||
| " \n", | ||
| " for tree in trees:\n", | ||
| " ################\n", | ||
| " \n", | ||
| " # 데이터를 리스트에 추가\n", | ||
| " data.append([title, link, date, category, hit])\n", | ||
| "\n", | ||
| "# DataFrame으로 변환\n", | ||
| "import pandas as pd\n", | ||
| "df = " | ||
| ] |
Complete the pagination block and build the DataFrame.

The loop body currently appends variables (`title`, `link`, `date`, `category`, `hit`) that are never defined, and the final `df =` assignment is left dangling. Note the suggested code uses `re.sub`, so `re` must be imported as well.

```diff
+import re
+
 data = []
 for page_num in range(1, 4):  # 1~3페이지 크롤링
     url = f"https://kin.naver.com/search/list.naver?query=%EC%82%BC%EC%84%B1%EC%A0%84%EC%9E%90&page={page_num}"
     response = requests.get(url)
     html = response.text
     soup = BeautifulSoup(html, 'html.parser')
     trees = soup.select(".basic1 > li > dl")
     for tree in trees:
-        ################
-
-        # 데이터를 리스트에 추가
-        data.append([title, link, date, category, hit])
+        a = tree.select_one("dt > a[href]") or tree.select_one('a[class*="kin.txt"]')
+        title = a.get_text(strip=True) if a else ""
+        link = a["href"] if a and a.has_attr("href") else ""
+        date_el = tree.select_one(".txt_inline")
+        date = date_el.get_text(" ", strip=True) if date_el else ""
+        cat_el = tree.select_one('a[class*="kin.cat2"]') or tree.select_one("dd .txt_block a")
+        category = cat_el.get_text(strip=True) if cat_el else ""
+        hit_el = tree.select_one(".hit")
+        hit = re.sub(r"[^0-9]", "", hit_el.get_text()) if hit_el else "0"
+        data.append([title, link, date, category, hit])
 # DataFrame으로 변환
 import pandas as pd
-df =
+df = pd.DataFrame(data, columns=["title", "link", "date", "category", "hit"])
```

Committable suggestion skipped: line range outside the PR's diff.
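The collect-then-build pattern the suggestion ends with can be exercised offline; the sample rows below are hypothetical stand-ins, not data from the Naver page:

```python
import pandas as pd

# Hypothetical parsed items; the empty dict simulates a row where every selector missed.
items = [{"title": "배당금 문의", "link": "/qna/1", "hit": "42"}, {}]

data = []
for item in items:
    # Safe defaults keep the row instead of crashing on missing fields.
    data.append([
        item.get("title", ""),
        item.get("link", ""),
        item.get("date", ""),
        item.get("category", ""),
        item.get("hit", "0"),
    ])

df = pd.DataFrame(data, columns=["title", "link", "date", "category", "hit"])
print(df.shape)  # (2, 5)
```

The single `pd.DataFrame(...)` call at the end is cheaper and clearer than appending to a DataFrame inside the loop.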
🧰 Tools
🪛 Ruff (0.13.1)
200-200: Undefined name null
(F821)
🤖 Prompt for AI Agents
In static-crawling/jisigin.py around lines 199 to 222, the pagination loop
currently appends undefined variables (title, link, date, category, hit) and
does not build the DataFrame; update the loop to extract each field from each
`tree` (e.g., select title text, href for link, parse date text, category text,
and hit count or default), handle missing elements with safe defaults, append a
fully-populated row like [title, link, date, category, hit] for each item, and
after the loop import pandas and create `df = pd.DataFrame(data, columns=[...])`
(optionally set dtypes and save/export as needed).
Actionable comments posted: 3
🧹 Nitpick comments (7)
dynamic-crawling/yanolja.py (7)
6-8: Add explicit wait imports for reliable loading.

Apply this diff:

```diff
 from selenium import webdriver
 from selenium.webdriver.common.by import By
+from selenium.webdriver.support.ui import WebDriverWait
+from selenium.webdriver.support import expected_conditions as EC
 import time
```
17-19: Replace sleep with an explicit wait. More reliable than fixed delays.

Apply this diff:

```diff
-# 페이지 로딩을 위해 대기
-time.sleep(3)
+# 페이지 로딩: 리뷰 컨테이너가 나타날 때까지 대기
+WebDriverWait(driver, 10).until(
+    EC.presence_of_element_located((By.CSS_SELECTOR, ".review-item-container"))
+)
```
38-45: Brittle CSS selectors. Classes like `.review-item-container` on Yanolja are likely ephemeral. Prefer stable attributes (e.g., data-*, aria-labels, semantic roles) or text anchors to reduce breakage.
61-66: Ensure ratings and reviews align before zipping. This prevents silent row drops if the lengths differ.

Apply this diff:

```diff
-# 별점과 리뷰를 결합하여 리스트 생성
-data = list(zip(ratings, reviews))
+# 별점과 리뷰를 결합하기 전에 길이 검증
+assert len(ratings) == len(reviews), f"Mismatched lengths: ratings={len(ratings)}, reviews={len(reviews)}"
+data = list(zip(ratings, reviews))
```
97-101: Write reviews and summary to separate Excel sheets. Concatenating heterogeneous DataFrames creates sparse columns; use multi-sheet output instead.

Apply this diff:

```diff
-# 최종 DataFrame 결합
-final_df = pd.concat([df_reviews, summary_df], ignore_index=True)
-
-# Excel 파일로 저장
-final_df.to_excel('yanolja.xlsx', index = False)
+# Excel 파일로 저장(시트 분리)
+with pd.ExcelWriter('yanolja.xlsx', engine='openpyxl') as writer:
+    df_reviews.to_excel(writer, sheet_name='reviews', index=False)
+    summary_df.to_excel(writer, sheet_name='summary', index=False)
```
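The sparse-column effect described here can be reproduced without touching Excel at all; `df_reviews` and `summary_df` below are made-up stand-ins for the script's frames:

```python
import pandas as pd

# Stand-ins for the script's two heterogeneous frames (columns chosen for illustration).
df_reviews = pd.DataFrame({"rating": [5, 4], "review": ["good", "ok"]})
summary_df = pd.DataFrame({"average_rating": [4.5]})

# Concatenating mismatched frames yields the union of columns, padded with NaN.
combined = pd.concat([df_reviews, summary_df], ignore_index=True)
print(combined.shape)                    # (3, 3)
print(int(combined.isna().sum().sum()))  # 4
```

Every cell that neither frame supplied becomes NaN, which is exactly what ends up as blank cells in the exported spreadsheet.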
10-16: Always quit the driver via try/finally. This prevents leaked Chrome processes on exceptions.

Example:

```python
driver = webdriver.Chrome()
try:
    driver.get(url)
    # ... rest of scraping ...
finally:
    driver.quit()
```

Also applies to: 103-104
13-16: Check the site's ToS/robots and throttle requests. Be respectful of Yanolja's policies; consider random sleeps/backoff and a user-agent.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dynamic-crawling/yanolja.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
dynamic-crawling/yanolja.py
1-4: Expected a statement (invalid-syntax)

1-4: Simple statements must be separated by newlines or semicolons (invalid-syntax)
🔇 Additional comments (1)
dynamic-crawling/yanolja.py (1)
21-27: Delete the unused Selenium query inside the scroll loop.

`find_elements("css selector", ...)` is currently unused; the raw-string locator itself is valid in Selenium 4 (`By.CSS_SELECTOR` equals `"css selector"`), but the result is discarded, so the call only slows down each scroll iteration.

Apply this diff:

```diff
 for _ in range(scroll_count):
     ######## your code here ########
     driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
-    # 리뷰 요소 가져오기 (예시: 리뷰 텍스트가 들어있는 div 태그)
-    reviews = driver.find_elements("css selector", ".css-vjs6b8")
     time.sleep(1)  # 스크롤 이후 대기
```

Likely an incorrect or invalid review comment.
```
!pip install selenium
!pip install bs4
!pip install pandas
!pip install openpyxl
```
Remove notebook-style pip commands (syntax error in .py).
These lines break the script (see Ruff invalid-syntax). Manage dependencies via requirements.txt, not in-code.
Apply this diff:
```diff
-!pip install selenium
-!pip install bs4
-!pip install pandas
-!pip install openpyxl
```

Optionally add a requirements.txt:

```
selenium>=4.23
beautifulsoup4>=4.12
pandas>=2.2
openpyxl>=3.1
```
🧰 Tools
🪛 Ruff (0.13.1)
1-4: Expected a statement (invalid-syntax)

1-4: Simple statements must be separated by newlines or semicolons (invalid-syntax)
🤖 Prompt for AI Agents
In dynamic-crawling/yanolja.py around lines 1 to 4, the notebook-style "!pip
install ..." lines cause invalid Python syntax and must be removed; delete those
lines and instead add a requirements.txt (suggested contents: selenium>=4.23,
beautifulsoup4>=4.12, pandas>=2.2, openpyxl>=3.1) at the repo root, and update
project docs/README or CI to run pip install -r requirements.txt so dependencies
are managed outside the .py file.
```python
for review_container in reviews_class:
    star_container = review_container.select_one(".css-rz7kwu")
    if not star_container:
        ratings.append(0)
        continue
    stars = star_container.find_all("svg")
    filled_stars = sum(
        1 for star in stars if not (star.find("path") and star.find("path").get("fill-rule") == "evenodd")
    )  # stars 안의 각 star에 대해 path의 fill-rule 속성이 evenodd이면 1을 더함
    ratings.append(filled_stars)
```
Star-count logic inverted vs comment.
The comment says “count when fill-rule == evenodd”, but code counts when it’s NOT. Fix to avoid wrong ratings.
Apply this diff:
```diff
-    stars = star_container.find_all("svg")
-    filled_stars = sum(
-        1 for star in stars if not (star.find("path") and star.find("path").get("fill-rule") == "evenodd")
-    )  # stars 안의 각 star에 대해 path의 fill-rule 속성이 evenodd이면 1을 더함
+    stars = star_container.find_all("svg")
+    filled_stars = sum(
+        1 for star in stars
+        if (star.find("path") and star.find("path").get("fill-rule") == "evenodd")
+    )  # fill-rule == "evenodd" 인 경우만 카운트
```
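The corrected counting rule can be sanity-checked with stdlib XML parsing alone; the markup below is an invented stand-in for Yanolja's star container, borrowing only the `fill-rule` convention from the discussion above:

```python
import xml.etree.ElementTree as ET

# Invented container: three "filled" stars (fill-rule="evenodd"), two empty ones.
star_html = """
<div class="stars">
  <svg><path fill-rule="evenodd"/></svg>
  <svg><path fill-rule="evenodd"/></svg>
  <svg><path fill-rule="evenodd"/></svg>
  <svg><path fill-rule="nonzero"/></svg>
  <svg><path fill-rule="nonzero"/></svg>
</div>
"""

root = ET.fromstring(star_html)
# Count a star only when its <path> exists AND carries fill-rule="evenodd".
filled = sum(
    1
    for svg in root.findall("svg")
    if (path := svg.find("path")) is not None
    and path.get("fill-rule") == "evenodd"
)
print(filled)  # 3
```

With the original inverted condition the same markup would count 2, which is how the bug would surface as systematically wrong ratings.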
+ ) # fill-rule == "evenodd" 인 경우만 카운트📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for review_container in reviews_class: | |
| star_container = review_container.select_one(".css-rz7kwu") | |
| if not star_container: | |
| ratings.append(0) | |
| continue | |
| stars = star_container.find_all("svg") | |
| filled_stars = sum( | |
| 1 for star in stars if not (star.find("path") and star.find("path").get("fill-rule") == "evenodd") | |
| ) #stars 안의 각 star에 대해 path의 fill-rule 속성이 evenodd이면 1을 더함 | |
| ratings.append(filled_stars) | |
| for review_container in reviews_class: | |
| star_container = review_container.select_one(".css-rz7kwu") | |
| if not star_container: | |
| ratings.append(0) | |
| continue | |
| stars = star_container.find_all("svg") | |
| filled_stars = sum( | |
| 1 for star in stars | |
| if (star.find("path") and star.find("path").get("fill-rule") == "evenodd") | |
| ) # fill-rule == "evenodd" 인 경우만 카운트 | |
| ratings.append(filled_stars) |
🤖 Prompt for AI Agents
In dynamic-crawling/yanolja.py around lines 48 to 58, the star-count logic
currently increments when the path's fill-rule is NOT "evenodd", which
contradicts the comment — change the condition so it counts a star only when the
path exists and path.get("fill-rule") == "evenodd". Update the generator
expression to explicitly check equality (safely handling missing path or
attribute) and append that count as before.
```python
# 평균 별점 계산
average_rating = sum(ratings) / len(ratings)
```
Guard against division by zero when no ratings found.
Apply this diff:
```diff
-# 평균 별점 계산
-average_rating = sum(ratings) / len(ratings)
+# 평균 별점 계산
+average_rating = round(sum(ratings) / len(ratings), 2) if ratings else None
```
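The suggested guard, restated as a tiny function so both branches are easy to exercise:

```python
def average_rating(ratings):
    """Mean rating rounded to 2 decimals, or None when no ratings were scraped."""
    return round(sum(ratings) / len(ratings), 2) if ratings else None

print(average_rating([5, 4, 4]))  # 4.33
print(average_rating([]))         # None
```

Returning None (rather than 0) keeps "no reviews" distinguishable from "all zero-star reviews" in the exported summary.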
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # 평균 별점 계산 | |
| average_rating = sum(ratings) / len(ratings) | |
| # 평균 별점 계산 | |
| average_rating = round(sum(ratings) / len(ratings), 2) if ratings else None |
🤖 Prompt for AI Agents
In dynamic-crawling/yanolja.py around lines 67 to 69, guard against division by
zero when computing average_rating by checking if ratings is empty before
dividing; if len(ratings) == 0 set average_rating to a safe default (e.g., 0 or
None) otherwise compute sum(ratings) / len(ratings), and optionally add a brief
log or comment to clarify the chosen default.
Actionable comments posted: 2
🧹 Nitpick comments (7)
static-crawling/static-crawling_assignment.py (7)
5-7: Add a browser-like User-Agent (and optional `time`) for polite, reliable requests.

```diff
 from bs4 import BeautifulSoup
 import requests
+
+import time
+
+# Use a realistic UA to avoid being blocked by basic bot filters.
+HEADERS = {
+    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
+}
```
8-13: Drop the initial one-off request; it's redundant. You fetch the pages again below, so this block isn't used after cleanup.

```diff
-# 알라딘 베스트셀러 페이지 URL
-url = "https://www.aladin.co.kr/shop/common/wbest.aspx?BestType=Bestseller&BranchType=1&CID=0&page=1&cnt=1000&SortOrder=1"
-response = requests.get(url)  # 요청 보내기
-html = response.text  # 응답 받은 HTML 문서
-soup = BeautifulSoup(html, "html.parser")  # BeautifulSoup으로 파싱
```
14-31: Remove the single-item scrape and prints. It is not robust (no error handling) and is duplicated by the loop below.

```diff
-tree = soup.select_one("div.ss_book_box")
-
-# 제목과 링크 추출
-title_tag = tree.select_one(".bo3")
-link = title_tag.get("href")
-title = title_tag.text
-
-# 할인가와 별점 추출
-price_tag = tree.select_one(".ss_p2")
-price = price_tag.text
-review_tag = tree.select_one(".star_score")
-print("제목:",title)
-print("링크:", link)
-
-#price_tag.text, review_tag.text
-review = review_tag.text
-print("할인가:",price)
-print("별점:",review)
```
33-43: Remove the debug print loop (or move it behind a debug flag). It produces noisy console output and duplicated work; the data-collection loop below suffices.

```diff
-trees = soup.select("div.ss_book_box")
-for tree in trees:
-    try:
-        title = tree.select_one(".bo3")
-        title_text = title.text
-        title_link = title.attrs['href']
-        price = (tree.select_one(".ss_p2")).text
-        review = (tree.select_one(".star_score")).text
-        print(title_text, title_link, price, review)
-    except: continue
```
53-63: Avoid bare `except`, sanitize text, and handle missing nodes. This removes E701, improves robustness, and cleans values.

```diff
 trees = soup.select("div.ss_book_box")
 for tree in trees:
-    try:
-        title = tree.select_one(".bo3")
-        title_text = title.text
-        title_link = title.attrs['href']
-        price = (tree.select_one(".ss_p2")).text
-        review = (tree.select_one(".star_score")).text
-
-        datas.append([title_text, title_link, price, review])
-
-    except: continue
+    try:
+        title_tag = tree.select_one(".bo3")
+        title_text = title_tag.get_text(strip=True)
+        title_link = title_tag.get('href', '')
+        price = tree.select_one(".ss_p2").get_text(strip=True)
+        review_tag = tree.select_one(".star_score")
+        review = review_tag.get_text(strip=True) if review_tag else ""
+        datas.append([title_text, title_link, price, review])
+    except (AttributeError, KeyError):
+        continue
```
65-68: Add a CSV encoding for Korean (Excel compatibility).

```diff
 # csv 파일로 저장해 봅시다.
-df.to_csv('static-crawling_assignment.csv', index = False)
+df.to_csv('static-crawling_assignment.csv', index=False, encoding='utf-8-sig')
```
46-48: Crawl responsibly (robots/ToS, throttle, page size). Confirm permission per robots.txt/ToS, keep throttling, and consider reducing cnt=1000 to limit load.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
static-crawling/static-crawling_assignment.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
static-crawling/static-crawling_assignment.py
1-3: Expected a statement (invalid-syntax)

1-3: Simple statements must be separated by newlines or semicolons (invalid-syntax)

42-42: Multiple statements on one line (colon) (E701)

63-63: Multiple statements on one line (colon) (E701)
```
!pip install bs4
!pip install requests
!pip install pandas
```
Remove Jupyter magics (!pip ...): invalid in .py.
These lines break execution; manage dependencies outside the script (e.g., requirements.txt).
```diff
-!pip install bs4
-!pip install requests
-!pip install pandas
```
🧰 Tools
🪛 Ruff (0.13.1)
1-3: Expected a statement (invalid-syntax)

1-3: Simple statements must be separated by newlines or semicolons (invalid-syntax)
🤖 Prompt for AI Agents
In static-crawling/static-crawling_assignment.py around lines 1 to 3, the file
contains Jupyter magic shell commands (!pip install ...) which are invalid in a
.py module; remove these lines and manage dependencies externally (e.g., add
required packages to requirements.txt or install via environment/CI) so the
script contains only valid Python import and code.
```python
datas = []
for page_num in range(1, 4):
    url = f"https://www.aladin.co.kr/shop/common/wbest.aspx?BestType=Bestseller&BranchType=1&CID=0&page=1&cnt=1000&SortOrder={page_num}"
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    trees = soup.select("div.ss_book_box")
```
Fix pagination bug: you’re varying SortOrder, not page. Also add headers, timeout, status check, and a short delay.
```diff
 datas = []
 for page_num in range(1, 4):
-    url = f"https://www.aladin.co.kr/shop/common/wbest.aspx?BestType=Bestseller&BranchType=1&CID=0&page=1&cnt=1000&SortOrder={page_num}"
-    response = requests.get(url)
+    url = f"https://www.aladin.co.kr/shop/common/wbest.aspx?BestType=Bestseller&BranchType=1&CID=0&page={page_num}&cnt=1000&SortOrder=1"
+    response = requests.get(url, headers=HEADERS, timeout=10)
+    response.raise_for_status()
     html = response.text
     soup = BeautifulSoup(html, 'html.parser')
+    time.sleep(1)  # be polite
```

Committable suggestion skipped: line range outside the PR's diff.
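Because the bug lives entirely in the query string, it helps to build the URLs separately from the network call; the helper name below is ours for illustration, not from the PR:

```python
BASE = "https://www.aladin.co.kr/shop/common/wbest.aspx"

def bestseller_url(page_num, cnt=50):
    # `page` varies per iteration; SortOrder stays fixed (the original had them swapped).
    return (
        f"{BASE}?BestType=Bestseller&BranchType=1&CID=0"
        f"&page={page_num}&cnt={cnt}&SortOrder=1"
    )

urls = [bestseller_url(p) for p in range(1, 4)]
print(urls[1])
```

Keeping URL construction pure makes the pagination testable without hitting the server at all.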
🤖 Prompt for AI Agents
In static-crawling/static-crawling_assignment.py around lines 46 to 52, the loop
incorrectly varies the SortOrder parameter instead of the page parameter and
lacks request hardening; change the URL to vary page (use the loop variable in
the page=... query), add a headers dict with a realistic User-Agent, pass a
timeout to requests.get, check response.status_code (or call
response.raise_for_status()) before parsing, and add a short delay (e.g.,
time.sleep(1)) between requests to avoid hammering the server.
```python
for tree in trees:
    try:
        title = tree.select_one(".bo3")
        title_text = title.text
        title_link = title.attrs['href']
        price = (tree.select_one(".ss_p2")).text
        review = (tree.select_one(".star_score")).text

        datas.append([title_text, title_link, price, review])

    except: continue
```
Using a bare try/except can silently drop an entire item; how about filling in None for any missing value instead, so the row is still kept?
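A minimal sketch of that idea: the sample HTML is invented, and only the `.bo3`/`.ss_p2` selectors come from the script under review:

```python
from bs4 import BeautifulSoup

html = '<div class="ss_book_box"><a class="bo3" href="/item/1">어떤 책</a></div>'
box = BeautifulSoup(html, "html.parser").select_one("div.ss_book_box")

def safe_text(tag):
    # Return None (not an exception) when a selector misses, so the row is still kept.
    return tag.get_text(strip=True) if tag is not None else None

title = safe_text(box.select_one(".bo3"))
price = safe_text(box.select_one(".ss_p2"))  # absent in this sample
print(title, price)  # 어떤 책 None
```

Rows with None in one column can still be counted, and the gaps are visible in the final CSV instead of vanishing.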
```python
final_df.to_excel('yanolja.xlsx', index = False)

# 드라이버 종료
driver.quit()
```
Summary by CodeRabbit
New Features
Chores