Walkthrough

Adds a Selenium-based crawler script for Yanolja reviews, a static Aladin bestseller scraper script, and updates a notebook by executing cells (capturing pip install/build logs and a rendered DataFrame) and changing the kernel metadata to Python 3.10.18.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Dev as Developer
    participant Script as yanolja.py
    participant Driver as Selenium WebDriver (Chrome)
    participant Site as Yanolja Reviews Page
    participant Parser as BeautifulSoup
    participant NLP as Text Processing
    participant DF as pandas.DataFrame
    participant FS as File System
    Dev->>Script: Run script
    Script->>Driver: Initialize Chrome WebDriver
    Script->>Driver: get(URL)
    Driver->>Site: Navigate
    Note over Script,Driver: Scroll loop to load more reviews
    loop Scroll & load
        Script->>Driver: window.scrollTo(...)
        Script-->>Script: sleep (fixed wait)
    end
    Script->>Driver: Locate review elements (Selenium)
    Script->>Driver: Extract star SVG elements
    Script-->>Script: Count filled stars (rating)
    Script->>Driver: Get page_source
    Script->>Parser: Parse HTML
    Parser-->>Script: Review texts
    Script->>NLP: Clean, remove stopwords, extract Hangul, count freq
    NLP-->>Script: Top common words
    Script->>DF: Build DataFrame (reviews + ratings)
    Script->>DF: Append summary row(s) (avg rating, top words)
    DF->>FS: Save as yanolja.xlsx
    Script->>Driver: quit()
    Script-->>Dev: Exit
    note right of Script: Basic error handling for missing stars
```
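The text-processing step the diagram describes (extract Hangul, drop stopwords, count frequencies) can be sketched in plain Python. The review strings and the stopword set below are illustrative stand-ins, not values from the actual crawler:

```python
import re
from collections import Counter

# Illustrative inputs; the real script builds these from the parsed reviews
reviews = ["방이 아주 깨끗하고 좋아요", "위치가 좋아요 직원이 친절해요"]
stopwords = {"아주"}  # hypothetical stopword list

words = []
for review in reviews:
    # Keep only runs of Hangul characters, then drop stopwords
    for word in re.findall(r"[가-힣]+", review):
        if word not in stopwords:
            words.append(word)

top_words = Counter(words).most_common(3)
print(top_words)
```

`most_common` returns ties in first-insertion order, so the top entry here is the repeated word with its count.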
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
static-crawling/jisigin.py (1)
1-266: Critical: This file is Jupyter JSON saved with a .py extension. It won’t import or execute as Python. Choose one:
- Rename to .ipynb (keep as a notebook), or
- Replace contents with a valid Python script.
Option B (replace with a working script, robust selectors, headers, timeouts, pagination, Excel export):
```diff
-{
-  "cells": [
-  ... (notebook JSON omitted for brevity) ...
-}
+import re
+import time
+import random
+import urllib.parse
+from typing import List, Dict
+
+import pandas as pd
+import requests
+from bs4 import BeautifulSoup
+
+BASE_URL = "https://kin.naver.com"
+SEARCH_URL = f"{BASE_URL}/search/list.naver"
+HEADERS = {
+    "User-Agent": (
+        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
+        "AppleWebKit/537.36 (KHTML, like Gecko) "
+        "Chrome/127.0.0.0 Safari/537.36"
+    )
+}
+
+def fetch_html(url: str, session: requests.Session) -> str:
+    resp = session.get(url, headers=HEADERS, timeout=10)
+    resp.raise_for_status()
+    return resp.text
+
+def parse_item(tree) -> Dict[str, str]:
+    # Title/link
+    a = tree.select_one("dt > a[href]") or tree.select_one('a[class*="kin.txt"]')
+    title = a.get_text(strip=True) if a else ""
+    link = urllib.parse.urljoin(BASE_URL, a["href"]) if a and a.has_attr("href") else ""
+
+    # Date/category/hit are usually inside dd .txt_inline and siblings
+    date_el = tree.select_one(".txt_inline")
+    date_text = date_el.get_text(" ", strip=True) if date_el else ""
+
+    cat_el = tree.select_one('a[class*="kin.cat2"]') or tree.select_one("dd .txt_block a")
+    category = cat_el.get_text(strip=True) if cat_el else ""
+
+    hit_el = tree.select_one(".hit")
+    hit_text = hit_el.get_text(" ", strip=True) if hit_el else ""
+    hit_num_match = re.search(r"\d+", hit_text)
+    hit = hit_num_match.group(0) if hit_num_match else "0"
+
+    return {"title": title, "link": link, "date": date_text, "category": category, "hit": hit}
+
+def crawl_kin(query: str, pages: int = 3, delay_range=(0.5, 1.5)) -> pd.DataFrame:
+    params_base = {"query": query}
+    data: List[Dict[str, str]] = []
+    with requests.Session() as session:
+        for page in range(1, pages + 1):
+            params = dict(params_base, page=str(page))
+            url = f"{SEARCH_URL}?{urllib.parse.urlencode(params)}"
+            html = fetch_html(url, session)
+            soup = BeautifulSoup(html, "html.parser")
+            trees = soup.select(".basic1 > li > dl")
+            for tree in trees:
+                item = parse_item(tree)
+                if item["title"] and item["link"]:
+                    data.append(item)
+            time.sleep(random.uniform(*delay_range))
+    return pd.DataFrame(data, columns=["title", "link", "date", "category", "hit"])
+
+if __name__ == "__main__":
+    df = crawl_kin("삼성전자", pages=3)
+    df.to_excel("naver_kin_samsung.xlsx", index=False)
+    print(f"Saved {len(df)} rows to naver_kin_samsung.xlsx")
```

Note: Respect robots.txt/ToS, add longer delays if needed.
If you prefer to keep this as a teaching notebook, rename to .ipynb and finish the placeholders (response/soup/title/link/date/etc.) so it executes end-to-end.
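A quick way to confirm which case you are in is to try parsing the file as JSON and looking for a `cells` key. A small stdlib check along these lines (the inline sample stands in for the file's contents):

```python
import json

def is_notebook_json(text: str) -> bool:
    """Return True if the text parses as Jupyter notebook JSON."""
    try:
        data = json.loads(text)
    except ValueError:
        return False
    return isinstance(data, dict) and "cells" in data

# Inline sample; in practice you would read the .py file's text instead
sample = '{"cells": [], "metadata": {}, "nbformat": 4}'
print(is_notebook_json(sample))        # True
print(is_notebook_json("import re"))   # False: plain Python source
```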
🧹 Nitpick comments (5)
api-practice/newpaper.ipynb (2)
379-387: Harden article parsing and fix unused loop variable.
- Newspaper3k frequently fails on some URLs; current code will raise and stop the loop.
- `index` is unused (ruff B007).
Apply:
```diff
-for index, row in articles.iterrows():
+for _idx, row in articles.iterrows():
@@
-    article = Article(url)
-    article.download()
-    article.parse()
-    text = article.text  # article body
+    try:
+        article = Article(url)
+        article.download()
+        article.parse()
+        text = article.text
+    except Exception as e:
+        # Skip problematic articles but keep the pipeline running
+        text = ""
```
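The skip-and-continue pattern in the suggested diff can be exercised with stand-in functions (nothing below is newspaper3k's real API; `fake_download` is a hypothetical helper that fails on one URL):

```python
def fake_download(url: str) -> str:
    # Stand-in for Article.download()/parse(); fails on one URL
    if "broken" in url:
        raise RuntimeError("download failed")
    return f"body of {url}"

urls = ["https://example.com/a", "https://example.com/broken", "https://example.com/b"]
texts = []
for url in urls:
    try:
        text = fake_download(url)
    except Exception:
        # Skip problematic articles but keep the pipeline running
        text = ""
    texts.append(text)

print(texts)
```

One bad URL yields an empty string instead of aborting the whole loop.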
419-420: CSV encoding for Excel compatibility.

When opening in Excel (Korean locale), utf-8-sig avoids mojibake.
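The `utf-8-sig` codec simply prepends the UTF-8 byte-order mark that Excel uses to detect the encoding; a minimal stdlib check of that behavior:

```python
import codecs

text = "제목,본문\n"  # a hypothetical CSV header row
encoded = text.encode("utf-8-sig")

# utf-8-sig output is exactly the UTF-8 BOM plus plain UTF-8 bytes
assert encoded == codecs.BOM_UTF8 + text.encode("utf-8")
print(encoded[:3])  # b'\xef\xbb\xbf'
```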
```diff
-articles_df.to_csv('articles_data.csv', index=False)
+articles_df.to_csv('articles_data.csv', index=False, encoding='utf-8-sig')
```

static-crawling/jisigin.py (3)
25-35: Remove pip installs from the notebook/script.

Manage dependencies via requirements.txt/conda env, not inline.
```diff
-!pip install bs4
-!pip install requests
-!pip install pandas
-!pip install openpyxl
```
174-185: Guard against missing elements to avoid AttributeError.

Selectors may not match every item; add presence checks.
```diff
-for tree in trees:
-    title = tree.select_one("._nclicks\\:kin\\.txt").text
-    link = tree.select_one("._nclicks\\:kin\\.txt").attrs['href']
-    date = tree.select_one(".txt_inline").text
-    category = tree.select_one("._nclicks\\:kin\\.cat2").text
-    hit = tree.select_one(".hit").text.split()[1]
-
-    # print results
-    print(title, link, date, category, hit)
+for tree in trees:
+    a = tree.select_one("dt > a[href]") or tree.select_one('a[class*="kin.txt"]')
+    title = a.get_text(strip=True) if a else ""
+    link = a["href"] if a and a.has_attr("href") else ""
+    date_el = tree.select_one(".txt_inline")
+    date = date_el.get_text(" ", strip=True) if date_el else ""
+    cat_el = tree.select_one('a[class*="kin.cat2"]') or tree.select_one("dd .txt_block a")
+    category = cat_el.get_text(strip=True) if cat_el else ""
+    hit_el = tree.select_one(".hit")
+    hit = re.sub(r"[^0-9]", "", hit_el.get_text()) if hit_el else "0"
+    print(title, link, date, category, hit)
```
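The presence-check pattern above boils down to a fallback selector plus a None guard, and it can be demonstrated without bs4 by using stub objects in place of BeautifulSoup tags (`Tag` and `text_or` below are illustrative helpers, not part of any library):

```python
class Tag:
    # Minimal stand-in for a BeautifulSoup tag (illustrative only)
    def __init__(self, text):
        self._text = text

    def get_text(self, sep=" ", strip=False):
        return self._text.strip() if strip else self._text

def text_or(el, default=""):
    # Guarded extraction: never raises AttributeError on a missing element
    return el.get_text(strip=True) if el is not None else default

primary, fallback = None, Tag(" 네이버 지식iN ")
el = primary or fallback            # first matching selector wins
print(text_or(el))                  # 네이버 지식iN
print(text_or(None, default="0"))   # 0
```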
235-241: Save to Excel with engine and index handling.

```diff
-# Save to Excel using pandas
+df.to_excel("naver_kin_samsung.xlsx", index=False)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)
- api-practice/articles_data.csv is excluded by !**/*.csv
- dynamic-crawling/yanolja.xlsx is excluded by !**/*.xlsx
- static-crawling/jisigin.xlsx is excluded by !**/*.xlsx
- static-crawling/static-crawling_assignment.csv is excluded by !**/*.csv
📒 Files selected for processing (2)
- api-practice/newpaper.ipynb (3 hunks)
- static-crawling/jisigin.py (1 hunk)
🧰 Additional context used
🪛 Ruff (0.13.1)
static-crawling/jisigin.py
1-265: Found useless expression. Either assign it to a variable or remove it.
(B018)
49-49: Undefined name null
(F821)
77-77: Undefined name null
(F821)
101-101: Undefined name null
(F821)
114-114: Undefined name null
(F821)
126-126: Undefined name null
(F821)
138-138: Undefined name null
(F821)
170-170: Undefined name null
(F821)
200-200: Undefined name null
(F821)
236-236: Undefined name null
(F821)
api-practice/newpaper.ipynb
25-25: Loop control variable index not used within loop body
Rename unused index to _index
(B007)
Flagged hunk (abridged from roughly 190 lines of notebook JSON in the original): the cell's `"outputs": []` becomes three committed output streams. First, a stdout stream with the full `gdeltdoc` and `newspaper3k==0.2.8` pip install logs (Collecting/Downloading/Building-wheel progress, ending in `Successfully installed gdeltdoc-1.12.0` and `Successfully installed Pillow-11.3.0 PyYAML-6.0.2 click-8.3.0 cssselect-1.3.0 feedfinder2-0.0.4 feedparser-6.0.12 filelock-3.19.1 jieba3k-0.35.1 joblib-1.5.2 lxml-6.0.2 newspaper3k-0.2.8 nltk-3.9.1 regex-2025.9.18 requests-file-2.1.0 sgmllib3k-1.0.0 tinysegmenter-0.3 tldextract-5.3.0`). Second, a stderr stream with four `DEPRECATION: Building '<pkg>' using the legacy setup.py bdist_wheel mechanism` warnings (tinysegmenter, feedfinder2, jieba3k, sgmllib3k). Third, another stdout stream covering the `lxml_html_clean` install and pandas requirement checks.
🛠️ Refactor suggestion
🧩 Analysis chain
Please remove in-notebook pip installs and committed outputs.
- Installing packages inside the notebook and committing the long pip logs makes diffs noisy and harms reproducibility. Prefer requirements.txt/conda env (environment.yml) and keep notebooks output-free in VCS.
- lxml_html_clean is installed but unused.
Apply this minimal change (remove the install lines from the cell):
```diff
-!pip install gdeltdoc
-!pip install newspaper3k==0.2.8
-!pip install lxml_html_clean
-!pip install pandas
```

To verify outputs are cleared and no “pip install” remains:
🏁 Script executed:
```bash
#!/bin/bash
# Count total outputs in the notebook (should be 0)
jq '[.cells[].outputs|length]|add' api-practice/newpaper.ipynb
# Find any in-notebook installs (should return nothing)
rg -nP '^\s*!\s*pip\s+install' -- api-practice/newpaper.ipynb
```

Length of output: 1
Clear outputs in api-practice/newpaper.ipynb — remove committed pip logs.
- Verification: jq reports 4 cell outputs remain; ripgrep found no in-notebook `!pip install` lines (install commands removed but output logs are still committed). Clear outputs (e.g. `jupyter nbconvert --clear-output --inplace api-practice/newpaper.ipynb` or use nbstripout) and commit.
🤖 Prompt for AI Agents
In api-practice/newpaper.ipynb around lines 22 to 215 the notebook still
contains large pip install stdout/stderr outputs (pip logs) committed as cell
outputs; remove those outputs by clearing notebook outputs (e.g. run jupyter
nbconvert --clear-output --inplace api-practice/newpaper.ipynb or use
nbstripout) then verify with jq/ripgrep that cell outputs are empty and no '!pip
install' outputs remain, commit the cleaned notebook, and optionally add
nbstripout or a .gitattributes rule to prevent future committed outputs.
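If you go the nbstripout route, a `.gitattributes` entry along these lines registers the filter for all notebooks. This is a sketch, not a drop-in config; it assumes the `nbstripout` filter has been set up in git config (e.g. via `nbstripout --install`):

```gitattributes
*.ipynb filter=nbstripout
```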
Flagged hunk (abridged): the cell's `"outputs": []` becomes an `execute_result` (execution_count 4) holding a rendered five-row DataFrame with columns `title`, `url`, and `text`, committed in both `text/html` and `text/plain` form. The rows are NYTimes articles ("The C . E . O . s Who Just Wont Quit - The New...", "C . Gordon Bell , Creator of a Personal Comput...", "Wayve , an A . I . Start - Up for Autonomous D...", "Biden to Announce A . I . Center in Wisconsin ...", "How Rich Candidates Burned Cash on Running for...") with their nytimes.com URLs and truncated scraped body text.
Avoid committing scraped article text in outputs (potential licensing risk).
The rendered table includes “text” content from NYTimes. Commit history will store this third‑party content. Clear outputs and, if you must preview, display only title/url.
Apply this change to the display line to avoid rendering full text in outputs:
```diff
-articles_df.head()  # print the first 5 articles
+articles_df[['title','url']].head()
```

Then clear all notebook outputs before committing (see verification script in prior comment).
🤖 Prompt for AI Agents
In api-practice/newpaper.ipynb around lines 278–369 the notebook output is
rendering full scraped article "text" (NYTimes) which risks committing
third‑party content; update the display cell to only show title and url (remove
or omit the "text" column from the DataFrame/display call), then clear all
notebook outputs (run "Clear All Outputs" or nbconvert/nbstripout) and re-run
the verification script referenced in the prior comment to confirm no outputs
remain before committing.
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# 여러 페이지에서 정보 추출\n", | ||
| "data = []\n", | ||
| "for page_num in range(1, 4): # 1~3페이지 크롤링\n", | ||
| " url = f\"https://kin.naver.com/search/list.naver?query=%EC%82%BC%EC%84%B1%EC%A0%84%EC%9E%90&page={page_num}\"\n", | ||
| " response = requests.get(url)\n", | ||
| " html = response.text\n", | ||
| " soup = BeautifulSoup(html, 'html.parser')\n", | ||
| " trees = soup.select(\".basic1 > li > dl\")\n", | ||
| " \n", | ||
| " for tree in trees:\n", | ||
| " ################\n", | ||
| " \n", | ||
| " # 데이터를 리스트에 추가\n", | ||
| " data.append([title, link, date, category, hit])\n", | ||
| "\n", | ||
| "# DataFrame으로 변환\n", | ||
| "import pandas as pd\n", | ||
| "df = " | ||
| ] |
Complete the pagination block and build the DataFrame.

The loop body currently appends variables (`title`, `link`, `date`, `category`, `hit`) that are never defined, and the final `df =` assignment is left dangling. Note the suggested code uses `re.sub`, so `re` must be imported as well.

```diff
+import re
+
 data = []
 for page_num in range(1, 4):  # 1~3페이지 크롤링
     url = f"https://kin.naver.com/search/list.naver?query=%EC%82%BC%EC%84%B1%EC%A0%84%EC%9E%90&page={page_num}"
     response = requests.get(url)
     html = response.text
     soup = BeautifulSoup(html, 'html.parser')
     trees = soup.select(".basic1 > li > dl")
     for tree in trees:
-        ################
-
-        # 데이터를 리스트에 추가
-        data.append([title, link, date, category, hit])
+        a = tree.select_one("dt > a[href]") or tree.select_one('a[class*="kin.txt"]')
+        title = a.get_text(strip=True) if a else ""
+        link = a["href"] if a and a.has_attr("href") else ""
+        date_el = tree.select_one(".txt_inline")
+        date = date_el.get_text(" ", strip=True) if date_el else ""
+        cat_el = tree.select_one('a[class*="kin.cat2"]') or tree.select_one("dd .txt_block a")
+        category = cat_el.get_text(strip=True) if cat_el else ""
+        hit_el = tree.select_one(".hit")
+        hit = re.sub(r"[^0-9]", "", hit_el.get_text()) if hit_el else "0"
+        data.append([title, link, date, category, hit])
 # DataFrame으로 변환
 import pandas as pd
-df =
+df = pd.DataFrame(data, columns=["title", "link", "date", "category", "hit"])
```

Committable suggestion skipped: line range outside the PR's diff.
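The collect-then-build pattern the suggestion ends with can be exercised offline; the sample rows below are hypothetical stand-ins, not data from the Naver page:

```python
import pandas as pd

# Hypothetical parsed items; the empty dict simulates a row where every selector missed.
items = [{"title": "배당금 문의", "link": "/qna/1", "hit": "42"}, {}]

data = []
for item in items:
    # Safe defaults keep the row instead of crashing on missing fields.
    data.append([
        item.get("title", ""),
        item.get("link", ""),
        item.get("date", ""),
        item.get("category", ""),
        item.get("hit", "0"),
    ])

df = pd.DataFrame(data, columns=["title", "link", "date", "category", "hit"])
print(df.shape)  # (2, 5)
```

The single `pd.DataFrame(...)` call at the end is cheaper and clearer than appending to a DataFrame inside the loop.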
🧰 Tools
🪛 Ruff (0.13.1)
200-200: Undefined name null
(F821)
🤖 Prompt for AI Agents
In static-crawling/jisigin.py around lines 199 to 222, the pagination loop
currently appends undefined variables (title, link, date, category, hit) and
does not build the DataFrame; update the loop to extract each field from each
`tree` (e.g., select title text, href for link, parse date text, category text,
and hit count or default), handle missing elements with safe defaults, append a
fully-populated row like [title, link, date, category, hit] for each item, and
after the loop import pandas and create `df = pd.DataFrame(data, columns=[...])`
(optionally set dtypes and save/export as needed).
Actionable comments posted: 3
🧹 Nitpick comments (7)
dynamic-crawling/yanolja.py (7)
6-8: Add explicit wait imports for reliable loading.

Apply this diff:

```diff
 from selenium import webdriver
 from selenium.webdriver.common.by import By
+from selenium.webdriver.support.ui import WebDriverWait
+from selenium.webdriver.support import expected_conditions as EC
 import time
```
17-19: Replace sleep with an explicit wait. More reliable than fixed delays.

Apply this diff:

```diff
-# 페이지 로딩을 위해 대기
-time.sleep(3)
+# 페이지 로딩: 리뷰 컨테이너가 나타날 때까지 대기
+WebDriverWait(driver, 10).until(
+    EC.presence_of_element_located((By.CSS_SELECTOR, ".review-item-container"))
+)
```
38-45: Brittle CSS selectors. Classes like `.review-item-container` on Yanolja are likely ephemeral. Prefer stable attributes (e.g., data-*, aria-labels, semantic roles) or text anchors to reduce breakage.
61-66: Ensure ratings and reviews align before zipping. This prevents silent row drops if the lengths differ.

Apply this diff:

```diff
-# 별점과 리뷰를 결합하여 리스트 생성
-data = list(zip(ratings, reviews))
+# 별점과 리뷰를 결합하기 전에 길이 검증
+assert len(ratings) == len(reviews), f"Mismatched lengths: ratings={len(ratings)}, reviews={len(reviews)}"
+data = list(zip(ratings, reviews))
```
97-101: Write reviews and summary to separate Excel sheets. Concatenating heterogeneous DataFrames creates sparse columns; use multi-sheet output instead.

Apply this diff:

```diff
-# 최종 DataFrame 결합
-final_df = pd.concat([df_reviews, summary_df], ignore_index=True)
-
-# Excel 파일로 저장
-final_df.to_excel('yanolja.xlsx', index = False)
+# Excel 파일로 저장(시트 분리)
+with pd.ExcelWriter('yanolja.xlsx', engine='openpyxl') as writer:
+    df_reviews.to_excel(writer, sheet_name='reviews', index=False)
+    summary_df.to_excel(writer, sheet_name='summary', index=False)
```
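The sparse-column effect described here can be reproduced without touching Excel at all; `df_reviews` and `summary_df` below are made-up stand-ins for the script's frames:

```python
import pandas as pd

# Stand-ins for the script's two heterogeneous frames (columns chosen for illustration).
df_reviews = pd.DataFrame({"rating": [5, 4], "review": ["good", "ok"]})
summary_df = pd.DataFrame({"average_rating": [4.5]})

# Concatenating mismatched frames yields the union of columns, padded with NaN.
combined = pd.concat([df_reviews, summary_df], ignore_index=True)
print(combined.shape)                    # (3, 3)
print(int(combined.isna().sum().sum()))  # 4
```

Every cell that neither frame supplied becomes NaN, which is exactly what ends up as blank cells in the exported spreadsheet.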
10-16: Always quit the driver via try/finally. This prevents leaked Chrome processes on exceptions.

Example:

```python
driver = webdriver.Chrome()
try:
    driver.get(url)
    # ... rest of scraping ...
finally:
    driver.quit()
```

Also applies to: 103-104
13-16: Check the site's ToS/robots and throttle requests. Be respectful of Yanolja's policies; consider random sleeps/backoff and a user-agent.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dynamic-crawling/yanolja.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
dynamic-crawling/yanolja.py
1-4: Expected a statement (invalid-syntax)

1-4: Simple statements must be separated by newlines or semicolons (invalid-syntax)
🔇 Additional comments (1)
dynamic-crawling/yanolja.py (1)
21-27: Delete the unused Selenium query inside the scroll loop.

`find_elements("css selector", ...)` is currently unused; the raw-string locator itself is valid in Selenium 4 (`By.CSS_SELECTOR` equals `"css selector"`), but the result is discarded, so the call only slows down each scroll iteration.

Apply this diff:

```diff
 for _ in range(scroll_count):
     ######## your code here ########
     driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
-    # 리뷰 요소 가져오기 (예시: 리뷰 텍스트가 들어있는 div 태그)
-    reviews = driver.find_elements("css selector", ".css-vjs6b8")
     time.sleep(1)  # 스크롤 이후 대기
```

Likely an incorrect or invalid review comment.
```
!pip install selenium
!pip install bs4
!pip install pandas
!pip install openpyxl
```
Remove notebook-style pip commands (syntax error in .py).
These lines break the script (see Ruff invalid-syntax). Manage dependencies via requirements.txt, not in-code.
Apply this diff:
```diff
-!pip install selenium
-!pip install bs4
-!pip install pandas
-!pip install openpyxl
```

Optionally add a requirements.txt:

```
selenium>=4.23
beautifulsoup4>=4.12
pandas>=2.2
openpyxl>=3.1
```
🧰 Tools
🪛 Ruff (0.13.1)
1-4: Expected a statement (invalid-syntax)

1-4: Simple statements must be separated by newlines or semicolons (invalid-syntax)
🤖 Prompt for AI Agents
In dynamic-crawling/yanolja.py around lines 1 to 4, the notebook-style "!pip
install ..." lines cause invalid Python syntax and must be removed; delete those
lines and instead add a requirements.txt (suggested contents: selenium>=4.23,
beautifulsoup4>=4.12, pandas>=2.2, openpyxl>=3.1) at the repo root, and update
project docs/README or CI to run pip install -r requirements.txt so dependencies
are managed outside the .py file.
```python
for review_container in reviews_class:
    star_container = review_container.select_one(".css-rz7kwu")
    if not star_container:
        ratings.append(0)
        continue
    stars = star_container.find_all("svg")
    filled_stars = sum(
        1 for star in stars if not (star.find("path") and star.find("path").get("fill-rule") == "evenodd")
    )  # stars 안의 각 star에 대해 path의 fill-rule 속성이 evenodd이면 1을 더함
    ratings.append(filled_stars)
```
Star-count logic inverted vs comment.
The comment says “count when fill-rule == evenodd”, but code counts when it’s NOT. Fix to avoid wrong ratings.
Apply this diff:
```diff
-    stars = star_container.find_all("svg")
-    filled_stars = sum(
-        1 for star in stars if not (star.find("path") and star.find("path").get("fill-rule") == "evenodd")
-    )  # stars 안의 각 star에 대해 path의 fill-rule 속성이 evenodd이면 1을 더함
+    stars = star_container.find_all("svg")
+    filled_stars = sum(
+        1 for star in stars
+        if (star.find("path") and star.find("path").get("fill-rule") == "evenodd")
+    )  # fill-rule == "evenodd" 인 경우만 카운트
```
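The corrected counting rule can be sanity-checked with stdlib XML parsing alone; the markup below is an invented stand-in for Yanolja's star container, borrowing only the `fill-rule` convention from the discussion above:

```python
import xml.etree.ElementTree as ET

# Invented container: three "filled" stars (fill-rule="evenodd"), two empty ones.
star_html = """
<div class="stars">
  <svg><path fill-rule="evenodd"/></svg>
  <svg><path fill-rule="evenodd"/></svg>
  <svg><path fill-rule="evenodd"/></svg>
  <svg><path fill-rule="nonzero"/></svg>
  <svg><path fill-rule="nonzero"/></svg>
</div>
"""

root = ET.fromstring(star_html)
# Count a star only when its <path> exists AND carries fill-rule="evenodd".
filled = sum(
    1
    for svg in root.findall("svg")
    if (path := svg.find("path")) is not None
    and path.get("fill-rule") == "evenodd"
)
print(filled)  # 3
```

With the original inverted condition the same markup would count 2, which is how the bug would surface as systematically wrong ratings.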
+ ) # fill-rule == "evenodd" 인 경우만 카운트📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for review_container in reviews_class: | |
| star_container = review_container.select_one(".css-rz7kwu") | |
| if not star_container: | |
| ratings.append(0) | |
| continue | |
| stars = star_container.find_all("svg") | |
| filled_stars = sum( | |
| 1 for star in stars if not (star.find("path") and star.find("path").get("fill-rule") == "evenodd") | |
| ) #stars 안의 각 star에 대해 path의 fill-rule 속성이 evenodd이면 1을 더함 | |
| ratings.append(filled_stars) | |
| for review_container in reviews_class: | |
| star_container = review_container.select_one(".css-rz7kwu") | |
| if not star_container: | |
| ratings.append(0) | |
| continue | |
| stars = star_container.find_all("svg") | |
| filled_stars = sum( | |
| 1 for star in stars | |
| if (star.find("path") and star.find("path").get("fill-rule") == "evenodd") | |
| ) # fill-rule == "evenodd" 인 경우만 카운트 | |
| ratings.append(filled_stars) |
🤖 Prompt for AI Agents
In dynamic-crawling/yanolja.py around lines 48 to 58, the star-count logic
currently increments when the path's fill-rule is NOT "evenodd", which
contradicts the comment — change the condition so it counts a star only when the
path exists and path.get("fill-rule") == "evenodd". Update the generator
expression to explicitly check equality (safely handling missing path or
attribute) and append that count as before.
```python
# 평균 별점 계산
average_rating = sum(ratings) / len(ratings)
```
Guard against division by zero when no ratings found.
Apply this diff:
```diff
-# 평균 별점 계산
-average_rating = sum(ratings) / len(ratings)
+# 평균 별점 계산
+average_rating = round(sum(ratings) / len(ratings), 2) if ratings else None
```
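The suggested guard, restated as a tiny function so both branches are easy to exercise:

```python
def average_rating(ratings):
    """Mean rating rounded to 2 decimals, or None when no ratings were scraped."""
    return round(sum(ratings) / len(ratings), 2) if ratings else None

print(average_rating([5, 4, 4]))  # 4.33
print(average_rating([]))         # None
```

Returning None (rather than 0) keeps "no reviews" distinguishable from "all zero-star reviews" in the exported summary.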
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # 평균 별점 계산 | |
| average_rating = sum(ratings) / len(ratings) | |
| # 평균 별점 계산 | |
| average_rating = round(sum(ratings) / len(ratings), 2) if ratings else None |
🤖 Prompt for AI Agents
In dynamic-crawling/yanolja.py around lines 67 to 69, guard against division by
zero when computing average_rating by checking if ratings is empty before
dividing; if len(ratings) == 0 set average_rating to a safe default (e.g., 0 or
None) otherwise compute sum(ratings) / len(ratings), and optionally add a brief
log or comment to clarify the chosen default.
Actionable comments posted: 2
🧹 Nitpick comments (7)
static-crawling/static-crawling_assignment.py (7)
5-7: Add a browser-like User-Agent (and optional `time`) for polite, reliable requests.

```diff
 from bs4 import BeautifulSoup
 import requests
+
+import time
+
+# Use a realistic UA to avoid being blocked by basic bot filters.
+HEADERS = {
+    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
+}
```
8-13: Drop the initial one-off request; it's redundant. You fetch the pages again below, so this block isn't used after cleanup.

```diff
-# 알라딘 베스트셀러 페이지 URL
-url = "https://www.aladin.co.kr/shop/common/wbest.aspx?BestType=Bestseller&BranchType=1&CID=0&page=1&cnt=1000&SortOrder=1"
-response = requests.get(url)  # 요청 보내기
-html = response.text  # 응답 받은 HTML 문서
-soup = BeautifulSoup(html, "html.parser")  # BeautifulSoup으로 파싱
```
14-31: Remove the single-item scrape and prints. It is not robust (no error handling) and is duplicated by the loop below.

```diff
-tree = soup.select_one("div.ss_book_box")
-
-# 제목과 링크 추출
-title_tag = tree.select_one(".bo3")
-link = title_tag.get("href")
-title = title_tag.text
-
-# 할인가와 별점 추출
-price_tag = tree.select_one(".ss_p2")
-price = price_tag.text
-review_tag = tree.select_one(".star_score")
-print("제목:",title)
-print("링크:", link)
-
-#price_tag.text, review_tag.text
-review = review_tag.text
-print("할인가:",price)
-print("별점:",review)
```
33-43: Remove the debug print loop (or move it behind a debug flag). It produces noisy console output and duplicated work; the data-collection loop below suffices.

```diff
-trees = soup.select("div.ss_book_box")
-for tree in trees:
-    try:
-        title = tree.select_one(".bo3")
-        title_text = title.text
-        title_link = title.attrs['href']
-        price = (tree.select_one(".ss_p2")).text
-        review = (tree.select_one(".star_score")).text
-        print(title_text, title_link, price, review)
-    except: continue
```
53-63: Avoid bare `except`, sanitize text, and handle missing nodes. This removes E701, improves robustness, and cleans values.

```diff
 trees = soup.select("div.ss_book_box")
 for tree in trees:
-    try:
-        title = tree.select_one(".bo3")
-        title_text = title.text
-        title_link = title.attrs['href']
-        price = (tree.select_one(".ss_p2")).text
-        review = (tree.select_one(".star_score")).text
-
-        datas.append([title_text, title_link, price, review])
-
-    except: continue
+    try:
+        title_tag = tree.select_one(".bo3")
+        title_text = title_tag.get_text(strip=True)
+        title_link = title_tag.get('href', '')
+        price = tree.select_one(".ss_p2").get_text(strip=True)
+        review_tag = tree.select_one(".star_score")
+        review = review_tag.get_text(strip=True) if review_tag else ""
+        datas.append([title_text, title_link, price, review])
+    except (AttributeError, KeyError):
+        continue
```
65-68: Add a CSV encoding for Korean (Excel compatibility).

```diff
 # csv 파일로 저장해 봅시다.
-df.to_csv('static-crawling_assignment.csv', index = False)
+df.to_csv('static-crawling_assignment.csv', index=False, encoding='utf-8-sig')
```
46-48: Crawl responsibly (robots/ToS, throttle, page size). Confirm permission per robots.txt/ToS, keep throttling, and consider reducing cnt=1000 to limit load.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
static-crawling/static-crawling_assignment.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
static-crawling/static-crawling_assignment.py
1-3: Expected a statement (invalid-syntax)

1-3: Simple statements must be separated by newlines or semicolons (invalid-syntax)

42-42: Multiple statements on one line (colon) (E701)

63-63: Multiple statements on one line (colon) (E701)
```
!pip install bs4
!pip install requests
!pip install pandas
```
Remove Jupyter magics (!pip ...): invalid in .py.
These lines break execution; manage dependencies outside the script (e.g., requirements.txt).
```diff
-!pip install bs4
-!pip install requests
-!pip install pandas
```
🧰 Tools
🪛 Ruff (0.13.1)
1-3: Expected a statement (invalid-syntax)

1-3: Simple statements must be separated by newlines or semicolons (invalid-syntax)
🤖 Prompt for AI Agents
In static-crawling/static-crawling_assignment.py around lines 1 to 3, the file
contains Jupyter magic shell commands (!pip install ...) which are invalid in a
.py module; remove these lines and manage dependencies externally (e.g., add
required packages to requirements.txt or install via environment/CI) so the
script contains only valid Python import and code.
```python
datas = []
for page_num in range(1, 4):
    url = f"https://www.aladin.co.kr/shop/common/wbest.aspx?BestType=Bestseller&BranchType=1&CID=0&page=1&cnt=1000&SortOrder={page_num}"
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    trees = soup.select("div.ss_book_box")
```
Fix pagination bug: you’re varying SortOrder, not page. Also add headers, timeout, status check, and a short delay.
```diff
 datas = []
 for page_num in range(1, 4):
-    url = f"https://www.aladin.co.kr/shop/common/wbest.aspx?BestType=Bestseller&BranchType=1&CID=0&page=1&cnt=1000&SortOrder={page_num}"
-    response = requests.get(url)
+    url = f"https://www.aladin.co.kr/shop/common/wbest.aspx?BestType=Bestseller&BranchType=1&CID=0&page={page_num}&cnt=1000&SortOrder=1"
+    response = requests.get(url, headers=HEADERS, timeout=10)
+    response.raise_for_status()
     html = response.text
     soup = BeautifulSoup(html, 'html.parser')
+    time.sleep(1)  # be polite
```

Committable suggestion skipped: line range outside the PR's diff.
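Because the bug lives entirely in the query string, it helps to build the URLs separately from the network call; the helper name below is ours for illustration, not from the PR:

```python
BASE = "https://www.aladin.co.kr/shop/common/wbest.aspx"

def bestseller_url(page_num, cnt=50):
    # `page` varies per iteration; SortOrder stays fixed (the original had them swapped).
    return (
        f"{BASE}?BestType=Bestseller&BranchType=1&CID=0"
        f"&page={page_num}&cnt={cnt}&SortOrder=1"
    )

urls = [bestseller_url(p) for p in range(1, 4)]
print(urls[1])
```

Keeping URL construction pure makes the pagination testable without hitting the server at all.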
🤖 Prompt for AI Agents
In static-crawling/static-crawling_assignment.py around lines 46 to 52, the loop
incorrectly varies the SortOrder parameter instead of the page parameter and
lacks request hardening; change the URL to vary page (use the loop variable in
the page=... query), add a headers dict with a realistic User-Agent, pass a
timeout to requests.get, check response.status_code (or call
response.raise_for_status()) before parsing, and add a short delay (e.g.,
time.sleep(1)) between requests to avoid hammering the server.
```python
for tree in trees:
    try:
        title = tree.select_one(".bo3")
        title_text = title.text
        title_link = title.attrs['href']
        price = (tree.select_one(".ss_p2")).text
        review = (tree.select_one(".star_score")).text

        datas.append([title_text, title_link, price, review])

    except: continue
```
Using a bare try/except can silently drop an entire item; how about filling in None for any missing value instead, so the row is still kept?
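A minimal sketch of that idea: the sample HTML is invented, and only the `.bo3`/`.ss_p2` selectors come from the script under review:

```python
from bs4 import BeautifulSoup

html = '<div class="ss_book_box"><a class="bo3" href="/item/1">어떤 책</a></div>'
box = BeautifulSoup(html, "html.parser").select_one("div.ss_book_box")

def safe_text(tag):
    # Return None (not an exception) when a selector misses, so the row is still kept.
    return tag.get_text(strip=True) if tag is not None else None

title = safe_text(box.select_one(".bo3"))
price = safe_text(box.select_one(".ss_p2"))  # absent in this sample
print(title, price)  # 어떤 책 None
```

Rows with None in one column can still be counted, and the gaps are visible in the final CSV instead of vanishing.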
```python
final_df.to_excel('yanolja.xlsx', index = False)

# 드라이버 종료
driver.quit()
```
Summary by CodeRabbit
New Features
Chores