Walkthrough

Adds new web crawling scripts and notebooks: static crawlers for Aladin and Naver Jisigin, a dynamic Selenium-based Yanolja review crawler, and a GDELT+Newspaper3k article fetcher. Aggregates parsed data into pandas DataFrames and persists to CSV/Excel. Also removes a static-crawling instructional notebook.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant Requests as HTTP Client
  participant AladinSite as Aladin Website
  participant BS as BeautifulSoup Parser
  participant PD as pandas
  participant FS as File System
  User->>Requests: GET bestseller?page=1..3
  Requests-->>AladinSite: HTTP requests
  AladinSite-->>Requests: HTML responses
  Requests-->>BS: HTML content
  BS-->>PD: Extract rows (title, link, price, rating)
  PD-->>FS: to_csv("aladin_crawling.csv")
  Note over PD,FS: Static crawl aggregation and export
```
```mermaid
sequenceDiagram
  autonumber
  actor User
  participant Requests as HTTP Client
  participant Jisigin as Naver Jisigin
  participant BS as BeautifulSoup
  participant PD as pandas
  participant FS as File System
  loop pages 1..3
    User->>Requests: GET search page
    Requests-->>Jisigin: HTTP
    Jisigin-->>Requests: HTML
    Requests-->>BS: Parse results
    BS-->>PD: Append (title, link, date, category, hit)
  end
  PD-->>FS: to_excel("jisigin.xlsx")
```
```mermaid
sequenceDiagram
  autonumber
  actor User
  participant Selenium as Selenium WebDriver
  participant Yanolja as Yanolja Reviews Page
  participant BS as BeautifulSoup
  participant PD as pandas
  participant FS as File System
  User->>Selenium: Launch Chrome, get(url)
  loop Scroll to load more
    Selenium->>Yanolja: execute_script(scroll)
    Yanolja-->>Selenium: Updated DOM
  end
  Selenium-->>BS: page_source
  BS-->>PD: Extract reviews & derive ratings
  PD-->>FS: to_excel("yanolja.xlsx")
  Selenium-->>User: Quit
```
```mermaid
sequenceDiagram
  autonumber
  actor User
  participant GDELT as GDELT API
  participant News as News Sites
  participant NP as Newspaper3k
  participant PD as pandas
  participant FS as File System
  User->>GDELT: article_search(filters)
  GDELT-->>User: Article metadata (titles, URLs)
  loop for each URL
    User->>NP: download(url), parse()
    NP-->>User: title, text
  end
  User->>PD: Build DataFrame (title, url, text)
  PD-->>FS: to_csv("articles_data.csv")
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🧹 Nitpick comments (29)
2_gyum/aladin.py (6)
36-39: Add a timeout and status check to the HTTP request; drop the useless expression.

This prevents hangs and surfaces HTTP errors early. The bare `soup` expression is a no-op in scripts. Apply this diff:

```diff
-response = requests.get(url) # 요청 보내기
-html = response.text # 응답 받은 HTML 문서
-soup = BeautifulSoup(html,'html.parser') # BeautifulSoup으로 파싱
-soup
+response = requests.get(url, timeout=10) # 요청 보내기
+response.raise_for_status()
+html = response.text # 응답 받은 HTML 문서
+soup = BeautifulSoup(html, 'html.parser') # BeautifulSoup으로 파싱
```
51-53: Remove the no-op expression.

`tree` alone does nothing; keep only the assignment or print explicitly. Apply this diff:

```diff
-tree = soup.select_one('#Myform > .ss_book_box')
-tree
+tree = soup.select_one('#Myform > .ss_book_box')
```
96-107: Replace the bare except with specific, logged handling.

A bare `except: continue` hides bugs and makes debugging hard. Apply this diff:

```diff
-for tree in trees:
-    try:
-        title = tree.select_one('.bo3')
-        title_text = title.text
-        title_link = title.attrs['href']
-
-        price = tree.select_one(".ss_p2").text
-        review = tree.select_one(".star_score").text
-
-        print(title_text, title_link, price, review)
-    except: continue
+for tree in trees:
+    try:
+        title = tree.select_one('.bo3')
+        title_text = title.text
+        title_link = title.attrs['href']
+        price = tree.select_one(".ss_p2").text
+        review = tree.select_one(".star_score").text
+        print(title_text, title_link, price, review)
+    except AttributeError:
+        # 일부 요소가 없는 카드 스킵
+        continue
+    except Exception as e:
+        # 예기치 못한 예외 로깅 후 스킵
+        print(f"skip item due to error: {e}")
+        continue
```
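The narrowing pattern above generalizes beyond BeautifulSoup. A minimal stdlib sketch, with plain dicts standing in for the parsed cards (the `extract_items` helper and its field names are illustrative, not from the reviewed code):

```python
def extract_items(cards):
    """Collect (title, price) pairs, skipping cards that lack a field.

    Only the expected failure (a missing key) is caught narrowly;
    anything else is logged and skipped instead of silently swallowed.
    """
    results = []
    for card in cards:
        try:
            results.append((card["title"], card["price"]))
        except KeyError:
            # Expected: some cards simply lack a field -> skip quietly
            continue
        except Exception as e:
            # Unexpected: log before skipping so bugs stay visible
            print(f"skip item due to error: {e}")
            continue
    return results

cards = [
    {"title": "Book A", "price": "12,000"},
    {"title": "Book B"},  # missing price -> skipped
    {"title": "Book C", "price": "9,800"},
]
print(extract_items(cards))  # [('Book A', '12,000'), ('Book C', '9,800')]
```

The same shape applies to the scraper: catch `AttributeError` for a missing selector match, and log everything else.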
129-133: Add a timeout and status check for the paginated requests. Apply this diff:

```diff
-    response = requests.get(url) # 요청 보내기
-    html = response.text # 응답 받은 HTML 문서
+    response = requests.get(url, timeout=10) # 요청 보내기
+    response.raise_for_status()
+    html = response.text # 응답 받은 HTML 문서
```
146: Avoid a bare `except` here as well. Apply this diff:

```diff
-    except: continue
+    except AttributeError:
+        continue
+    except Exception as e:
+        print(f"skip item due to error: {e}")
+        continue
```
148-151: Drop the no-op expression; optionally show a small preview. Apply this diff:

```diff
-df = pd.DataFrame(datas,columns=['title_text','title_link','price','review'])
-df
+df = pd.DataFrame(datas, columns=['title_text','title_link','price','review'])
+print(df.shape)
```

2_gyum/newpaper.py (2)
63-66: Rename the unused loop index. Apply this diff:

```diff
-for index, row in articles.iterrows():
+for _idx, row in articles.iterrows():
```
67-79: Harden the article fetch with error handling; network and parser failures are common. Apply this diff:

```diff
-    # Newspaper3k 라이브러리를 사용하여 기사 본문 추출
-    article = Article(url)
-    article.download()
-    article.parse()
-    text = article.text # 기사 본문
-
-    # DataFrame에 추가
-    articles_data.append({
-        "title": title,
-        "url": url,
-        "text": text
-    })
+    # Newspaper3k 라이브러리를 사용하여 기사 본문 추출
+    try:
+        article = Article(url, language='en')
+        article.download()
+        article.parse()
+        text = article.text # 기사 본문
+    except Exception as e:
+        print(f"skip {url}: {e}")
+        continue
+
+    # DataFrame에 추가
+    articles_data.append({"title": title, "url": url, "text": text})
```

2_gyum/yanolja.py (3)
28-36: Prefer explicit waits and, optionally, headless mode for robustness.

Sleeping is flaky; waiting for elements is more stable, and headless is CI-friendly. Example (outside this hunk):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".css-1kpa3g > p"))
)
```

Optionally use webdriver-manager or set ChromeOptions headless.
119-125: Guard against a length mismatch between ratings and reviews.

Zipping silently drops extras; trim or assert. Apply this diff:

```diff
-# 별점과 리뷰를 결합하여 리스트 생성
-data = list(zip(ratings, reviews))
+# 별점과 리뷰를 결합하여 리스트 생성
+if len(ratings) != len(reviews):
+    min_len = min(len(ratings), len(reviews))
+    ratings, reviews = ratings[:min_len], reviews[:min_len]
+data = list(zip(ratings, reviews))
```
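A self-contained sketch of the trim-before-zip guard; the function name and the warning message are illustrative, not part of the reviewed script:

```python
def pair_up(ratings, reviews):
    """Zip two scraped lists, trimming to the shorter one and
    reporting how many items were dropped rather than losing
    them silently."""
    if len(ratings) != len(reviews):
        dropped = abs(len(ratings) - len(reviews))
        print(f"warning: dropping {dropped} unpaired item(s)")
        n = min(len(ratings), len(reviews))
        ratings, reviews = ratings[:n], reviews[:n]
    return list(zip(ratings, reviews))

print(pair_up([5, 4, 3], ["good", "ok"]))  # [(5, 'good'), (4, 'ok')]
```

Making the drop visible matters here: a silent mismatch would misalign every rating with the wrong review text downstream.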
207-209: Ensure the driver quits on failure.

Wrap critical steps in try/finally to avoid orphaned Chrome processes. Example (outside this hunk):

```python
driver = webdriver.Chrome()
try:
    # ... crawling logic ...
    final_df.to_excel('yanolja.xlsx', index=False)
finally:
    driver.quit()
```

2_gyum/jisigin.py (5)
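The try/finally guarantee can be demonstrated offline; `FakeDriver` below is a stand-in for a Selenium WebDriver, used only to show that `quit()` runs whether or not the crawl raises:

```python
class FakeDriver:
    """Stand-in for a Selenium WebDriver, recording whether
    quit() was called."""
    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True

def crawl(driver, fail=False):
    try:
        if fail:
            raise RuntimeError("page blew up")
        return "data"
    finally:
        driver.quit()  # always runs, on success or failure

d1 = FakeDriver()
crawl(d1)
print(d1.quit_called)  # True

d2 = FakeDriver()
try:
    crawl(d2, fail=True)
except RuntimeError:
    pass
print(d2.quit_called)  # True: quit ran despite the error
```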
38-42: Add a timeout and status check; drop the no-op expression. Apply this diff:

```diff
-response = requests.get(url) # 요청 보내기
-html = response.text # 응답 받은 HTML 문서
-soup = BeautifulSoup(html,'html.parser') # BeautifulSoup으로 파싱
-soup
+response = requests.get(url, timeout=10) # 요청 보내기
+response.raise_for_status()
+html = response.text # 응답 받은 HTML 문서
+soup = BeautifulSoup(html, 'html.parser') # BeautifulSoup으로 파싱
```
54-56: Remove the no-op expression. Apply this diff:

```diff
-tree = soup.select_one('.basic1 > li > dl')
-tree # 첫 번째 질문의 HTML 구조를 출력하여 확인
+tree = soup.select_one('.basic1 > li > dl')
```
98-101: Make the hit parsing resilient.

Splitting and indexing can crash if the format changes. Apply this diff:

```diff
-texts = hit_tag.text
-hit = texts.split()[1]
+texts = hit_tag.text
+parts = [p for p in texts.split() if p.isdigit()]
+hit = parts[0] if parts else ""
```

Alternatively, use a regex to extract the digits.
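The regex alternative could look like this; the sample strings mimic Naver's "답변수 N" label, though the exact markup on the live page may differ:

```python
import re

def extract_count(text):
    """Pull the first integer (commas allowed) out of a label
    like '답변수 1,234'; return None when no number is present."""
    m = re.search(r"\d[\d,]*", text)
    return int(m.group().replace(",", "")) if m else None

print(extract_count("답변수 10"))     # 10
print(extract_count("답변수 1,234"))  # 1234
print(extract_count("답변 없음"))     # None
```

Returning `None` instead of raising keeps a single malformed item from aborting the whole page.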
139-143: Add a timeout and status check in the pagination loop. Apply this diff:

```diff
-    response = requests.get(url)
+    response = requests.get(url, timeout=10)
+    response.raise_for_status()
```
156-159: Drop the no-op expression or print a preview. Apply this diff:

```diff
-df = pd.DataFrame(data,columns=['title','link','date','category','hit'])
-
-df
+df = pd.DataFrame(data, columns=['title','link','date','category','hit'])
+print(df.shape)
```

dynamic-crawling/yanolja.ipynb (4)
22-67: Clear notebook outputs before committing.

Large stored outputs bloat diffs and repo size. Re-run with outputs cleared, or use "Clear All Outputs" before saving.
82-95: Prefer WebDriverWait over sleep for stability.

Replace sleeps with waits for elements or scroll completion. Example (outside this hunk):

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for reviews to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".css-1kpa3g > p"))
)
```
768-778: Guard against a ratings/reviews length mismatch before building the DataFrame.

Add before constructing `df_reviews` (outside this hunk):

```python
if len(ratings) != len(reviews):
    min_len = min(len(ratings), len(reviews))
    ratings, reviews = ratings[:min_len], reviews[:min_len]
```
1057-1060: Remove the teaching placeholder now that it is filled in.

`######## your code here ########` can be dropped. Apply this diff:

```diff
-######## your code here ########
 final_df.to_excel('yanolja.xlsx',index=False)
```

static-crawling/jisigin.ipynb (9)
28-52: Strip large notebook outputs before committing (repo bloat).

The rendered HTML and pip logs make the notebook heavy and noisy in diffs. Commit with outputs cleared (e.g., nbstripout) to keep the repo lean.
54-58: Avoid installing packages inside notebooks.

Prefer a requirements.txt/pyproject.toml and a reproducible environment over in-notebook pip installs.
1049-1050: Don't print the entire DOM; limit or format the output.

This reduces diff noise and avoids Ruff B018 ("useless expression"). Apply this diff:

```diff
-soup
+print(soup.prettify()[:2000])
```
1098-1100: Guard against no-result pages.

`soup.select_one` may return None; accessing it later will crash. Apply this diff:

```diff
-tree = soup.select_one('.basic1 > li > dl')
+tree = soup.select_one('.basic1 > li > dl')
+if tree is None:
+    raise ValueError("No results found for the query.")
```
1131-1134: Simplify the selector for the title anchor.

More robust and readable than escaping colons. Apply this diff:

```diff
-title_tag = tree.select_one("._nclicks\\:kin\\.txt._searchListTitleAnchor")
+title_tag = tree.select_one("dt > a._searchListTitleAnchor")
 title = title_tag.text
 link = title_tag.attrs['href']
```
1191-1196: Fix terminology: this is the answer count, not the view count.

The `.hit` element contains "답변수 N". Apply this diff:

```diff
-# 조회수 추출
+# 답변수 추출
 hit_tag = tree.select_one('.hit')
 texts = hit_tag.text
 hit = texts.split()[1]
 print(hit)
```
1232-1243: Be defensive when extracting per-item fields.

Any missing element (title/category/hit) will raise. Skip incomplete items. Apply this diff:

```diff
 for tree in trees:
-    title = tree.select_one("._nclicks\\:kin\\.txt").text
-    link = tree.select_one("._nclicks\\:kin\\.txt").attrs['href']
-    date = tree.select_one(".txt_inline").text
-    category = tree.select_one("._nclicks\\:kin\\.cat2").text
-    hit = tree.select_one(".hit").text.split()[1]
+    title_el = tree.select_one("dt > a._searchListTitleAnchor")
+    date_el = tree.select_one(".txt_inline")
+    category_el = tree.select_one("._nclicks\\:kin\\.cat2")
+    hit_el = tree.select_one(".hit")
+    if not (title_el and date_el and category_el and hit_el):
+        continue
+    title = title_el.get_text(strip=True)
+    link = title_el['href']
+    date = date_el.get_text(strip=True)
+    category = category_el.get_text(strip=True)
+    hit = hit_el.get_text(strip=True).split()[1]
```
1638-1661: Harden the multi-page crawl: session, headers, timeouts, error handling, and a polite delay.

This improves reliability and reduces block risk. Apply this diff:

```diff
-# 여러 페이지에서 정보 추출
-data = []
-for page_num in range(1, 4): # 1~3페이지 크롤링
-    url = f"https://kin.naver.com/search/list.naver?query=%EC%82%BC%EC%84%B1%EC%A0%84%EC%9E%90&page={page_num}"
-    response = requests.get(url)
-    html = response.text
-    soup = BeautifulSoup(html, 'html.parser')
-    trees = soup.select(".basic1 > li > dl")
-
-    for tree in trees:
-        ################
-        title = tree.select_one("._nclicks\\:kin\\.txt").text
-        link = tree.select_one("._nclicks\\:kin\\.txt").attrs['href']
-        date = tree.select_one(".txt_inline").text
-        category = tree.select_one("._nclicks\\:kin\\.cat2").text
-        hit = tree.select_one(".hit").text.split()[1]
-        # 데이터를 리스트에 추가
-        data.append([title, link, date, category, hit])
+# 여러 페이지에서 정보 추출
+import time
+data = []
+session = requests.Session()
+headers = {
+    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+}
+for page_num in range(1, 4): # 1~3페이지 크롤링
+    url = f"https://kin.naver.com/search/list.naver?query=%EC%82%BC%EC%84%B1%EC%A0%84%EC%9E%90&page={page_num}"
+    resp = session.get(url, headers=headers, timeout=10)
+    resp.raise_for_status()
+    html = resp.text
+    soup = BeautifulSoup(html, 'html.parser')
+    trees = soup.select(".basic1 > li > dl")
+
+    for tree in trees:
+        title_el = tree.select_one("dt > a._searchListTitleAnchor")
+        date_el = tree.select_one(".txt_inline")
+        category_el = tree.select_one("._nclicks\\:kin\\.cat2")
+        hit_el = tree.select_one(".hit")
+        if not (title_el and date_el and category_el and hit_el):
+            continue
+        title = title_el.get_text(strip=True)
+        link = title_el['href']
+        date = date_el.get_text(strip=True)
+        category = category_el.get_text(strip=True)
+        hit_txt = hit_el.get_text(strip=True)
+        # "답변수 10" -> 10 (digits only)
+        hit = "".join(ch for ch in hit_txt if ch.isdigit())
+        data.append([title, link, date, category, hit])
+    time.sleep(1) # polite delay
```
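The loop shape (fetch each page, skip failures, sleep between requests) can be exercised offline; `crawl_pages` and the stub `fake_fetch` below are illustrative helpers, with the callable standing in for `session.get`:

```python
import time

def crawl_pages(fetch, pages, delay=0.01):
    """Fetch each page via the supplied callable, skip pages that
    raise, and sleep briefly between requests as a polite delay."""
    results = []
    for page in pages:
        try:
            results.append(fetch(page))
        except Exception as e:
            print(f"skip page {page}: {e}")
            continue
        time.sleep(delay)
    return results

def fake_fetch(page):
    # Simulate one flaky page among three
    if page == 2:
        raise RuntimeError("HTTP 500")
    return f"html-for-page-{page}"

print(crawl_pages(fake_fetch, [1, 2, 3]))
# ['html-for-page-1', 'html-for-page-3']
```

Separating the loop from the transport also makes the crawl testable without hitting the site at all.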
1680-1683: Consider normalizing dtypes before saving.

Converting date to datetime and hit to numeric makes the Excel file more useful downstream. Example (apply before to_excel):

```python
df['date'] = pd.to_datetime(df['date'], errors='coerce', format='%Y.%m.%d.')
df['hit'] = pd.to_numeric(df['hit'], errors='coerce').astype('Int64')
```
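The same coercion semantics can be sketched with the stdlib, assuming Naver's "YYYY.MM.DD." date style (trailing dot included) and occasionally non-numeric hit strings; unparseable values become None, mirroring `errors='coerce'`:

```python
from datetime import datetime

def normalize(date_str, hit_str):
    """Parse a 'YYYY.MM.DD.' date and an integer-like hit count,
    returning None for values that fail to parse."""
    try:
        date = datetime.strptime(date_str, "%Y.%m.%d.")
    except ValueError:
        date = None
    try:
        hit = int(hit_str.replace(",", ""))
    except (ValueError, AttributeError):
        hit = None
    return date, hit

print(normalize("2024.01.05.", "1,234"))
print(normalize("3일 전", "답변수"))  # both unparseable -> (None, None)
```

Note that relative dates like "3일 전" ("3 days ago") will coerce to None either way, so a relative-date branch may be needed for recent posts.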
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)

- `api-practice/articles_data.csv` is excluded by `!**/*.csv`
- `dynamic-crawling/yanolja.xlsx` is excluded by `!**/*.xlsx`
- `static-crawling/aladin_crawling.csv` is excluded by `!**/*.csv`
- `static-crawling/jisigin.xlsx` is excluded by `!**/*.xlsx`

📒 Files selected for processing (7)

- `2_gyum/aladin.py` (1 hunk)
- `2_gyum/jisigin.py` (1 hunk)
- `2_gyum/newpaper.py` (1 hunk)
- `2_gyum/yanolja.py` (1 hunk)
- `dynamic-crawling/yanolja.ipynb` (13 hunks)
- `static-crawling/jisigin.ipynb` (9 hunks)
- `static-crawling/static-crawling_assignment.ipynb` (0 hunks)
💤 Files with no reviewable changes (1)
- static-crawling/static-crawling_assignment.ipynb
🧰 Additional context used
🪛 Ruff (0.13.1)
2_gyum/yanolja.py
1-1: Shebang is present but file is not executable
(EXE001)
15-15: Undefined name get_ipython
(F821)
16-16: Undefined name get_ipython
(F821)
17-17: Undefined name get_ipython
(F821)
18-18: Undefined name get_ipython
(F821)
78-78: Found useless expression. Either assign it to a variable or remove it.
(B018)
108-108: Found useless expression. Either assign it to a variable or remove it.
(B018)
170-170: Found useless expression. Either assign it to a variable or remove it.
(B018)
187-187: Found useless expression. Either assign it to a variable or remove it.
(B018)
static-crawling/jisigin.ipynb
84-84: Found useless expression. Either assign it to a variable or remove it.
(B018)
2_gyum/newpaper.py
1-1: Shebang is present but file is not executable
(EXE001)
14-14: Undefined name get_ipython
(F821)
15-15: Undefined name get_ipython
(F821)
16-16: Undefined name get_ipython
(F821)
17-17: Undefined name get_ipython
(F821)
63-63: Loop control variable index not used within loop body
Rename unused index to _index
(B007)
2_gyum/aladin.py
1-1: Shebang is present but file is not executable
(EXE001)
17-17: Undefined name get_ipython
(F821)
18-18: Undefined name get_ipython
(F821)
19-19: Undefined name get_ipython
(F821)
36-36: Probable use of requests call without timeout
(S113)
39-39: Found useless expression. Either assign it to a variable or remove it.
(B018)
52-52: Found useless expression. Either assign it to a variable or remove it.
(B018)
107-107: Do not use bare except
(E722)
107-107: try-except-continue detected, consider logging the exception
(S112)
107-107: Multiple statements on one line (colon)
(E701)
129-129: Probable use of requests call without timeout
(S113)
146-146: Do not use bare except
(E722)
146-146: try-except-continue detected, consider logging the exception
(S112)
146-146: Multiple statements on one line (colon)
(E701)
149-149: Found useless expression. Either assign it to a variable or remove it.
(B018)
2_gyum/jisigin.py
1-1: Shebang is present but file is not executable
(EXE001)
18-18: Undefined name get_ipython
(F821)
19-19: Undefined name get_ipython
(F821)
20-20: Undefined name get_ipython
(F821)
21-21: Undefined name get_ipython
(F821)
38-38: Probable use of requests call without timeout
(S113)
41-41: Found useless expression. Either assign it to a variable or remove it.
(B018)
55-55: Found useless expression. Either assign it to a variable or remove it.
(B018)
139-139: Probable use of requests call without timeout
(S113)
158-158: Found useless expression. Either assign it to a variable or remove it.
(B018)
dynamic-crawling/yanolja.ipynb
44-44: Found useless expression. Either assign it to a variable or remove it.
(B018)
2_gyum/aladin.py, lines 17-20

Remove notebook-only get_ipython() installs (breaks as a .py).

These calls raise NameError outside Jupyter and block execution. Apply this diff:

```diff
-get_ipython().system('pip install bs4')
-get_ipython().system('pip install requests')
-get_ipython().system('pip install pandas')
```

Consider managing deps via requirements.txt instead.

🤖 Prompt for AI Agents

In 2_gyum/aladin.py around lines 17 to 20, remove the notebook-only get_ipython().system('pip install ...') calls because they raise NameError outside Jupyter. Delete those lines, ensure the module just imports bs4, requests, and pandas at the top, and add these packages to requirements.txt (or document them in installation instructions). If runtime installation is required, replace the notebook calls with a subprocess.run(...) install guarded by a check, but the preferred fix is to remove the in-file installs and manage dependencies via requirements.txt.
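The "guarded subprocess.run install" fallback mentioned above could be sketched as follows; the `ensure_installed` helper and its return convention are illustrative, and requirements.txt remains the preferred route:

```python
import importlib.util
import subprocess
import sys

def ensure_installed(package, import_name=None):
    """Return True if the package is already importable; otherwise
    install it with pip and return False.

    A guarded last resort for standalone scripts; pinning deps in
    requirements.txt is still the better practice.
    """
    name = import_name or package
    if importlib.util.find_spec(name) is not None:
        return True  # already available, no install needed
    subprocess.run(
        [sys.executable, "-m", "pip", "install", package],
        check=True,
    )
    return False

print(ensure_installed("json"))  # True: stdlib, nothing to install
```

Using `sys.executable -m pip` keeps the install tied to the interpreter actually running the script, avoiding the wrong-environment pitfall of a bare `pip` shell call.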
2_gyum/jisigin.py, lines 18-22

Remove notebook-only get_ipython() installs (breaks as a .py). Apply this diff:

```diff
-get_ipython().system('pip install bs4')
-get_ipython().system('pip install requests')
-get_ipython().system('pip install pandas')
-get_ipython().system('pip install openpyxl')
```

🤖 Prompt for AI Agents

In 2_gyum/jisigin.py around lines 18-22, remove the notebook-specific get_ipython().system('pip install ...') calls because they break when the file is run as a .py. Delete those four lines and declare the dependencies (bs4, requests, pandas, openpyxl) in requirements.txt or setup metadata; if runtime installs are truly needed, add a non-notebook fallback that calls pip via subprocess.run only under a guarded main/runtime-install path.
2_gyum/newpaper.py, lines 14-18

Remove notebook-only get_ipython() installs (breaks as a .py).

These will NameError outside Jupyter. Apply this diff:

```diff
-get_ipython().system('pip install gdeltdoc')
-get_ipython().system('pip install newspaper3k==0.2.8')
-get_ipython().system('pip install lxml_html_clean')
-get_ipython().system('pip install pandas')
```

Manage dependencies via requirements.txt.

🤖 Prompt for AI Agents

In 2_gyum/newpaper.py around lines 14 to 18, remove the notebook-specific get_ipython().system('pip install ...') calls because they raise NameError when run as a .py. Delete those lines and add the required packages (gdeltdoc, newspaper3k==0.2.8, lxml_html_clean, pandas) to the project's requirements.txt (or setup/install instructions) so dependencies are installed outside the script; update CI/dev docs or the README to instruct installing via pip install -r requirements.txt.
2_gyum/yanolja.py, lines 15-19

Remove notebook-only get_ipython() installs (breaks as a .py). Apply this diff:

```diff
-get_ipython().system('pip install selenium')
-get_ipython().system('pip install bs4')
-get_ipython().system('pip install pandas')
-get_ipython().system('pip install openpyxl')
```

Use requirements.txt/poetry for deps.

🤖 Prompt for AI Agents

In 2_gyum/yanolja.py around lines 15-19, remove the notebook-only get_ipython().system('pip install ...') calls, which break when the file runs as a .py. Instead add selenium, bs4, pandas, and openpyxl to the project's dependency manager (requirements.txt or pyproject.toml/poetry), ensure any CI/container setup installs them before runtime, and update the README to instruct installing dependencies via pip install -r requirements.txt or poetry install.
static-crawling/jisigin.ipynb, lines 1046-1049

🧩 Analysis chain

Add User-Agent, timeout, and error handling to requests.

This prevents 403/anti-bot blocks and hanging cells, and surfaces HTTP errors early. Apply this diff:

```diff
-response = requests.get(url) # 요청 보내기
-html = response.text # 응답 받은 HTML 문서
-soup = BeautifulSoup(html,'html.parser') # BeautifulSoup으로 파싱
+headers = {
+    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+}
+resp = requests.get(url, headers=headers, timeout=10) # 요청 보내기
+resp.raise_for_status()
+html = resp.text # 응답 받은 HTML 문서
+soup = BeautifulSoup(html, 'html.parser') # BeautifulSoup으로 파싱
```

🌐 Web query: Is crawling of Naver Knowledge iN (kin.naver.com) search result pages permitted by Naver's robots.txt/ToS? Provide the relevant section or official guidance.

💡 Result: Short answer: no. Naver does not permit third-party bulk collection of its service databases (including Knowledge iN). Naver's official guidance says its crawlers follow robots.txt and that collecting Naver's database ("네이버의 데이터베이스를 수집해가는 행위") is disallowed and may lead to legal action. See Naver's search-collection policy and related help docs. [1][2]

Relevant official excerpts (translated/summarized):

- "Naver's search robots comply with the international crawling conventions set out in robots.txt... collecting Naver's database may, depending on the case, violate copyright law or the Information and Communications Network Act." (Naver search policy) [1]
- Help article on robots.txt ("robots.txt로 인해 정보를 수집할 수 없습니다"), explaining robots.txt usage and that it must be used to block or allow crawlers. [2]

Sources:
[1] Naver, "검색결과 수집에 대한 정책" (Naver search-collection policy: statement on robots.txt and the prohibition on collecting Naver's database).
[2] Naver 고객센터, 검색 관련 도움말 (robots.txt / collection guidance).

Conclusion: add User-Agent, timeout, and error handling, and stop scraping Naver Knowledge iN (kin.naver.com) without permission. Implement the request hardening above, but do not proceed to bulk-collect kin.naver.com: Naver's policy disallows collecting its database (including Knowledge iN). Remove this scrape, obtain explicit permission, or use an official API.

🤖 Prompt for AI Agents

In static-crawling/jisigin.ipynb around lines 1046 to 1049, the requests.get call is unhardened and the notebook is attempting to scrape kin.naver.com, which is disallowed. Replace or remove that scrape, and harden any remaining HTTP requests by adding a descriptive User-Agent header, a reasonable timeout, and try/except around the request to handle requests.exceptions (timeouts, connection errors, HTTP errors). If the target is kin.naver.com, remove the scraping code, or stop and document that explicit permission or an official API is required before collecting data; do not perform bulk collection.
Summary by CodeRabbit

New Features

- Static crawlers for Aladin bestsellers and Naver Jisigin, a Selenium-based Yanolja review crawler, and a GDELT+Newspaper3k article fetcher, with results exported to CSV/Excel.

Chores

- Removed the static-crawling instructional notebook.