
3기_2주차_이홍겸#5

Open
ghdrua1 wants to merge 1 commit into HateSlop:main from ghdrua1:gyum

Conversation


@ghdrua1 ghdrua1 commented Sep 23, 2025

Summary by CodeRabbit

  • New Features

    • Added static crawler for Aladin bestsellers exporting CSV.
    • Added static crawler for Naver Knowledge iN exporting Excel with title, link, date, category, and hits.
    • Introduced GDELT + Newspaper workflow to fetch article metadata and full text to CSV.
    • Added Yanolja review crawler (script and notebook) using Selenium, extracting reviews/ratings, computing averages and top words, exporting to Excel.
  • Chores

    • Removed the instructional Aladin static crawling notebook.


coderabbitai bot commented Sep 23, 2025

Walkthrough

Adds new web crawling scripts and notebooks: static crawlers for Aladin and Naver Jisigin, a dynamic Selenium-based Yanolja review crawler, and a GDELT+Newspaper3k article fetcher. Aggregates parsed data into pandas DataFrames and persists to CSV/Excel. Also removes a static-crawling instructional notebook.

Changes

Cohort / File(s) Summary
Static crawl — Aladin
2_gyum/aladin.py
New script to fetch Aladin bestseller pages (pages 1–3), parse with BeautifulSoup, extract title/link/price/rating, aggregate to DataFrame, save to CSV.
Static crawl — Naver Jisigin
2_gyum/jisigin.py, static-crawling/jisigin.ipynb
Adds script and notebook to crawl Jisigin search results (pages 1–3), extract title/link/date/category/hit, build DataFrame, export to Excel; notebook includes installation/output cells and metadata updates.
Dynamic crawl — Yanolja reviews
2_gyum/yanolja.py, dynamic-crawling/yanolja.ipynb
Introduces Selenium-based review scraping, scrolling, BeautifulSoup parsing, rating derivation, DataFrame creation, average rating and frequent words, export to Excel; notebook contains executed flow and cleanup.
News articles via GDELT + Newspaper3k
2_gyum/newpaper.py
New script querying GDELT with filters, fetching full text via Newspaper3k, assembling DataFrame (title/url/text), saving to CSV.
Removal — static crawling assignment
static-crawling/static-crawling_assignment.ipynb
Deletes instructional notebook for Aladin static crawling.
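The static crawlers in this cohort share one pattern: fetch HTML, select nodes, collect row dicts. A minimal sketch of that pattern against a hypothetical inline fragment shaped like Aladin's bestseller markup (the selectors mirror the ones discussed in the review comments; the fragment and URL are illustrative, not the live page):

```python
from bs4 import BeautifulSoup

# Illustrative fragment shaped like Aladin's bestseller markup; the real
# pages are fetched with requests and may differ in detail.
html = """
<div id="Myform">
  <div class="ss_book_box">
    <a class="bo3" href="https://example.com/item/1">Sample Book</a>
    <span class="ss_p2">13,500</span>
    <span class="star_score">9.2</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for box in soup.select("#Myform > .ss_book_box"):
    title_el = box.select_one(".bo3")
    price_el = box.select_one(".ss_p2")
    review_el = box.select_one(".star_score")
    if not (title_el and price_el and review_el):
        continue  # skip cards with missing fields instead of crashing
    rows.append({
        "title_text": title_el.get_text(strip=True),
        "title_link": title_el["href"],
        "price": price_el.get_text(strip=True),
        "review": review_el.get_text(strip=True),
    })

print(rows)
```

The explicit None-checks before appending are the same defensive move the review asks for in place of bare `except: continue`.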

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Requests as HTTP Client
  participant AladinSite as Aladin Website
  participant BS as BeautifulSoup Parser
  participant PD as pandas
  participant FS as File System

  User->>Requests: GET bestseller?page=1..3
  Requests-->>AladinSite: HTTP requests
  AladinSite-->>Requests: HTML responses
  Requests-->>BS: HTML content
  BS-->>PD: Extract rows (title, link, price, rating)
  PD-->>FS: to_csv("aladin_crawling.csv")
  Note over PD,FS: Static crawl aggregation and export
sequenceDiagram
  autonumber
  actor User
  participant Requests as HTTP Client
  participant Jisigin as Naver Jisigin
  participant BS as BeautifulSoup
  participant PD as pandas
  participant FS as File System

  loop pages 1..3
    User->>Requests: GET search page
    Requests-->>Jisigin: HTTP
    Jisigin-->>Requests: HTML
    Requests-->>BS: Parse results
    BS-->>PD: Append (title, link, date, category, hit)
  end
  PD-->>FS: to_excel("jisigin.xlsx")
sequenceDiagram
  autonumber
  actor User
  participant Selenium as Selenium WebDriver
  participant Yanolja as Yanolja Reviews Page
  participant BS as BeautifulSoup
  participant PD as pandas
  participant FS as File System

  User->>Selenium: Launch Chrome, get(url)
  loop Scroll to load more
    Selenium->>Yanolja: execute_script(scroll)
    Yanolja-->>Selenium: Updated DOM
  end
  Selenium-->>BS: page_source
  BS-->>PD: Extract reviews & derive ratings
  PD-->>FS: to_excel("yanolja.xlsx")
  Selenium-->>User: Quit
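The "average rating and frequent words" step in this flow reduces to a mean plus a word tally; a minimal sketch with made-up review strings (illustrative data, not the crawler's actual output):

```python
from collections import Counter

# Illustrative review data; the real script derives these from the parsed DOM.
ratings = [5, 4, 5]
reviews = ["깨끗하고 좋아요", "위치가 좋아요", "깨끗하고 친절해요"]

avg_rating = sum(ratings) / len(ratings)
word_counts = Counter(word for review in reviews for word in review.split())
top_words = word_counts.most_common(2)

print(round(avg_rating, 2))  # 4.67
print(top_words)
```

Whitespace splitting is a rough tokenizer for Korean; a morphological analyzer would give cleaner frequencies, but the aggregation shape is the same.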
sequenceDiagram
  autonumber
  actor User
  participant GDELT as GDELT API
  participant News as News Sites
  participant NP as Newspaper3k
  participant PD as pandas
  participant FS as File System

  User->>GDELT: article_search(filters)
  GDELT-->>User: Article metadata (titles, URLs)
  loop for each URL
    User->>NP: download(url), parse()
    NP-->>User: title, text
  end
  User->>PD: Build DataFrame (title, url, text)
  PD-->>FS: to_csv("articles_data.csv")
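The per-URL loop in this diagram is the fragile part. Sketched below with a stand-in `fetch_text` (a hypothetical stub replacing Newspaper3k's `Article.download()`/`parse()`, so the error-handling shape runs without network access):

```python
import pandas as pd

# Hypothetical stand-in for Newspaper3k's Article(url).download()/parse();
# it exists only to make the loop's error handling visible offline.
def fetch_text(url: str) -> str:
    if not url.startswith("http"):
        raise ValueError(f"bad url: {url}")
    return f"full text of {url}"

# Illustrative metadata rows, shaped like GDELT article_search results.
metadata = [
    {"title": "Sample headline", "url": "https://example.com/a"},
    {"title": "Broken row", "url": "not-a-url"},
]

articles_data = []
for row in metadata:
    try:
        text = fetch_text(row["url"])
    except Exception as e:
        print(f"skip {row['url']}: {e}")  # log and move on rather than abort
        continue
    articles_data.append({"title": row["title"], "url": row["url"], "text": text})

df = pd.DataFrame(articles_data, columns=["title", "url", "text"])
print(df.shape)  # (1, 3)
```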

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A rabbit taps keys with caffeinated cheer,
Scrapes stars and reviews, brings data near.
From GDELT’s streams to booklists bright,
It harvests text by Selenium night.
With CSVs stacked and Excel in tow,
Hippity-hop—off to insights we go! 🐇📊

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Title Check: ❓ Inconclusive. The title "3기_2주차_이홍겸" is a short author/week label that does not describe the substantive changes in this PR (which add multiple crawling scripts and notebooks), so it is too generic to convey the main change to reviewers. Resolution: rename the PR to a concise, descriptive summary of the primary change, for example "Add web crawling scripts and notebooks (Aladin, Jisigin, Yanolja, GDELT)"; if the PR primarily targets one module, name that module specifically (e.g., "Add Aladin static crawler and CSV export").
✅ Passed checks (2 passed)
Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
Docstring Coverage: ✅ Passed. No functions found in the changes, so the docstring coverage check was skipped.
✨ Finishing touches
  • 📝 Generate Docstrings
  • 🧪 Generate unit tests
    • Create PR with unit tests
    • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ghdrua1 ghdrua1 changed the title from 3기_2주차_이홍 to 3기_2주차_이홍겸 Sep 23, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (29)
2_gyum/aladin.py (6)

36-39: Add timeout and status check to HTTP request; drop useless expression.

Prevents hangs and surfaces HTTP errors. The bare soup expression is a no-op in scripts.

Apply this diff:

-response = requests.get(url)  # 요청 보내기
-html = response.text  # 응답 받은 HTML 문서
-soup = BeautifulSoup(html,'html.parser')  # BeautifulSoup으로 파싱
-soup
+response = requests.get(url, timeout=10)  # 요청 보내기
+response.raise_for_status()
+html = response.text  # 응답 받은 HTML 문서
+soup = BeautifulSoup(html, 'html.parser')  # BeautifulSoup으로 파싱

51-53: Remove no-op expression.

tree alone does nothing; keep only the assignment or print explicitly.

Apply this diff:

-tree = soup.select_one('#Myform > .ss_book_box')
-tree
+tree = soup.select_one('#Myform > .ss_book_box')

96-107: Replace bare except with specific/logged handling.

Bare except: continue hides bugs and makes debugging hard.

Apply this diff:

-for tree in trees:
-    try:
-        title = tree.select_one('.bo3')
-        title_text = title.text
-        title_link = title.attrs['href']
-
-        price = tree.select_one(".ss_p2").text
-        review = tree.select_one(".star_score").text
-        
-        print(title_text, title_link, price, review)
-    except: continue
+for tree in trees:
+    try:
+        title = tree.select_one('.bo3')
+        title_text = title.text
+        title_link = title.attrs['href']
+        price = tree.select_one(".ss_p2").text
+        review = tree.select_one(".star_score").text
+        print(title_text, title_link, price, review)
+    except AttributeError as e:
+        # 일부 요소가 없는 카드 스킵
+        continue
+    except Exception as e:
+        # 예기치 못한 예외 로깅 후 스킵
+        print(f"skip item due to error: {e}")
+        continue

129-133: Add timeout and status check for paginated requests.

Apply this diff:

-    response = requests.get(url)  # 요청 보내기
-    html = response.text  # 응답 받은 HTML 문서
+    response = requests.get(url, timeout=10)  # 요청 보내기
+    response.raise_for_status()
+    html = response.text  # 응답 받은 HTML 문서

146-146: Avoid bare except here as well.

Apply this diff:

-        except: continue
+        except AttributeError:
+            continue
+        except Exception as e:
+            print(f"skip item due to error: {e}")
+            continue

148-151: Drop no-op expression; optionally show a small preview.

Apply this diff:

-df = pd.DataFrame(datas,columns=['title_text','title_link','price','review'])
-df
+df = pd.DataFrame(datas, columns=['title_text','title_link','price','review'])
+print(df.shape)
2_gyum/newpaper.py (2)

63-66: Rename unused loop index.

Apply this diff:

-for index, row in articles.iterrows():
+for _idx, row in articles.iterrows():

67-79: Harden article fetch with error handling (network/parser failures are common).

Apply this diff:

-    # Newspaper3k 라이브러리를 사용하여 기사 본문 추출
-    article = Article(url)
-    article.download()
-    article.parse()
-    text = article.text  # 기사 본문
-    
-    # DataFrame에 추가
-    articles_data.append({
-        "title": title,
-        "url": url,
-        "text": text
-    })
+    # Newspaper3k 라이브러리를 사용하여 기사 본문 추출
+    try:
+        article = Article(url, language='en')
+        article.download()
+        article.parse()
+        text = article.text  # 기사 본문
+    except Exception as e:
+        print(f"skip {url}: {e}")
+        continue
+
+    # DataFrame에 추가
+    articles_data.append({"title": title, "url": url, "text": text})
2_gyum/yanolja.py (3)

28-36: Prefer explicit waits and (optionally) headless/options for robustness.

Sleeping is flaky; waiting for elements is more stable. Headless is CI-friendly.

Example (outside this hunk):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".css-1kpa3g > p")))

Optionally use webdriver-manager or set ChromeOptions headless.


119-125: Guard against length mismatch between ratings and reviews.

Zipping silently drops extras; trim or assert.

Apply this diff:

-# 별점과 리뷰를 결합하여 리스트 생성
-data = list(zip(ratings, reviews))
+# 별점과 리뷰를 결합하여 리스트 생성
+if len(ratings) != len(reviews):
+    min_len = min(len(ratings), len(reviews))
+    ratings, reviews = ratings[:min_len], reviews[:min_len]
+data = list(zip(ratings, reviews))

207-209: Ensure driver quits on failure.

Wrap critical steps in try/finally to avoid orphaned Chrome processes.

Example (outside this hunk):

driver = webdriver.Chrome()
try:
    # ... crawling logic ...
    final_df.to_excel('yanolja.xlsx', index=False)
finally:
    driver.quit()
2_gyum/jisigin.py (5)

38-42: Add timeout/status check; drop no-op expression.

Apply this diff:

-response = requests.get(url)  # 요청 보내기
-html = response.text  # 응답 받은 HTML 문서
-soup = BeautifulSoup(html,'html.parser')  # BeautifulSoup으로 파싱
-soup
+response = requests.get(url, timeout=10)  # 요청 보내기
+response.raise_for_status()
+html = response.text  # 응답 받은 HTML 문서
+soup = BeautifulSoup(html, 'html.parser')  # BeautifulSoup으로 파싱

54-56: Remove no-op expression.

Apply this diff:

-tree = soup.select_one('.basic1 > li > dl')
-tree  # 첫 번째 질문의 HTML 구조를 출력하여 확인
+tree = soup.select_one('.basic1 > li > dl')

98-101: Make hit parsing resilient.

Splitting and indexing can crash if format changes.

Apply this diff:

-texts = hit_tag.text
-hit = texts.split()[1]
+texts = hit_tag.text
+parts = [p for p in texts.split() if p.isdigit()]
+hit = parts[0] if parts else ""

Alternatively use regex to extract digits.
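The regex variant could look like this (`extract_hit` is an illustrative helper name, not part of the script):

```python
import re

def extract_hit(text: str) -> str:
    # Return the first run of digits in text, or "" when none is present,
    # e.g. "답변수 10" -> "10".
    match = re.search(r"\d+", text)
    return match.group() if match else ""

print(extract_hit("답변수 10"))  # 10
print(extract_hit("답변수"))     # prints an empty string
```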


139-143: Add timeout/status check in pagination loop.

Apply this diff:

-    response = requests.get(url)
+    response = requests.get(url, timeout=10)
+    response.raise_for_status()

156-159: Drop no-op expression or print a preview.

Apply this diff:

-df = pd.DataFrame(data,columns=['title','link','date','category','hit'])
-
-df
+df = pd.DataFrame(data, columns=['title','link','date','category','hit'])
+print(df.shape)
dynamic-crawling/yanolja.ipynb (4)

22-67: Clear notebook outputs before committing.

Huge stored outputs bloat diffs and repo size. Re-run or use “Clear All Outputs” before saving.


82-95: Prefer WebDriverWait over sleep for stability.

Replace sleeps with waits for elements/scroll completion.

Example (outside this hunk):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait for reviews to appear
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".css-1kpa3g > p")))

768-778: Guard against ratings/reviews length mismatch before DataFrame.

Apply this diff:

-# DataFrame으로 변환
-df_reviews = pd.DataFrame(data, columns=['Rating', 'Review'])
-df_reviews.index += 1
+# DataFrame으로 변환
+df_reviews = pd.DataFrame(data, columns=['Rating', 'Review'])
+df_reviews.index += 1

And add (outside this hunk):

if len(ratings) != len(reviews):
    min_len = min(len(ratings), len(reviews))
    ratings, reviews = ratings[:min_len], reviews[:min_len]

1057-1060: Remove teaching placeholder once filled.

######## your code here ######## can be dropped now.

Apply this diff:

-######## your code here ########
 final_df.to_excel('yanolja.xlsx',index=False)
static-crawling/jisigin.ipynb (9)

28-52: Strip large notebook outputs before committing (repo bloat).

The rendered HTML and pip logs make the notebook heavy and noisy in diffs. Commit with outputs cleared (e.g., nbstripout) to keep the repo lean.


54-58: Avoid installing packages inside notebooks.

Prefer a requirements.txt/pyproject.toml and a reproducible environment over in-notebook pip installs.


1049-1050: Don’t print the entire DOM; limit or format output.

Reduces diff noise and avoids Ruff B018 “useless expression.”

Apply this diff:

-soup
+print(soup.prettify()[:2000])

1098-1100: Guard against no-result pages.

soup.select_one may return None; accessing it later will crash.

Apply this diff:

-tree = soup.select_one('.basic1 > li > dl')
+tree = soup.select_one('.basic1 > li > dl')
+if tree is None:
+    raise ValueError("No results found for the query.")

1131-1134: Simplify selector for the title anchor.

More robust and readable than escaping colons.

Apply this diff:

-title_tag = tree.select_one("._nclicks\\:kin\\.txt._searchListTitleAnchor")
+title_tag = tree.select_one("dt > a._searchListTitleAnchor")
 title = title_tag.text
 link = title_tag.attrs['href'] 

1191-1196: Fix terminology: this is answer count, not view count.

The .hit element contains “답변수 N”.

Apply this diff:

-# 조회수 추출
+# 답변수 추출
 hit_tag = tree.select_one('.hit')
 texts = hit_tag.text
 hit = texts.split()[1]
 print(hit)

1232-1243: Be defensive when extracting per-item fields.

Any missing element (title/category/hit) will raise. Skip incomplete items.

Apply this diff:

 for tree in trees:
-    title = tree.select_one("._nclicks\\:kin\\.txt").text
-    link = tree.select_one("._nclicks\\:kin\\.txt").attrs['href']
-    date = tree.select_one(".txt_inline").text
-    category = tree.select_one("._nclicks\\:kin\\.cat2").text
-    hit = tree.select_one(".hit").text.split()[1]
+    title_el = tree.select_one("dt > a._searchListTitleAnchor")
+    date_el = tree.select_one(".txt_inline")
+    category_el = tree.select_one("._nclicks\\:kin\\.cat2")
+    hit_el = tree.select_one(".hit")
+    if not (title_el and date_el and category_el and hit_el):
+        continue
+    title = title_el.get_text(strip=True)
+    link = title_el['href']
+    date = date_el.get_text(strip=True)
+    category = category_el.get_text(strip=True)
+    hit = hit_el.get_text(strip=True).split()[1]

1638-1661: Harden multi-page crawl: session, headers, timeouts, error handling, and polite delay.

Improves reliability and reduces block risk.

Apply this diff:

-# 여러 페이지에서 정보 추출
-data = []
-for page_num in range(1, 4):  # 1~3페이지 크롤링
-    url = f"https://kin.naver.com/search/list.naver?query=%EC%82%BC%EC%84%B1%EC%A0%84%EC%9E%90&page={page_num}"
-    response = requests.get(url)
-    html = response.text
-    soup = BeautifulSoup(html, 'html.parser')
-    trees = soup.select(".basic1 > li > dl")
-    
-    for tree in trees:
-        ################
-        title = tree.select_one("._nclicks\\:kin\\.txt").text
-        link = tree.select_one("._nclicks\\:kin\\.txt").attrs['href']
-        date = tree.select_one(".txt_inline").text
-        category = tree.select_one("._nclicks\\:kin\\.cat2").text
-        hit = tree.select_one(".hit").text.split()[1]
-        # 데이터를 리스트에 추가
-        data.append([title, link, date, category, hit])
+# 여러 페이지에서 정보 추출
+import time
+data = []
+session = requests.Session()
+headers = {
+    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+}
+for page_num in range(1, 4):  # 1~3페이지 크롤링
+    url = f"https://kin.naver.com/search/list.naver?query=%EC%82%BC%EC%84%B1%EC%A0%84%EC%9E%90&page={page_num}"
+    resp = session.get(url, headers=headers, timeout=10)
+    resp.raise_for_status()
+    html = resp.text
+    soup = BeautifulSoup(html, 'html.parser')
+    trees = soup.select(".basic1 > li > dl")
+
+    for tree in trees:
+        title_el = tree.select_one("dt > a._searchListTitleAnchor")
+        date_el = tree.select_one(".txt_inline")
+        category_el = tree.select_one("._nclicks\\:kin\\.cat2")
+        hit_el = tree.select_one(".hit")
+        if not (title_el and date_el and category_el and hit_el):
+            continue
+        title = title_el.get_text(strip=True)
+        link = title_el['href']
+        date = date_el.get_text(strip=True)
+        category = category_el.get_text(strip=True)
+        hit_txt = hit_el.get_text(strip=True)
+        # "답변수 10" -> 10 (digits only)
+        hit = "".join(ch for ch in hit_txt if ch.isdigit())
+        data.append([title, link, date, category, hit])
+    time.sleep(1)  # polite delay

1680-1683: Consider normalizing dtypes before saving.

Converting date to datetime and hit to numeric makes the Excel more useful downstream.

Example (apply before to_excel):

df['date'] = pd.to_datetime(df['date'], errors='coerce', format='%Y.%m.%d.')
df['hit'] = pd.to_numeric(df['hit'], errors='coerce').astype('Int64')
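That conversion can be checked on a toy frame (sample values are illustrative; the format string assumes Naver's "YYYY.MM.DD." date rendering):

```python
import pandas as pd

# Toy frame with sample values shaped like the crawled columns
df = pd.DataFrame({"date": ["2025.09.23."], "hit": ["10"]})

# coerce turns unparseable values into NaT/NaN instead of raising
df["date"] = pd.to_datetime(df["date"], errors="coerce", format="%Y.%m.%d.")
df["hit"] = pd.to_numeric(df["hit"], errors="coerce").astype("Int64")

print(df.dtypes)
```

`Int64` (capital I) is pandas' nullable integer dtype, so rows where `hit` failed to parse stay `<NA>` rather than forcing the column to float.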
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1e2a4f7 and f0579b6.

⛔ Files ignored due to path filters (4)
  • api-practice/articles_data.csv is excluded by !**/*.csv
  • dynamic-crawling/yanolja.xlsx is excluded by !**/*.xlsx
  • static-crawling/aladin_crawling.csv is excluded by !**/*.csv
  • static-crawling/jisigin.xlsx is excluded by !**/*.xlsx
📒 Files selected for processing (7)
  • 2_gyum/aladin.py (1 hunks)
  • 2_gyum/jisigin.py (1 hunks)
  • 2_gyum/newpaper.py (1 hunks)
  • 2_gyum/yanolja.py (1 hunks)
  • dynamic-crawling/yanolja.ipynb (13 hunks)
  • static-crawling/jisigin.ipynb (9 hunks)
  • static-crawling/static-crawling_assignment.ipynb (0 hunks)
💤 Files with no reviewable changes (1)
  • static-crawling/static-crawling_assignment.ipynb
🧰 Additional context used
🪛 Ruff (0.13.1)
2_gyum/yanolja.py

1-1: Shebang is present but file is not executable

(EXE001)


15-15: Undefined name get_ipython

(F821)


16-16: Undefined name get_ipython

(F821)


17-17: Undefined name get_ipython

(F821)


18-18: Undefined name get_ipython

(F821)


78-78: Found useless expression. Either assign it to a variable or remove it.

(B018)


108-108: Found useless expression. Either assign it to a variable or remove it.

(B018)


170-170: Found useless expression. Either assign it to a variable or remove it.

(B018)


187-187: Found useless expression. Either assign it to a variable or remove it.

(B018)

static-crawling/jisigin.ipynb

84-84: Found useless expression. Either assign it to a variable or remove it.

(B018)

2_gyum/newpaper.py

1-1: Shebang is present but file is not executable

(EXE001)


14-14: Undefined name get_ipython

(F821)


15-15: Undefined name get_ipython

(F821)


16-16: Undefined name get_ipython

(F821)


17-17: Undefined name get_ipython

(F821)


63-63: Loop control variable index not used within loop body

Rename unused index to _index

(B007)

2_gyum/aladin.py

1-1: Shebang is present but file is not executable

(EXE001)


17-17: Undefined name get_ipython

(F821)


18-18: Undefined name get_ipython

(F821)


19-19: Undefined name get_ipython

(F821)


36-36: Probable use of requests call without timeout

(S113)


39-39: Found useless expression. Either assign it to a variable or remove it.

(B018)


52-52: Found useless expression. Either assign it to a variable or remove it.

(B018)


107-107: Do not use bare except

(E722)


107-107: try-except-continue detected, consider logging the exception

(S112)


107-107: Multiple statements on one line (colon)

(E701)


129-129: Probable use of requests call without timeout

(S113)


146-146: Do not use bare except

(E722)


146-146: try-except-continue detected, consider logging the exception

(S112)


146-146: Multiple statements on one line (colon)

(E701)


149-149: Found useless expression. Either assign it to a variable or remove it.

(B018)

2_gyum/jisigin.py

1-1: Shebang is present but file is not executable

(EXE001)


18-18: Undefined name get_ipython

(F821)


19-19: Undefined name get_ipython

(F821)


20-20: Undefined name get_ipython

(F821)


21-21: Undefined name get_ipython

(F821)


38-38: Probable use of requests call without timeout

(S113)


41-41: Found useless expression. Either assign it to a variable or remove it.

(B018)


55-55: Found useless expression. Either assign it to a variable or remove it.

(B018)


139-139: Probable use of requests call without timeout

(S113)


158-158: Found useless expression. Either assign it to a variable or remove it.

(B018)

dynamic-crawling/yanolja.ipynb

44-44: Found useless expression. Either assign it to a variable or remove it.

(B018)

Comment thread 2_gyum/aladin.py
Comment on lines +17 to +20
get_ipython().system('pip install bs4')
get_ipython().system('pip install requests')
get_ipython().system('pip install pandas')


⚠️ Potential issue

Remove notebook-only get_ipython() installs (breaks as a .py).

These calls raise NameError outside Jupyter and block execution.

Apply this diff:

-get_ipython().system('pip install bs4')
-get_ipython().system('pip install requests')
-get_ipython().system('pip install pandas')

Consider managing deps via requirements.txt instead.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
get_ipython().system('pip install bs4')
get_ipython().system('pip install requests')
get_ipython().system('pip install pandas')
🧰 Tools
🪛 Ruff (0.13.1)

17-17: Undefined name get_ipython

(F821)


18-18: Undefined name get_ipython

(F821)


19-19: Undefined name get_ipython

(F821)

🤖 Prompt for AI Agents
In 2_gyum/aladin.py around lines 17 to 20, remove the notebook-only
get_ipython().system('pip install ...') calls because they raise NameError
outside Jupyter; instead delete those lines, ensure the module just imports bs4,
requests, and pandas at top, and add these packages to requirements.txt (or
document them in installation instructions); if runtime installation is
required, replace notebook calls with a safe subprocess.run(...) installation
guarded by a check, but preferred fix is to remove the in-file installs and
manage dependencies via requirements.txt.

Comment thread 2_gyum/jisigin.py
Comment on lines +18 to +22
get_ipython().system('pip install bs4')
get_ipython().system('pip install requests')
get_ipython().system('pip install pandas')
get_ipython().system('pip install openpyxl')


⚠️ Potential issue

Remove notebook-only get_ipython() installs (breaks as a .py).

Apply this diff:

-get_ipython().system('pip install bs4')
-get_ipython().system('pip install requests')
-get_ipython().system('pip install pandas')
-get_ipython().system('pip install openpyxl')
📝 Committable suggestion


Suggested change
get_ipython().system('pip install bs4')
get_ipython().system('pip install requests')
get_ipython().system('pip install pandas')
get_ipython().system('pip install openpyxl')
🧰 Tools
🪛 Ruff (0.13.1)

18-18: Undefined name get_ipython

(F821)


19-19: Undefined name get_ipython

(F821)


20-20: Undefined name get_ipython

(F821)


21-21: Undefined name get_ipython

(F821)

🤖 Prompt for AI Agents
In 2_gyum/jisigin.py around lines 18-22, remove the notebook-specific
get_ipython().system('pip install ...') calls because they break when the file
is run as a .py; instead delete those four lines and document or declare the
dependencies (bs4, requests, pandas, openpyxl) in requirements.txt or setup
metadata, or if you need runtime installs add a safe non-notebook fallback that
uses subprocess.run to call pip only under a guarded main/runtime-install path.

Comment thread 2_gyum/newpaper.py
Comment on lines +14 to +18
get_ipython().system('pip install gdeltdoc')
get_ipython().system('pip install newspaper3k==0.2.8')
get_ipython().system('pip install lxml_html_clean')
get_ipython().system('pip install pandas')


⚠️ Potential issue

Remove notebook-only get_ipython() installs (breaks as a .py).

These will NameError outside Jupyter.

Apply this diff:

-get_ipython().system('pip install gdeltdoc')
-get_ipython().system('pip install newspaper3k==0.2.8')
-get_ipython().system('pip install lxml_html_clean')
-get_ipython().system('pip install pandas')

Manage dependencies via requirements.txt.

📝 Committable suggestion


Suggested change
get_ipython().system('pip install gdeltdoc')
get_ipython().system('pip install newspaper3k==0.2.8')
get_ipython().system('pip install lxml_html_clean')
get_ipython().system('pip install pandas')
🧰 Tools
🪛 Ruff (0.13.1)

14-14: Undefined name get_ipython

(F821)


15-15: Undefined name get_ipython

(F821)


16-16: Undefined name get_ipython

(F821)


17-17: Undefined name get_ipython

(F821)

🤖 Prompt for AI Agents
In 2_gyum/newpaper.py around lines 14 to 18, remove the notebook-specific
get_ipython().system('pip install ...') calls because they raise NameError when
run as a .py; instead, delete those lines and add the required packages
(gdeltdoc, newspaper3k==0.2.8, lxml_html_clean, pandas) to the project's
requirements.txt (or setup/install instructions) so dependencies are installed
outside the script; ensure any CI/dev docs or README are updated to instruct
installing dependencies via pip install -r requirements.txt.

Comment thread 2_gyum/yanolja.py
Comment on lines +15 to +19
get_ipython().system('pip install selenium')
get_ipython().system('pip install bs4')
get_ipython().system('pip install pandas')
get_ipython().system('pip install openpyxl')


⚠️ Potential issue

Remove notebook-only get_ipython() installs (breaks as a .py).

Apply this diff:

-get_ipython().system('pip install selenium')
-get_ipython().system('pip install bs4')
-get_ipython().system('pip install pandas')
-get_ipython().system('pip install openpyxl')

Use requirements.txt/poetry for deps.

📝 Committable suggestion


Suggested change
get_ipython().system('pip install selenium')
get_ipython().system('pip install bs4')
get_ipython().system('pip install pandas')
get_ipython().system('pip install openpyxl')
🧰 Tools
🪛 Ruff (0.13.1)

15-15: Undefined name get_ipython

(F821)


16-16: Undefined name get_ipython

(F821)


17-17: Undefined name get_ipython

(F821)


18-18: Undefined name get_ipython

(F821)

🤖 Prompt for AI Agents
In 2_gyum/yanolja.py around lines 15-19, remove the notebook-only
get_ipython().system('pip install ...') calls (these break when the file runs as
a .py); instead add the dependencies selenium, bs4, pandas, openpyxl to the
project's dependency manager (requirements.txt or pyproject.toml/poetry) and
ensure any CI/container setup installs them before runtime; delete the four
get_ipython lines and update documentation/README to instruct installing
dependencies via pip install -r requirements.txt or poetry install.

Comment on lines +1046 to 1049
"response = requests.get(url) # 요청 보내기\n",
"html = response.text # 응답 받은 HTML 문서\n",
"soup = BeautifulSoup(html,'html.parser') # BeautifulSoup으로 파싱\n",
"soup"

⚠️ Potential issue

🧩 Analysis chain

Add User-Agent, timeout, and error handling to requests.

Prevents 403/anti-bot blocks and hanging cells; surfaces HTTP errors early.

Apply this diff:

-response = requests.get(url)  # 요청 보내기
-html = response.text  # 응답 받은 HTML 문서
-soup = BeautifulSoup(html,'html.parser')  # BeautifulSoup으로 파싱
+headers = {
+    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+}
+resp = requests.get(url, headers=headers, timeout=10)  # 요청 보내기
+resp.raise_for_status()
+html = resp.text  # 응답 받은 HTML 문서
+soup = BeautifulSoup(html, 'html.parser')  # BeautifulSoup으로 파싱

🌐 Web query:

Is crawling of Naver Knowledge iN (kin.naver.com) search result pages permitted by Naver's robots.txt/ToS? Provide the relevant section or official guidance.

💡 Result:

Short answer: No — Naver does not permit third‑party bulk collection of its service databases (including Knowledge iN). Naver’s official guidance says its crawlers follow robots.txt and that “네이버의 데이터베이스를 수집해가는 행위” (collecting Naver’s database) is disallowed and may lead to legal action. See Naver’s search-collection policy and related help docs. [1][2]

Relevant official excerpts (translated/summary):

  • "Naver's search robots comply with the international search conventions based on robots.txt… the act of collecting Naver's database … may, depending on the case, violate copyright law or the Act on Promotion of Information and Communications Network Utilization." — Naver search policy. [1]
  • Help article on robots.txt / "Information cannot be collected due to robots.txt." (explains robots.txt use and that you must use robots.txt to block/allow crawlers). [2]

Sources:
[1] Naver — "Policy on collecting search results" (Naver search collection policy / statement about robots.txt and prohibition on collecting Naver's database).
[2] Naver Help Center — search-related help articles (robots.txt / collection guidance).
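Since the policy above hinges on robots.txt, a pre-flight check with the standard library's parser is straightforward; this sketch parses rules from a string (the agent name and rules are made up for illustration):

```python
from urllib import robotparser

def allowed(robots_txt, agent, url):
    """Check whether `agent` may fetch `url` under the given robots.txt text."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Rules that disallow everything for all agents, as many portals publish for scrapers.
rules = "User-agent: *\nDisallow: /"
print(allowed(rules, "study-crawler", "https://example.com/search"))  # False
```

Against a live site one would load the real file with `rp.set_url(...)` and `rp.read()`; a `False` result means the crawl should not proceed regardless of what the HTML request would return.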


Add User-Agent, timeout, and error handling — and stop scraping Naver Knowledge iN (kin.naver.com) without permission

File: static-crawling/jisigin.ipynb Lines: 1046-1049 — implement the request hardening below, but do not proceed to bulk-collect kin.naver.com: Naver’s policy disallows collecting its database (including Knowledge iN); remove this scrape, obtain explicit permission, or use an official API.

Apply this diff:

-response = requests.get(url)  # 요청 보내기
-html = response.text  # 응답 받은 HTML 문서
-soup = BeautifulSoup(html,'html.parser')  # BeautifulSoup으로 파싱
+headers = {
+    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
+}
+resp = requests.get(url, headers=headers, timeout=10)  # 요청 보내기
+resp.raise_for_status()
+html = resp.text  # 응답 받은 HTML 문서
+soup = BeautifulSoup(html, 'html.parser')  # BeautifulSoup으로 파싱
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"response = requests.get(url) # 요청 보내기\n",
"html = response.text # 응답 받은 HTML 문서\n",
"soup = BeautifulSoup(html,'html.parser') # BeautifulSoup으로 파싱\n",
"soup"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
resp = requests.get(url, headers=headers, timeout=10) # 요청 보내기
resp.raise_for_status()
html = resp.text # 응답 받은 HTML 문서
soup = BeautifulSoup(html, 'html.parser') # BeautifulSoup으로 파싱
soup
🤖 Prompt for AI Agents
In static-crawling/jisigin.ipynb around lines 1046 to 1049, the requests.get
call is insecure and the notebook is attempting to scrape kin.naver.com which is
disallowed: replace/remove that scrape and harden any remaining HTTP requests by
adding a descriptive User-Agent header, a reasonable timeout, and try/except
around the request to handle requests.exceptions (timeout, connection errors,
HTTP errors); if the target is kin.naver.com, remove the scraping code or stop
and document that explicit permission or an official API is required before
collecting data and do not perform bulk collection.
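The try/except the prompt asks for can be wrapped into a small helper; this is a sketch under the assumption that a `None` return is an acceptable failure signal (the header string and helper name are illustrative, not from the PR):

```python
import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; study-crawler/0.1)"}

def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML, or None on any request failure."""
    try:
        resp = requests.get(url, headers=HEADERS, timeout=timeout)
        resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return resp.text
    except requests.exceptions.RequestException as exc:
        # One base class covers timeouts, connection errors, and HTTP errors.
        print(f"request failed: {exc}")
        return None
```

Callers can then branch on `None` instead of letting a notebook cell hang or crash mid-loop.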

@ghdrua1 ghdrua1 closed this Sep 23, 2025
@ghdrua1 ghdrua1 reopened this Sep 23, 2025
Comment thread 2_gyum/yanolja.py





Great work on the week 2 assignment!
