Walkthrough

Introduces an end-to-end NLP pipeline that loads a Yelp Polarity subset, translates English reviews to Korean with NLLB, classifies sentiment on the Korean text with a multilingual model, and stores results. Adds helper functions, applies map-based processing, previews via DataFrames, and includes a parallel script version.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Notebook/Script
    participant Datasets as HuggingFace Datasets
    participant NLLB as Translation Model (NLLB 600M)
    participant Sentiment as Multilingual Sentiment Model
    participant Storage as Dataset w/ new fields
    User->>Notebook/Script: Run pipeline
    Notebook/Script->>Datasets: load_dataset("yelp_polarity", split="train")
    Note right of Notebook/Script: Select first 20 samples, drop label
    loop Map: translate_english_to_korean
        Notebook/Script->>NLLB: translate(example["text"], en->ko)
        NLLB-->>Notebook/Script: korean text
        Notebook/Script->>Storage: add field "korean_translate"
    end
    loop Map: analyze_emotion
        Notebook/Script->>Sentiment: classify("korean_translate")
        Sentiment-->>Notebook/Script: top label
        Notebook/Script->>Storage: add field "emotion"
    end
    Notebook/Script-->>User: Preview DataFrames and sample summaries
```
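The two map passes in the diagram can be traced without the real models by standing in stub functions for the translator and classifier (an editor's sketch; the stub bodies and the `[ko]` prefix are illustrative, not from the PR):

```python
# Stub pipeline mirroring the diagram: a translate pass, then an emotion pass.
def translate_english_to_korean(example):
    example["korean_translate"] = "[ko] " + example["text"]  # stub translation
    return example

def analyze_emotion(example):
    example["emotion"] = "positive"  # stub top-1 label
    return example

# stand-in for the 20-sample Dataset; datasets.Dataset.map applies the
# same per-example functions and materializes the new fields
dataset = [{"text": "Great food!"}, {"text": "Terrible service."}]
dataset = [analyze_emotion(translate_english_to_korean(e)) for e in dataset]
print(dataset[0]["korean_translate"], dataset[0]["emotion"])
```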
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
transformer.py (1)
12-112: Avoid executing the whole pipeline at import time.
Importing this module immediately downloads models, loads datasets, and runs translation/emotion inference. That makes the module unusable for reuse or testing and wastes resources whenever it's imported. Wrap the workflow in a `main()` (or similar) guarded by `if __name__ == "__main__":` so callers can choose when to run it. Suggested structure:
```diff
-full_dataset = load_dataset("yelp_polarity", split="train")
-# …
-for i in range(min(5, len(final_dataset))):
-    ...
+def main():
+    full_dataset = load_dataset("yelp_polarity", split="train")
+    # …
+    for i in range(min(5, len(final_dataset))):
+        ...
+
+
+if __name__ == "__main__":
+    main()
```
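The same layout as a runnable sketch, with the heavy steps (model download, dataset load, inference) replaced by lightweight stubs so importing stays cheap (the stub names `build_models`/`run_pipeline` are illustrative, not from the PR):

```python
def build_models():
    # stub: in the real script this would construct the NLLB translator
    # and the multilingual sentiment classifier via pipeline(...)
    return {"translator": None, "classifier": None}

def run_pipeline(texts):
    # stub: in the real script this is the translate/classify .map() pass
    return [{"text": t, "korean_translate": "", "emotion": ""} for t in texts]

def main():
    build_models()
    results = run_pipeline(["sample review"])
    for row in results[: min(5, len(results))]:
        print(row["text"][:100])

# nothing above runs on import; the guard triggers execution
# only when the file is run directly
if __name__ == "__main__":
    main()
```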
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- transformer.ipynb (1 hunks)
- transformer.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.2)
transformer.py
28-28: Undefined name display
(F821)
67-67: Undefined name display
(F821)
103-103: Undefined name display
(F821)
```python
# 데이터셋 확인을 위해 Pandas DataFrame으로 변환 (선택 사항)
df_check = pd.DataFrame(full_dataset_subset)
print("\n데이터셋 일부 미리보기 (DataFrame):")
display(df_check.head(3))  # Colab 환경에서는 display()가 표 형태로 보여줍니다
```
Fix the `display` NameError.

Running this module outside a notebook will crash because `display` is undefined. Import it (or replace with `print`) before use.
Apply this diff:
```diff
-from transformers import pipeline
-import torch
-import pandas as pd # 데이터 확인용
+from transformers import pipeline
+import torch
+import pandas as pd # 데이터 확인용
+from IPython.display import display
```

Also applies to: 67-67, 103-103
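As a middle ground (an editor's sketch, not part of the review diff), the import can degrade gracefully so the same file runs both in notebooks and on a plain CLI:

```python
# Use IPython's display when available; otherwise fall back to print
# so the script also runs outside interactive environments.
try:
    from IPython.display import display
except ImportError:
    def display(obj):
        print(obj)

display({"status": "ok"})  # renders richly in a notebook, prints on the CLI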
🧰 Tools
🪛 Ruff (0.13.2)
28-28: Undefined name display
(F821)
🤖 Prompt for AI Agents
In transformer.py around lines 28, 67, and 103, calls to display(...) will raise
NameError when run outside a notebook; replace or import it so the module runs
in non-interactive environments. Either import display from IPython.display at
the top of the file (from IPython.display import display) or change those three
display(...) calls to print(...) (or use df.head(n).to_string() for nicer
formatting); update imports or calls consistently for all three locations.
Actionable comments posted: 1
♻️ Duplicate comments (1)
dhleesep9/transformer.py (1)
1-5: The `display` issue has already been flagged.

The missing `from IPython.display import display` import has been identified in a previous review. Please address that comment to resolve the `NameError` at lines 28, 67, and 103.
🧹 Nitpick comments (2)
dhleesep9/transformer.py (2)
55-56: Consider adding error handling for robustness.

The `.map()` call assumes all translations will succeed. If the translation pipeline encounters an error (e.g., network issues, model failures, or problematic input), the entire process will crash without helpful diagnostics. Consider wrapping the translation logic with try-except:
```diff
 def translate_english_to_korean(example):
     """데이터셋의 'text'를 받아 영어 번역 결과를 반환하는 함수"""
-    translation_result = translator(
-        example['text'],
-        tgt_lang="kor_Hang",
-        src_lang="eng_Latn",
-        max_length=400,
-        min_length=30,
-        do_sample=False # deterministic 출력을 원하면 False, 다양성을 원하면 True
-    )
-    # 결과에서 번역 텍스트 추출
-    example['korean_translate'] = translation_result[0]['translation_text']
+    try:
+        translation_result = translator(
+            example['text'],
+            tgt_lang="kor_Hang",
+            src_lang="eng_Latn",
+            max_length=400,
+            min_length=30,
+            do_sample=False # deterministic 출력을 원하면 False, 다양성을 원하면 True
+        )
+        # 결과에서 번역 텍스트 추출
+        example['korean_translate'] = translation_result[0]['translation_text']
+    except Exception as e:
+        print(f"Translation failed for text: {example['text'][:50]}... Error: {e}")
+        example['korean_translate'] = ""  # Fallback to empty string
     return example
```
84-87: Consider adding error handling and clarifying nested access.

Similar to the translation function, this lacks error handling. Additionally, the `[0][0]` indexing pattern (due to `top_k=1` returning a nested structure) could be clearer. Apply this diff to improve robustness and readability:
```diff
 def analyze_emotion(example):
-    emotion_result = emotion_classifier(example['korean_translate'])
-    example['emotion'] = emotion_result[0][0]['label']
+    try:
+        emotion_result = emotion_classifier(example['korean_translate'])
+        # top_k=1 returns [[{label, score}]]
+        example['emotion'] = emotion_result[0][0]['label']
+    except Exception as e:
+        print(f"Emotion analysis failed for text: {example['korean_translate'][:50]}... Error: {e}")
+        example['emotion'] = "unknown"  # Fallback
     return example
```
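The nested shape behind the `[0][0]` access can be seen with a stub that mimics what a classification pipeline with `top_k=1` returns (an editor's sketch; the label and score are made up):

```python
def fake_classifier(text):
    # outer list: one entry per input text (the batch dimension)
    # inner list: top-k candidates; with top_k=1 there is exactly one
    return [[{"label": "positive", "score": 0.98}]]

result = fake_classifier("좋은 식당이었어요")
label = result[0][0]["label"]  # first input, then its top-1 candidate
print(label)  # → positive
```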
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dhleesep9/transformer.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
dhleesep9/transformer.py
28-28: Undefined name display
(F821)
67-67: Undefined name display
(F821)
103-103: Undefined name display
(F821)
🔇 Additional comments (5)
dhleesep9/transformer.py (5)
12-23: LGTM! The dataset loading and sampling logic is clean and appropriate for an experimental pipeline. Removing the label column and selecting 20 samples makes sense for testing the translation and sentiment analysis workflow.
31-37: LGTM! The translation pipeline is correctly configured with the NLLB model and proper device handling for GPU acceleration.
70-80: LGTM! The emotion classification pipeline is properly configured with an appropriate multilingual model. The test with Korean text confirms the pipeline works as expected.
107-111: LGTM! The final results display is well-structured. Truncating the original text to 100 characters prevents excessive output while still providing context for each sample.
47-48: Verify Yelp review length distribution before setting min_length.

The script failed due to the missing `datasets` library; manually verify the word-count distribution of your Yelp reviews (e.g., install `datasets` and rerun, or sample your dataset) to determine how many fall below 30 words. Depending on the result, either remove `min_length` or lower it (e.g., to 5).
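One way to run that check offline is to count whitespace-delimited words per review; the sketch below uses a hard-coded sample list in place of the real `yelp_polarity` split, so the helper name and the numbers are illustrative only:

```python
def count_short_reviews(texts, threshold=30):
    """Return (number of texts under threshold words, all word counts)."""
    counts = [len(t.split()) for t in texts]
    return sum(1 for c in counts if c < threshold), counts

# stand-in for full_dataset_subset["text"]
sample_reviews = [
    "Great food and friendly staff.",
    " ".join(["word"] * 45),
]
short, counts = count_short_reviews(sample_reviews)
print(short, counts)  # → 1 [5, 45]
```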
```python
# 번역을 수행하는 함수 정의
def translate_english_to_korean(example):
    """데이터셋의 'summary'를 받아 영어 번역 결과를 반환하는 함수"""
```
Fix the misleading docstring.
The docstring mentions receiving 'summary', but the function actually operates on the 'text' field at line 44. This inconsistency could confuse future maintainers.
Apply this diff:
```diff
-    """데이터셋의 'summary'를 받아 영어 번역 결과를 반환하는 함수"""
+    """데이터셋의 'text'를 받아 영어 번역 결과를 반환하는 함수"""
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    """데이터셋의 'text'를 받아 영어 번역 결과를 반환하는 함수"""
```
🤖 Prompt for AI Agents
In dhleesep9/transformer.py around line 42, the docstring incorrectly states the
function receives the dataset's 'summary' field while the function actually
operates on the 'text' field at line 44; update the docstring to accurately
describe that the function accepts the dataset's 'text' field (or use a more
general phrase like "input text") and returns the English translation, keeping
the docstring language consistent and concise.
```python
display(df_check.head(3))  # Colab 환경에서는 display()가 표 형태로 보여줍니다

translator = pipeline(
```
Since a long context can exceed the model's input limit, chunking the input tokens into fixed sizes -> translating -> merging would both prevent truncation and likely improve quality!
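A minimal sketch of that chunk → translate → merge idea, with the NLLB pipeline stood in by a placeholder function (the chunk size, `chunk_text`, and `fake_translate` are illustrative; a real version would chunk by tokenizer tokens, not words):

```python
def chunk_text(text, max_words=100):
    """Split text into word-bounded chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def translate_long(text, translate_fn, max_words=100):
    """Translate each chunk separately, then merge the results."""
    return " ".join(translate_fn(c) for c in chunk_text(text, max_words))

# stand-in for the real translator pipeline
fake_translate = lambda chunk: chunk.upper()
print(translate_long("hello world " * 150, fake_translate, max_words=100))
```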
I wrote code that takes English reviews -> translates them to Korean -> and then infers ratings through a model that supports multilingual sentiment analysis.
Summary by CodeRabbit

- New Features
- Refactor
- Chores