
3기_4주차_이두호#2

Open

dhleesep9 wants to merge 4 commits into HateSlop:main from dhleesep9:dhleesep9

Conversation

@dhleesep9

@dhleesep9 dhleesep9 commented Oct 2, 2025

I wrote code that takes English reviews -> translates them into Korean -> then infers a rating with a model that supports multilingual sentiment analysis.

Summary by CodeRabbit

  • New Features

    • Introduced an end-to-end NLP pipeline that loads sample reviews, translates text to Korean, and adds emotion labels.
    • Provides tabular previews of intermediate and final results for quick inspection.
    • Leverages GPU acceleration when available for faster processing.
  • Refactor

    • Replaced placeholder notebook content with concrete, staged data processing and labeling steps.
    • Streamlined dataset processing using batched/map-based flows.
  • Chores

    • Added runtime progress and sample summaries to improve traceability during execution.

@coderabbitai

coderabbitai bot commented Oct 2, 2025

Walkthrough

Introduces an end-to-end NLP pipeline that loads a Yelp Polarity subset, translates English reviews to Korean with NLLB, classifies sentiment on the Korean text with a multilingual model, and stores results. Adds helper functions, applies map-based processing, previews via DataFrames, and includes a parallel script version.

Changes

Cohort / File(s) Summary
Notebook pipeline updates
transformer.ipynb
Replaced placeholders with a working pipeline: load YelpPolarity subset (20 samples), remove label, translate to Korean, classify sentiment on translated text, preview via DataFrames. Added functions: translate_english_to_korean(example), analyze_emotion(example).
Scripted pipeline addition
dhleesep9/transformer.py
New script implementing the same pipeline with progress prints, device (CPU/GPU) detection, dataset previews, and sample summaries. No exported/public API changes.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Notebook/Script
  participant Datasets as HuggingFace Datasets
  participant NLLB as Translation Model (NLLB 600M)
  participant Sentiment as Multilingual Sentiment Model
  participant Storage as Dataset w/ new fields

  User->>Notebook/Script: Run pipeline
  Notebook/Script->>Datasets: load_dataset("yelp_polarity", split="train")
  Note right of Notebook/Script: Select first 20 samples, drop label

  loop Map: translate_english_to_korean
    Notebook/Script->>NLLB: translate(example["text"], en->ko)
    NLLB-->>Notebook/Script: korean text
    Notebook/Script->>Storage: add field "korean_translate"
  end

  loop Map: analyze_emotion
    Notebook/Script->>Sentiment: classify("korean_translate")
    Sentiment-->>Notebook/Script: top label
    Notebook/Script->>Storage: add field "emotion"
  end

  Notebook/Script-->>User: Preview DataFrames and sample summaries

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I hop through tensors, ears held high,
Translating whispers as stars drift by.
From Yelp to Hangul, sentiments bloom—
Five little samples dance in the room.
Pipelines hum, my whiskers twitch:
Map, classify—carrots in a switch! 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title Check ❓ Inconclusive The title “3기_4주차_이두호” merely references a course and author without summarizing the actual code changes, so it fails to convey the main purpose of translating English reviews to Korean and analyzing sentiment. Please update the title to clearly reflect the primary change, for example “Add English-to-Korean translation and sentiment analysis pipeline for Yelp reviews.”
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
transformer.py (1)

12-112: Avoid executing the whole pipeline at import time.
Importing this module immediately downloads models, loads datasets, and runs translation/emotion inference. That makes the module unusable for reuse or testing and wastes resources whenever it’s imported. Wrap the workflow in a main() (or similar) guarded by if __name__ == "__main__": so callers can choose when to run it.

Suggested structure:

-full_dataset = load_dataset("yelp_polarity", split="train")
-# …
-for i in range(min(5, len(final_dataset))):
-  ...
+def main():
+  full_dataset = load_dataset("yelp_polarity", split="train")
+  # …
+  for i in range(min(5, len(final_dataset))):
+    ...
+
+
+if __name__ == "__main__":
+  main()
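To illustrate the suggested guard in isolation, here is a minimal, runnable sketch; `run_pipeline` is a hypothetical stand-in for the actual dataset-loading, translation, and classification steps, not the PR's real code:

```python
# Minimal illustration of the __main__ guard suggested above.
# run_pipeline() stands in for the dataset/translation/classification workflow,
# so importing this module no longer triggers model downloads or inference.

def run_pipeline():
    print("pipeline would run here")
    return "done"

if __name__ == "__main__":
    # Only executed when the file is run directly, never on import.
    run_pipeline()
```

With this structure, tests and other modules can `import` the file cheaply and call `run_pipeline()` explicitly when they actually want the work done.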
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fcf7c2 and 4e41935.

📒 Files selected for processing (2)
  • transformer.ipynb (1 hunks)
  • transformer.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.2)
transformer.py

28-28: Undefined name display

(F821)


67-67: Undefined name display

(F821)


103-103: Undefined name display

(F821)

Comment thread dhleesep9/transformer.py
# Convert the dataset to a Pandas DataFrame for inspection (optional)
df_check = pd.DataFrame(full_dataset_subset)
print("\nDataset preview (DataFrame):")
display(df_check.head(3)) # In Colab, display() renders this as a table


⚠️ Potential issue | 🔴 Critical

Fix display NameError.
Running this module outside a notebook will crash because display is undefined. Import it (or replace with print) before use.

Apply this diff:

-from transformers import pipeline
-import torch
-import pandas as pd # for data inspection
+from transformers import pipeline
+import torch
+import pandas as pd # for data inspection
+from IPython.display import display

Also applies to: 67-67, 103-103

🧰 Tools
🪛 Ruff (0.13.2)

28-28: Undefined name display

(F821)

🤖 Prompt for AI Agents
In transformer.py around lines 28, 67, and 103, calls to display(...) will raise
NameError when run outside a notebook; replace or import it so the module runs
in non-interactive environments. Either import display from IPython.display at
the top of the file (from IPython.display import display) or change those three
display(...) calls to print(...) (or use df.head(n).to_string() for nicer
formatting); update imports or calls consistently for all three locations.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
dhleesep9/transformer.py (1)

1-5: The display issue has already been flagged.

The missing from IPython.display import display import has been identified in a previous review. Please address that comment to resolve the NameError at lines 28, 67, and 103.

🧹 Nitpick comments (2)
dhleesep9/transformer.py (2)

55-56: Consider adding error handling for robustness.

The .map() call assumes all translations will succeed. If the translation pipeline encounters an error (e.g., network issues, model failures, or problematic input), the entire process will crash without helpful diagnostics.

Consider wrapping the translation logic with try-except:

 def translate_english_to_korean(example):
   """Takes the dataset's 'text' and returns the result of translating the English text"""
-  translation_result = translator(
-        example['text'],
-        tgt_lang="kor_Hang",
-        src_lang="eng_Latn",
-        max_length=400,
-        min_length=30,
-        do_sample=False  # False for deterministic output, True for more variety
-    )
-    # Extract the translated text from the result
-  example['korean_translate'] = translation_result[0]['translation_text']
+  try:
+    translation_result = translator(
+          example['text'],
+          tgt_lang="kor_Hang",
+          src_lang="eng_Latn",
+          max_length=400,
+          min_length=30,
+          do_sample=False  # False for deterministic output, True for more variety
+      )
+      # Extract the translated text from the result
+    example['korean_translate'] = translation_result[0]['translation_text']
+  except Exception as e:
+    print(f"Translation failed for text: {example['text'][:50]}... Error: {e}")
+    example['korean_translate'] = ""  # Fallback to empty string
   return example

84-87: Consider adding error handling and clarifying nested access.

Similar to the translation function, this lacks error handling. Additionally, the [0][0] indexing pattern (due to top_k=1 returning a nested structure) could be clearer.

Apply this diff to improve robustness and readability:

 def analyze_emotion(example):
-  emotion_result = emotion_classifier(example['korean_translate'])
-  example['emotion'] = emotion_result[0][0]['label']
+  try:
+    emotion_result = emotion_classifier(example['korean_translate'])
+    # top_k=1 returns [[{label, score}]]
+    example['emotion'] = emotion_result[0][0]['label']
+  except Exception as e:
+    print(f"Emotion analysis failed for text: {example['korean_translate'][:50]}... Error: {e}")
+    example['emotion'] = "unknown"  # Fallback
   return example
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e41935 and 9142b05.

📒 Files selected for processing (1)
  • dhleesep9/transformer.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
dhleesep9/transformer.py

28-28: Undefined name display

(F821)


67-67: Undefined name display

(F821)


103-103: Undefined name display

(F821)

🔇 Additional comments (5)
dhleesep9/transformer.py (5)

12-23: LGTM!

The dataset loading and sampling logic is clean and appropriate for an experimental pipeline. Removing the label column and selecting 20 samples makes sense for testing the translation and sentiment analysis workflow.


31-37: LGTM!

The translation pipeline is correctly configured with the NLLB model and proper device handling for GPU acceleration.


70-80: LGTM!

The emotion classification pipeline is properly configured with an appropriate multilingual model. The test with Korean text confirms the pipeline works as expected.


107-111: LGTM!

The final results display is well-structured. Truncating the original text to 100 characters prevents excessive output while still providing context for each sample.


47-48: Verify Yelp review length distribution before setting min_length

The script failed due to the missing datasets library; manually verify the word-count distribution of your Yelp reviews (e.g., install datasets and rerun or sample your dataset) to determine how many fall below 30 words. Depending on the result, either remove min_length or lower it (e.g., to 5).
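A quick way to check that distribution without running the full pipeline — a sketch using hardcoded sample texts; swap in the real subset (shown commented out) once the `datasets` library is installed:

```python
# Sketch: estimate how many reviews fall under the pipeline's min_length of 30 words.
# The sample texts below are illustrative only; in practice, load the real subset:
#   from datasets import load_dataset
#   texts = load_dataset("yelp_polarity", split="train[:20]")["text"]

def count_short_reviews(texts, threshold=30):
    """Return how many texts have fewer than `threshold` whitespace-separated words."""
    return sum(1 for t in texts if len(t.split()) < threshold)

sample_texts = [
    "Great food, friendly staff.",  # 4 words -> under the threshold
    "The service was slow but the pizza made up for it and we will be back soon.",
    " ".join(["word"] * 40),  # 40 words -> long enough
]

print(count_short_reviews(sample_texts))  # 2 of the 3 samples are under 30 words
```

If a meaningful share of the real reviews comes back short, lowering or dropping `min_length` avoids forcing the translator to pad its output.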

Comment thread dhleesep9/transformer.py

# Define the translation function
def translate_english_to_korean(example):
"""Takes the dataset's 'summary' and returns the result of translating the English text"""


⚠️ Potential issue | 🟡 Minor

Fix the misleading docstring.

The docstring mentions receiving 'summary', but the function actually operates on the 'text' field at line 44. This inconsistency could confuse future maintainers.

Apply this diff:

-  """Takes the dataset's 'summary' and returns the result of translating the English text"""
+  """Takes the dataset's 'text' and returns the result of translating the English text"""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"""Takes the dataset's 'summary' and returns the result of translating the English text"""
"""Takes the dataset's 'text' and returns the result of translating the English text"""
🤖 Prompt for AI Agents
In dhleesep9/transformer.py around line 42, the docstring incorrectly states the
function receives the dataset's 'summary' field while the function actually
operates on the 'text' field at line 44; update the docstring to accurately
describe that the function accepts the dataset's 'text' field (or use a more
general phrase like "input text") and returns the English translation, keeping
the docstring language consistent and concise.


@hub2vu hub2vu left a comment


Great work on the assignment!

Comment thread dhleesep9/transformer.py
display(df_check.head(3)) # In Colab, display() renders this as a table


translator = pipeline(


Since a long context can exceed the model's input limit, chunking the input tokens into fixed-size pieces -> translating each chunk -> merging the results would both prevent truncation and likely improve quality!
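One way to sketch this chunk -> translate -> merge flow; `chunk_words` and `translate_long_text` are hypothetical helpers (not from the PR), and the real NLLB translator from the pipeline would replace the identity function used below:

```python
# Sketch of the chunk -> translate -> merge idea from the review.
# In the actual pipeline, translate_fn would wrap the NLLB translator, e.g.:
#   lambda chunk: translator(chunk, tgt_lang="kor_Hang", src_lang="eng_Latn")[0]["translation_text"]

def chunk_words(text, max_words=150):
    """Split text into pieces of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def translate_long_text(text, translate_fn, max_words=150):
    """Translate each chunk separately, then merge the translated pieces."""
    return " ".join(translate_fn(chunk) for chunk in chunk_words(text, max_words))

# Example with an identity "translator" just to show the flow:
merged = translate_long_text(" ".join(["token"] * 400), lambda chunk: chunk)
print(len(merged.split()))  # 400 — no words lost across chunk boundaries
```

Word-count chunking is only an approximation of the model's token limit; chunking on the tokenizer's actual token counts (or on sentence boundaries) would track the limit more faithfully.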
