
3기_4주차_박지민#12

Open
zzmnxn wants to merge 4 commits into HateSlop:main from zzmnxn:jimin

Conversation

@zzmnxn

@zzmnxn zzmnxn commented Oct 8, 2025

Sentiment and offensive-language classification on the tweet_eval dataset

Summary by CodeRabbit

  • New Features

    • New news processing script that runs sentiment, offensive-content, and emotion analyses on tweet samples with visualizations.
  • Enhancements

    • GPU/CPU device auto-detection for faster inference and progress logging with sample previews.
    • Notebooks and scripts include runnable dataset loading, preprocessing, and richer UI outputs.
  • Documentation

    • Tutorial-style cells and examples guiding the full pipeline.

@coderabbitai

coderabbitai bot commented Oct 8, 2025

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title Check ❓ Inconclusive The title “3기_4주차_박지민” is a generic assignment label that does not describe the changes introduced in this pull request, so it fails to convey the core updates such as adding a news processing script and enhancing notebooks with Hugging Face pipelines and execution metadata. Please rename the pull request to a clear, concise summary of the primary change, for example “Add jimin/news.py sentiment/offensive/emotion pipelines and update Hugging Face notebooks with execution metadata.”
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fcf7c2 and e65b06d.

📒 Files selected for processing (3)
  • huggingface_basics.ipynb (32 hunks)
  • jimin/news.py (1 hunks)
  • transformer.ipynb (14 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
jimin/news.py

116-116: Undefined name display

(F821)

transformer.ipynb

116-116: Undefined name display

(F821)


186-186: Undefined name display

(F821)


248-248: Undefined name display

(F821)

huggingface_basics.ipynb

30-30: Found useless expression. Either assign it to a variable or remove it.

(B018)


122-122: Found useless expression. Either assign it to a variable or remove it.

(B018)


125-125: Found useless expression. Either assign it to a variable or remove it.

(B018)


161-161: Redefinition of unused pd from line 136

Remove definition: pd

(F811)

Comment thread huggingface_basics.ipynb
Comment on lines +1689 to 1706
generator = pipeline('text-generation', model='gpt2', device=0)  # when using a GPU
#generator = pipeline('text-generation', model='gpt2')  # when using the CPU (default)

# Prompt (starting context) for text generation
prompt = "Once upon a time, in a land far, far away,"

# GPT-2 has no default pad token, so use the eos token as the pad token (common practice)
if gpt_tokenizer.pad_token is None:
    gpt_tokenizer.pad_token = gpt_tokenizer.eos_token
# Run text generation with the pipeline
# max_length: maximum length of the generated text (including the prompt)
# num_return_sequences: how many different sequences to return
# The result is a list whose elements are dictionaries.
results = generator(prompt, max_length=50, num_return_sequences=1)

# 2. Prepare input text
text = "What is Huggingface Transformer?"
encoded_input = gpt_tokenizer(text, return_tensors='pt')
input_ids = encoded_input['input_ids']
# Print the generated text
print(f"\nPrompt: {prompt}")
print("--- Generated text ---")
for result in results:
    print(result['generated_text'])


⚠️ Potential issue | 🟠 Major

Don’t hard-code GPU usage for generator

Setting device=0 forces the GPU even on machines (Colab or local) that lack one, leading to AssertionError: Torch not compiled with CUDA enabled. Reuse the earlier device guard (0 or -1) so the cell runs everywhere.

-generator = pipeline('text-generation', model='gpt2', device=0)  # when using a GPU
-#generator = pipeline('text-generation', model='gpt2')  # when using the CPU (default)
+device = 0 if torch.cuda.is_available() else -1
+generator = pipeline('text-generation', model='gpt2', device=device)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Choose device at runtime: GPU if available else CPU
device = 0 if torch.cuda.is_available() else -1
generator = pipeline('text-generation', model='gpt2', device=device)
# Prompt (starting context) for text generation
prompt = "Once upon a time, in a land far, far away,"
# Run text generation with the pipeline
# max_length: maximum length of the generated text (including the prompt)
# num_return_sequences: how many different sequences to return
# The result is a list whose elements are dictionaries.
results = generator(prompt, max_length=50, num_return_sequences=1)
# Print the generated text
print(f"\nPrompt: {prompt}")
print("--- Generated text ---")
for result in results:
    print(result['generated_text'])
🤖 Prompt for AI Agents
In huggingface_basics.ipynb around lines 1689 to 1706, the generator pipeline is
hard-coded with device=0 which forces GPU and causes an assertion on machines
without CUDA; change the pipeline call to use the previously computed device
guard (e.g., device variable that is 0 when CUDA is available and -1 otherwise)
so the pipeline becomes device-aware and runs on CPU when no GPU is present;
update the commented GPU/CPU examples to reference this device variable instead
of hard-coding 0.
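The device guard the review recommends can be sketched on its own. This is a minimal, hedged version: it also handles the case where torch itself is not installed, which the review does not require, by falling back to the CPU index.

```python
# Pick a transformers pipeline device index at runtime:
# 0 (first CUDA GPU) if available, -1 (CPU) otherwise.
# Also falls back to CPU when torch is not installed at all,
# so this snippet runs in any environment.
try:
    import torch
    device = 0 if torch.cuda.is_available() else -1
except ImportError:
    device = -1

print(f"pipeline device index: {device}")
# A pipeline would then be created as (not executed here):
# generator = pipeline('text-generation', model='gpt2', device=device)
```

Passing this index via the device= keyword is what lets the same cell run on both GPU and CPU machines.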

Comment thread jimin/news.py
Comment on lines +1 to +7
import datasets
from datasets import load_dataset, DatasetDict
from transformers import pipeline
import torch
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


⚠️ Potential issue | 🔴 Critical

Fix the display NameError in script execution

Running this .py file outside a notebook will raise NameError: name 'display' is not defined, halting the pipeline before results print. Add a safe import (with fallback) near the top so the script works in both notebook and CLI runs.

+try:
+    from IPython.display import display
+except ImportError:
+    def display(value):
+        print(value)

Also applies to: 109-117

🤖 Prompt for AI Agents
In jimin/news.py at lines 1-7 (and similarly around lines 109-117) the script
uses display which exists in notebooks but raises NameError when run as a CLI
script; add a safe import/fallback near the top: attempt to import display from
IPython.display inside a try/except and on ImportError define a no-op fallback
function (or a simple wrapper that prints the object) so calls to display work
both in notebooks and in non-interactive CLI runs; update any usage sites if
necessary to rely on this fallback.
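The fallback the bot suggests works as a standalone pattern: try the notebook import, and degrade to plain print() when IPython is absent. A minimal sketch:

```python
# Safe 'display' shim: use IPython's rich display inside notebooks,
# fall back to print() in plain CLI runs where IPython is missing.
try:
    from IPython.display import display
except ImportError:
    def display(value):
        print(value)

# Works identically in both environments (sample data is illustrative):
display({"predicted_sentiment": "Positive", "sentiment_score": 0.97})
```

Because the shim is defined before any call site, every later display(...) in the script resolves regardless of environment.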


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
jimin/news.py (3)

58-60: Refactor conditional statements for better readability.

The if-elif-else chain uses multiple statements on one line, which reduces readability and violates PEP 8 style guidelines.

Apply this diff to fix the formatting:

-    if label == 'LABEL_0': label_name = 'Negative'
-    elif label == 'LABEL_1': label_name = 'Neutral'
-    else: label_name = 'Positive'
+    if label == 'LABEL_0':
+        label_name = 'Negative'
+    elif label == 'LABEL_1':
+        label_name = 'Neutral'
+    else:
+        label_name = 'Positive'

Alternatively, use a dictionary for cleaner mapping:

-    if label == 'LABEL_0': label_name = 'Negative'
-    elif label == 'LABEL_1': label_name = 'Neutral'
-    else: label_name = 'Positive'
+    label_map = {'LABEL_0': 'Negative', 'LABEL_1': 'Neutral', 'LABEL_2': 'Positive'}
+    label_name = label_map.get(label, 'Positive')

90-91: Refactor conditional statements for better readability.

Similar to the sentiment analysis function, these conditional statements should be split across multiple lines for better readability.

Apply this diff:

-    if label == 'LABEL_1': label_name = 'Offensive'
-    else: label_name = 'Not Offensive'
+    if label == 'LABEL_1':
+        label_name = 'Offensive'
+    else:
+        label_name = 'Not Offensive'

Or use a dictionary:

-    if label == 'LABEL_1': label_name = 'Offensive'
-    else: label_name = 'Not Offensive'
+    label_map = {'LABEL_1': 'Offensive', 'LABEL_0': 'Not Offensive'}
+    label_name = label_map.get(label, 'Not Offensive')
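The dictionary alternative can be exercised on its own. The label names below mirror the ones used in the review; the .get() default guards against labels the maps do not cover.

```python
# Map raw pipeline labels to human-readable names; .get() supplies a
# default for any unexpected label instead of raising KeyError.
sentiment_map = {'LABEL_0': 'Negative', 'LABEL_1': 'Neutral', 'LABEL_2': 'Positive'}
offensive_map = {'LABEL_0': 'Not Offensive', 'LABEL_1': 'Offensive'}

print(sentiment_map.get('LABEL_1', 'Positive'))      # Neutral
print(offensive_map.get('LABEL_1', 'Not Offensive')) # Offensive
print(sentiment_map.get('LABEL_9', 'Positive'))      # unknown label -> default
```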

127-128: Add emotion analysis columns to the display once the emotion analysis is fixed.

After fixing the analyze_emotion function, remember to include the emotion analysis results in the preview columns.

Apply this diff:

 # Select only the columns needed for the final check
-columns_to_show = ['text', 'predicted_sentiment', 'sentiment_score', 'is_offensive', 'offensive_score']
+columns_to_show = ['text', 'predicted_sentiment', 'sentiment_score', 'is_offensive', 'offensive_score', 'emotion', 'emotion_score']
 df_final_preview = df_final[columns_to_show]

Similarly, update the sample output loop at lines 134-138 to display emotion results:

for i in range(min(3, len(final_dataset))):
    print(f"\n--- Sample {i+1} ---")
    text_preview = final_dataset[i]['text'][:70].replace('\n', ' ')
    print(f"Text: {text_preview}...")
    print(f"Predicted sentiment: {final_dataset[i]['predicted_sentiment']} (Score: {final_dataset[i]['sentiment_score']:.3f})")
    print(f"Offensive: {final_dataset[i]['is_offensive']} (Score: {final_dataset[i]['offensive_score']:.3f})")
    print(f"Emotion: {final_dataset[i]['emotion']} (Score: {final_dataset[i]['emotion_score']:.3f})")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e65b06d and b3f569d.

📒 Files selected for processing (1)
  • jimin/news.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
jimin/news.py

58-58: Multiple statements on one line (colon)

(E701)


59-59: Multiple statements on one line (colon)

(E701)


60-60: Multiple statements on one line (colon)

(E701)


90-90: Multiple statements on one line (colon)

(E701)


91-91: Multiple statements on one line (colon)

(E701)

🔇 Additional comments (3)
jimin/news.py (3)

1-16: LGTM!

The imports and device configuration are properly set up. The device selection logic correctly falls back to CPU when CUDA is not available.


17-35: LGTM!

Dataset loading and initial setup look correct. The subset selection and label information display are properly implemented.


1-10: Past review comment about display NameError is no longer applicable.

The previous review flagged a potential NameError for the display function. However, the current code does not use display() anywhere—it only uses print() statements. This past concern has been resolved.

Comment thread jimin/news.py
return example

# Apply to the dataset
final_dataset = processing_subset.map(analyze_offensive)


⚠️ Potential issue | 🔴 Critical

Fix dataset processing chain to preserve previous results.

The offensive classification results are assigned to final_dataset, but Line 120 overwrites this by mapping analyze_emotion on processing_subset (which only has sentiment results). This means the final dataset will be missing offensive classification data.

Apply this diff to fix the processing chain:

 # Apply to the dataset
-final_dataset = processing_subset.map(analyze_offensive)
+processing_subset = processing_subset.map(analyze_offensive)
 print("\n[Task 2] Hate/offensive classification complete. Added 'is_offensive' column.")

This ensures that emotion analysis (line 120) is applied on top of both sentiment and offensive classifications.

📝 Committable suggestion

Suggested change
# Apply to the dataset
processing_subset = processing_subset.map(analyze_offensive)
print("\n[Task 2] Hate/offensive classification complete. Added 'is_offensive' column.")
🤖 Prompt for AI Agents
In jimin/news.py around lines 98 and 120, the offensive classification results
are assigned to final_dataset at line 98 but later line 120 maps analyze_emotion
over processing_subset, which drops the offensive labels; change the emotion
mapping to run on the accumulated final_dataset (i.e., map analyze_emotion on
final_dataset instead of processing_subset) so the pipeline composes sentiment
-> offensive -> emotion and preserves previous results.
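The composition bug can be illustrated without the datasets library. The toy map_rows() below stands in for Dataset.map() (which likewise returns a new dataset with added columns), and the hypothetical column names mirror the script; rebinding the same variable at each stage is what makes the columns accumulate.

```python
# Toy stand-in for Dataset.map(): applies fn to a copy of each row and
# returns a new list, like datasets returns a new Dataset object.
def map_rows(rows, fn):
    return [fn(dict(r)) for r in rows]

rows = [{'text': 'some tweet'}]
rows = map_rows(rows, lambda r: {**r, 'predicted_sentiment': 'Neutral'})
rows = map_rows(rows, lambda r: {**r, 'is_offensive': 'Not Offensive'})
rows = map_rows(rows, lambda r: {**r, 'emotion': 'joy'})

# All three added columns survive because each stage fed the next:
print(sorted(rows[0]))  # ['emotion', 'is_offensive', 'predicted_sentiment', 'text']
```

Had the second stage been assigned to a different variable while the third mapped over the original, the 'is_offensive' column would be absent from the final rows, which is exactly the bug flagged above.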

Comment thread jimin/news.py
Comment on lines +110 to +117
def analyze_emotion(example):
    """Analyze the emotion of 'english_summary' and add it to the 'emotion' column"""
    # The emotion pipeline takes text as a list, so wrap the input text in a list.
    emotion_result = emotion_classifier(example['english_summary'])
    # Because of top_k=1, the result has the shape [[{'label': '...', 'score': ...}]]
    example['emotion'] = emotion_result[0][0]['label']
    example['emotion_score'] = emotion_result[0][0]['score']  # store the score as well
    return example


⚠️ Potential issue | 🔴 Critical

Fix KeyError: 'english_summary' field does not exist in the dataset.

The analyze_emotion function attempts to access example['english_summary'], but this field is never created in the dataset. The dataset only contains the 'text' field from the tweet_eval dataset. This will cause a KeyError when the function runs.

Apply this diff to use the correct field:

 def analyze_emotion(example):
-    """Analyze the emotion of 'english_summary' and add it to the 'emotion' column"""
+    """Analyze the emotion of the tweet text and add it to the 'emotion' column"""
     # The emotion pipeline takes text as a list, so wrap the input text in a list.
-    emotion_result = emotion_classifier(example['english_summary'])
+    emotion_result = emotion_classifier(example['text'])
     # Because of top_k=1, the result has the shape [[{'label': '...', 'score': ...}]]
     example['emotion'] = emotion_result[0][0]['label']
     example['emotion_score'] = emotion_result[0][0]['score']  # store the score as well
     return example
📝 Committable suggestion

Suggested change
def analyze_emotion(example):
    """Analyze the emotion of the tweet text and add it to the 'emotion' column"""
    # The emotion pipeline takes text as a list, so wrap the input text in a list.
    emotion_result = emotion_classifier(example['text'])
    # Because of top_k=1, the result has the shape [[{'label': '...', 'score': ...}]]
    example['emotion'] = emotion_result[0][0]['label']
    example['emotion_score'] = emotion_result[0][0]['score']  # store the score as well
    return example
🤖 Prompt for AI Agents
In jimin/news.py around lines 110-117, the function accesses a non-existent key
'english_summary' causing KeyError; change it to use the dataset's actual field
'text', pass that text as a single-item list into the emotion classifier, and
assign the returned label and score into example['emotion'] and
example['emotion_score']; also guard against missing text by using
example.get('text', '') or returning the example unchanged if no text is
present.
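The nested [[{...}]] shape noted in the comment can be unpacked safely once the field name is fixed. The sample result below is fabricated for illustration, not output from a real classifier:

```python
# With top_k=1, a transformers text-classification pipeline returns one
# inner list per input text: [[{'label': ..., 'score': ...}]].
emotion_result = [[{'label': 'joy', 'score': 0.98}]]  # fabricated sample

example = {'text': 'some tweet'}
example['emotion'] = emotion_result[0][0]['label']        # first text, top label
example['emotion_score'] = emotion_result[0][0]['score']  # its score

print(example['emotion'], round(example['emotion_score'], 2))  # joy 0.98
```

The double index [0][0] reads as "first input text, highest-ranked label", which is why a flat [0]['label'] access would fail on this shape.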
