
3기_4주차_박지민#12

Open
zzmnxn wants to merge 4 commits into HateSlop:main from zzmnxn:jimin

Conversation

@zzmnxn

@zzmnxn zzmnxn commented Oct 8, 2025

Sentiment and offensive-language classification on the tweet_eval dataset

Summary by CodeRabbit

  • New Features

    • New news processing script that runs sentiment, offensive-content, and emotion analyses on tweet samples with visualizations.
  • Enhancements

    • GPU/CPU device auto-detection for faster inference and progress logging with sample previews.
    • Notebooks and scripts include runnable dataset loading, preprocessing, and richer UI outputs.
  • Documentation

    • Tutorial-style cells and examples guiding the full pipeline.

@coderabbitai

coderabbitai bot commented Oct 8, 2025

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title Check ❓ Inconclusive The title “3기_4주차_박지민” is a generic assignment label that does not describe the changes introduced in this pull request, so it fails to convey the core updates such as adding a news processing script and enhancing notebooks with Hugging Face pipelines and execution metadata. Please rename the pull request to a clear, concise summary of the primary change, for example “Add jimin/news.py sentiment/offensive/emotion pipelines and update Hugging Face notebooks with execution metadata.”
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fcf7c2 and e65b06d.

📒 Files selected for processing (3)
  • huggingface_basics.ipynb (32 hunks)
  • jimin/news.py (1 hunks)
  • transformer.ipynb (14 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
jimin/news.py

116-116: Undefined name display

(F821)

transformer.ipynb

116-116: Undefined name display

(F821)


186-186: Undefined name display

(F821)


248-248: Undefined name display

(F821)

huggingface_basics.ipynb

30-30: Found useless expression. Either assign it to a variable or remove it.

(B018)


122-122: Found useless expression. Either assign it to a variable or remove it.

(B018)


125-125: Found useless expression. Either assign it to a variable or remove it.

(B018)


161-161: Redefinition of unused pd from line 136

Remove definition: pd

(F811)

Comment thread huggingface_basics.ipynb
Comment on lines +1689 to 1706
generator = pipeline('text-generation', model='gpt2', device=0)  # when using a GPU
#generator = pipeline('text-generation', model='gpt2')  # when using the CPU (default)

# Prompt (starting context) for text generation
prompt = "Once upon a time, in a land far, far away,"

# GPT-2 has no default pad token, so use the eos token as the pad token (common practice)
if gpt_tokenizer.pad_token is None:
    gpt_tokenizer.pad_token = gpt_tokenizer.eos_token
# Run text generation with the pipeline
# max_length: maximum length of the generated text (including the prompt)
# num_return_sequences: how many different sequences to return
# The result is a list whose elements are dictionaries.
results = generator(prompt, max_length=50, num_return_sequences=1)

# 2. Prepare input text
text = "What is Huggingface Transformer?"
encoded_input = gpt_tokenizer(text, return_tensors='pt')
input_ids = encoded_input['input_ids']
# Print the generated text
print(f"\nPrompt: {prompt}")
print("--- Generated text ---")
for result in results:
    print(result['generated_text'])


⚠️ Potential issue | 🟠 Major

Don’t hard-code GPU usage for generator

Setting device=0 forces the GPU even on machines (Colab or local) that lack one, leading to AssertionError: Torch not compiled with CUDA enabled. Reuse the earlier device guard (0 or -1) so the cell runs everywhere.

-generator = pipeline('text-generation', model='gpt2', device=0)  # when using a GPU
-#generator = pipeline('text-generation', model='gpt2')  # when using the CPU (default)
+device = 0 if torch.cuda.is_available() else -1
+generator = pipeline('text-generation', model='gpt2', device=device)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Choose device at runtime: GPU if available else CPU
device = 0 if torch.cuda.is_available() else -1
generator = pipeline('text-generation', model='gpt2', device=device)
# Prompt (starting context) for text generation
prompt = "Once upon a time, in a land far, far away,"
# Run text generation with the pipeline
# max_length: maximum length of the generated text (including the prompt)
# num_return_sequences: how many different sequences to return
# The result is a list whose elements are dictionaries.
results = generator(prompt, max_length=50, num_return_sequences=1)
# Print the generated text
print(f"\nPrompt: {prompt}")
print("--- Generated text ---")
for result in results:
    print(result['generated_text'])
🤖 Prompt for AI Agents
In huggingface_basics.ipynb around lines 1689 to 1706, the generator pipeline is
hard-coded with device=0 which forces GPU and causes an assertion on machines
without CUDA; change the pipeline call to use the previously computed device
guard (e.g., device variable that is 0 when CUDA is available and -1 otherwise)
so the pipeline becomes device-aware and runs on CPU when no GPU is present;
update the commented GPU/CPU examples to reference this device variable instead
of hard-coding 0.
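The device guard the review recommends can be sketched on its own. This is a minimal, hedged version: it also handles the case where torch itself is not installed, which the review does not require, by falling back to the CPU index.

```python
# Pick a transformers pipeline device index at runtime:
# 0 (first CUDA GPU) if available, -1 (CPU) otherwise.
# Also falls back to CPU when torch is not installed at all,
# so this snippet runs in any environment.
try:
    import torch
    device = 0 if torch.cuda.is_available() else -1
except ImportError:
    device = -1

print(f"pipeline device index: {device}")
# A pipeline would then be created as (not executed here):
# generator = pipeline('text-generation', model='gpt2', device=device)
```

Passing this index via the device= keyword is what lets the same cell run on both GPU and CPU machines.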

Comment thread jimin/news.py
Comment on lines +1 to +7
import datasets
from datasets import load_dataset, DatasetDict
from transformers import pipeline
import torch
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


⚠️ Potential issue | 🔴 Critical

Fix the display NameError in script execution

Running this .py file outside a notebook will raise NameError: name 'display' is not defined, halting the pipeline before results print. Add a safe import (with fallback) near the top so the script works in both notebook and CLI runs.

+try:
+    from IPython.display import display
+except ImportError:
+    def display(value):
+        print(value)

Also applies to: 109-117

🤖 Prompt for AI Agents
In jimin/news.py at lines 1-7 (and similarly around lines 109-117) the script
uses display which exists in notebooks but raises NameError when run as a CLI
script; add a safe import/fallback near the top: attempt to import display from
IPython.display inside a try/except and on ImportError define a no-op fallback
function (or a simple wrapper that prints the object) so calls to display work
both in notebooks and in non-interactive CLI runs; update any usage sites if
necessary to rely on this fallback.
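The fallback the bot suggests works as a standalone pattern: try the notebook import, and degrade to plain print() when IPython is absent. A minimal sketch:

```python
# Safe 'display' shim: use IPython's rich display inside notebooks,
# fall back to print() in plain CLI runs where IPython is missing.
try:
    from IPython.display import display
except ImportError:
    def display(value):
        print(value)

# Works identically in both environments (sample data is illustrative):
display({"predicted_sentiment": "Positive", "sentiment_score": 0.97})
```

Because the shim is defined before any call site, every later display(...) in the script resolves regardless of environment.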


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
jimin/news.py (3)

58-60: Refactor conditional statements for better readability.

The if-elif-else chain uses multiple statements on one line, which reduces readability and violates PEP 8 style guidelines.

Apply this diff to fix the formatting:

-    if label == 'LABEL_0': label_name = 'Negative'
-    elif label == 'LABEL_1': label_name = 'Neutral'
-    else: label_name = 'Positive'
+    if label == 'LABEL_0':
+        label_name = 'Negative'
+    elif label == 'LABEL_1':
+        label_name = 'Neutral'
+    else:
+        label_name = 'Positive'

Alternatively, use a dictionary for cleaner mapping:

-    if label == 'LABEL_0': label_name = 'Negative'
-    elif label == 'LABEL_1': label_name = 'Neutral'
-    else: label_name = 'Positive'
+    label_map = {'LABEL_0': 'Negative', 'LABEL_1': 'Neutral', 'LABEL_2': 'Positive'}
+    label_name = label_map.get(label, 'Positive')

90-91: Refactor conditional statements for better readability.

Similar to the sentiment analysis function, these conditional statements should be split across multiple lines for better readability.

Apply this diff:

-    if label == 'LABEL_1': label_name = 'Offensive'
-    else: label_name = 'Not Offensive'
+    if label == 'LABEL_1':
+        label_name = 'Offensive'
+    else:
+        label_name = 'Not Offensive'

Or use a dictionary:

-    if label == 'LABEL_1': label_name = 'Offensive'
-    else: label_name = 'Not Offensive'
+    label_map = {'LABEL_1': 'Offensive', 'LABEL_0': 'Not Offensive'}
+    label_name = label_map.get(label, 'Not Offensive')
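The dictionary alternative can be exercised on its own. The label names below mirror the ones used in the review; the .get() default guards against labels the maps do not cover.

```python
# Map raw pipeline labels to human-readable names; .get() supplies a
# default for any unexpected label instead of raising KeyError.
sentiment_map = {'LABEL_0': 'Negative', 'LABEL_1': 'Neutral', 'LABEL_2': 'Positive'}
offensive_map = {'LABEL_0': 'Not Offensive', 'LABEL_1': 'Offensive'}

print(sentiment_map.get('LABEL_1', 'Positive'))      # Neutral
print(offensive_map.get('LABEL_1', 'Not Offensive')) # Offensive
print(sentiment_map.get('LABEL_9', 'Positive'))      # unknown label -> default
```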

127-128: Add emotion analysis columns to the display once the emotion analysis is fixed.

After fixing the analyze_emotion function, remember to include the emotion analysis results in the preview columns.

Apply this diff:

 # Select only the columns needed for the final check
-columns_to_show = ['text', 'predicted_sentiment', 'sentiment_score', 'is_offensive', 'offensive_score']
+columns_to_show = ['text', 'predicted_sentiment', 'sentiment_score', 'is_offensive', 'offensive_score', 'emotion', 'emotion_score']
 df_final_preview = df_final[columns_to_show]

Similarly, update the sample output loop at lines 134-138 to display emotion results:

for i in range(min(3, len(final_dataset))):
    print(f"\n--- Sample {i+1} ---")
    text_preview = final_dataset[i]['text'][:70].replace('\n', ' ')
    print(f"Text: {text_preview}...")
    print(f"Predicted sentiment: {final_dataset[i]['predicted_sentiment']} (Score: {final_dataset[i]['sentiment_score']:.3f})")
    print(f"Offensive: {final_dataset[i]['is_offensive']} (Score: {final_dataset[i]['offensive_score']:.3f})")
    print(f"Emotion: {final_dataset[i]['emotion']} (Score: {final_dataset[i]['emotion_score']:.3f})")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e65b06d and b3f569d.

📒 Files selected for processing (1)
  • jimin/news.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
jimin/news.py

58-58: Multiple statements on one line (colon)

(E701)


59-59: Multiple statements on one line (colon)

(E701)


60-60: Multiple statements on one line (colon)

(E701)


90-90: Multiple statements on one line (colon)

(E701)


91-91: Multiple statements on one line (colon)

(E701)

🔇 Additional comments (3)
jimin/news.py (3)

1-16: LGTM!

The imports and device configuration are properly set up. The device selection logic correctly falls back to CPU when CUDA is not available.


17-35: LGTM!

Dataset loading and initial setup look correct. The subset selection and label information display are properly implemented.


1-10: Past review comment about display NameError is no longer applicable.

The previous review flagged a potential NameError for the display function. However, the current code does not use display() anywhere—it only uses print() statements. This past concern has been resolved.

Comment thread jimin/news.py
return example

# Apply to the dataset
final_dataset = processing_subset.map(analyze_offensive)


⚠️ Potential issue | 🔴 Critical

Fix dataset processing chain to preserve previous results.

The offensive classification results are assigned to final_dataset, but Line 120 overwrites this by mapping analyze_emotion on processing_subset (which only has sentiment results). This means the final dataset will be missing offensive classification data.

Apply this diff to fix the processing chain:

 # Apply to the dataset
-final_dataset = processing_subset.map(analyze_offensive)
+processing_subset = processing_subset.map(analyze_offensive)
 print("\n[Task 2] Hate/offensive classification complete. Added 'is_offensive' column.")

This ensures that emotion analysis (line 120) is applied on top of both sentiment and offensive classifications.

📝 Committable suggestion

Suggested change
# Apply to the dataset
processing_subset = processing_subset.map(analyze_offensive)
print("\n[Task 2] Hate/offensive classification complete. Added 'is_offensive' column.")
🤖 Prompt for AI Agents
In jimin/news.py around lines 98 and 120, the offensive classification results
are assigned to final_dataset at line 98 but later line 120 maps analyze_emotion
over processing_subset, which drops the offensive labels; change the emotion
mapping to run on the accumulated final_dataset (i.e., map analyze_emotion on
final_dataset instead of processing_subset) so the pipeline composes sentiment
-> offensive -> emotion and preserves previous results.
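The composition bug can be illustrated without the datasets library. The toy map_rows() below stands in for Dataset.map() (which likewise returns a new dataset with added columns), and the hypothetical column names mirror the script; rebinding the same variable at each stage is what makes the columns accumulate.

```python
# Toy stand-in for Dataset.map(): applies fn to a copy of each row and
# returns a new list, like datasets returns a new Dataset object.
def map_rows(rows, fn):
    return [fn(dict(r)) for r in rows]

rows = [{'text': 'some tweet'}]
rows = map_rows(rows, lambda r: {**r, 'predicted_sentiment': 'Neutral'})
rows = map_rows(rows, lambda r: {**r, 'is_offensive': 'Not Offensive'})
rows = map_rows(rows, lambda r: {**r, 'emotion': 'joy'})

# All three added columns survive because each stage fed the next:
print(sorted(rows[0]))  # ['emotion', 'is_offensive', 'predicted_sentiment', 'text']
```

Had the second stage been assigned to a different variable while the third mapped over the original, the 'is_offensive' column would be absent from the final rows, which is exactly the bug flagged above.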

Comment thread jimin/news.py
Comment on lines +110 to +117
def analyze_emotion(example):
    """Analyze the emotion of 'english_summary' and add it to the 'emotion' column"""
    # The emotion pipeline takes text as a list, so wrap the input text in a list.
    emotion_result = emotion_classifier(example['english_summary'])
    # Because of top_k=1, the result has the shape [[{'label': '...', 'score': ...}]]
    example['emotion'] = emotion_result[0][0]['label']
    example['emotion_score'] = emotion_result[0][0]['score']  # store the score as well
    return example


⚠️ Potential issue | 🔴 Critical

Fix KeyError: 'english_summary' field does not exist in the dataset.

The analyze_emotion function attempts to access example['english_summary'], but this field is never created in the dataset. The dataset only contains the 'text' field from the tweet_eval dataset. This will cause a KeyError when the function runs.

Apply this diff to use the correct field:

 def analyze_emotion(example):
-    """Analyze the emotion of 'english_summary' and add it to the 'emotion' column"""
+    """Analyze the emotion of the tweet text and add it to the 'emotion' column"""
     # The emotion pipeline takes text as a list, so wrap the input text in a list.
-    emotion_result = emotion_classifier(example['english_summary'])
+    emotion_result = emotion_classifier(example['text'])
     # Because of top_k=1, the result has the shape [[{'label': '...', 'score': ...}]]
     example['emotion'] = emotion_result[0][0]['label']
     example['emotion_score'] = emotion_result[0][0]['score']  # store the score as well
     return example
📝 Committable suggestion

Suggested change
def analyze_emotion(example):
    """Analyze the emotion of the tweet text and add it to the 'emotion' column"""
    # The emotion pipeline takes text as a list, so wrap the input text in a list.
    emotion_result = emotion_classifier(example['text'])
    # Because of top_k=1, the result has the shape [[{'label': '...', 'score': ...}]]
    example['emotion'] = emotion_result[0][0]['label']
    example['emotion_score'] = emotion_result[0][0]['score']  # store the score as well
    return example
🤖 Prompt for AI Agents
In jimin/news.py around lines 110-117, the function accesses a non-existent key
'english_summary' causing KeyError; change it to use the dataset's actual field
'text', pass that text as a single-item list into the emotion classifier, and
assign the returned label and score into example['emotion'] and
example['emotion_score']; also guard against missing text by using
example.get('text', '') or returning the example unchanged if no text is
present.
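The nested [[{...}]] shape noted in the comment can be unpacked safely once the field name is fixed. The sample result below is fabricated for illustration, not output from a real classifier:

```python
# With top_k=1, a transformers text-classification pipeline returns one
# inner list per input text: [[{'label': ..., 'score': ...}]].
emotion_result = [[{'label': 'joy', 'score': 0.98}]]  # fabricated sample

example = {'text': 'some tweet'}
example['emotion'] = emotion_result[0][0]['label']        # first text, top label
example['emotion_score'] = emotion_result[0][0]['score']  # its score

print(example['emotion'], round(example['emotion_score'], 2))  # joy 0.98
```

The double index [0][0] reads as "first input text, highest-ranked label", which is why a flat [0]['label'] access would fail on this shape.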
