Statement Normalization Pipeline using OpenAI by dankim444 · Pull Request #24 · Watts-Lab/commonsense-statements

dankim444 · 2024-08-17T18:28:48Z

Description

This PR introduces a new statement normalization pipeline, cleans the remaining original statements in the raw_statements directory, and introduces minor changes to different files to streamline text extraction (specifically extracting the language code) from filenames. The criteria for normalization is as follows:

The first letter of the statement must be capitalized (if applicable to the language).
Leading and trailing punctuation is removed.
The statement ends in the appropriate full-stop punctuation native to the language.

The normalization pipeline leverages OpenAI, and the news_statements and observable statements were cleaned using gpt-4o while email_statements (due to the size of the files) files were cleaned with gpt-3.5-turbo. During this process, I noticed several differences in performance between the two models. Specifically, gpt-4o was more consistent in not changing the original capitalization of proper nouns, altering the original vocabulary, and not introducing any additional punctuation; whereas gpt-3.5-turbo would make changes despite being explicitly instructed not to in the system prompt. When merged, this PR will close Watts-Lab/commonsense-platform#150, ensuring consistent rendering of statements on the commonsense platform's UI.

New files

normalize_statements_openai.py: script that cleans statements files that have yet to be cleaned in the raw_statements directory.
remove_duplicates_after_normalization.py: script that handles duplicates caused by running the normalize_statements_openai.py script.

Changes

email_statements, news_statements, observable statements
Translate Statements and Remove Any Duplicates workflow: Added a third job 'normalize-statements' that cleans the statement files after they have been translated and removes potential duplicates from translations.
calculate_translation_cost.py: updated the way the language code is extracted from the filename and how filenames are processed.
remove_duplicates.py: minor change to documentation.
show_groups_of_duplicates: removed 'lng' as a column to avoid redundancy.
translate_statements_aws.py: changed how filenames are processed and how language code is extracted.
README.md: included instructions on naming convention of files and translation of files.

Testing

I acted as a "human-in-the-loop" to verify OpenAI's outputs. I used an online Diffchecker tool (https://www.diffchecker.com/) to compare changes made from the original file to the new file. I also used OpenAI playground to verify the system prompt.

Important note

To ensure more consistent output from OpenAI, I recommend using gpt-4o or possibly gpt-4o-mini to normalize the statements. In particular, gpt-3.5-turbo would sometimes remove the capitalization of proper nouns, alter some vocabulary and thereby change the nuanced meaning of some statements, and introduce unintended punctuation. I directly address all these in the system prompt; however, it is open to improvement.

…to 150-inconsistent-statements

Files changed: M raw_statements/email_statements.csv M raw_statements/email_statements_ar.csv M raw_statements/email_statements_bn.csv M raw_statements/email_statements_es.csv M raw_statements/email_statements_fr.csv M raw_statements/email_statements_hi.csv M raw_statements/email_statements_ja.csv M raw_statements/email_statements_pt.csv M raw_statements/email_statements_ru.csv M raw_statements/email_statements_zh.csv M raw_statements/news_statements_amir.csv M raw_statements/news_statements_amir_ar.csv M raw_statements/news_statements_amir_hi.csv M raw_statements/observable_gpt4o_ar.csv

…to 150-inconsistent-statements

…d_statements

markwhiting · 2024-08-19T13:39:50Z

Great. Can we switch to 4o for everything? (or have you already)

github-actions · 2024-10-03T10:24:43Z

Translation Cost Calculation

cleaned_statements_en.csv still needs to be translated into 9 new languages. This would require translating 12141 characters.
It will cost approximately $0.18 to complete these translations.

dankim444 and others added 22 commits July 18, 2024 14:53

create clean_statements.py script

d52ae27

Merge branch 'main' into 150-inconsistent-statements

32750da

Merge branch 'main' of github.com:Watts-Lab/commonsense-statements in…

9cb6e6a

…to 150-inconsistent-statements

add script for normalizing statements

879df18

add workflows

2299a5e

standardize statements

aa84242

remove redundant statement files

85e0dde

remove duplicates caused by normalization

e8bfbfa

update workflows

8488b26

implement openai normalization script

fc390f3

normalize news statements and observable statements

d4ca500

integrate normalization into workflows

993d7f3

fix format checking error

a2e38cf

update requirements

fcf2066

clean email statements and change model to 3.5-turbo

5adce1d

update calculate_translation_cost.py

dc111dd

Merge branch 'main' of github.com:Watts-Lab/commonsense-statements in…

18b5d6b

…to 150-inconsistent-statements

update calculate translation script

5805084

minor changes

de3996f

add english language code as suffix to original_statements and cleane…

9476833

…d_statements

update readme

15fdcae

dankim444 linked an issue Aug 17, 2024 that may be closed by this pull request

Treatment of statements is inconsistent Watts-Lab/commonsense-platform#150

Closed

dankim444 requested a review from amirrr August 17, 2024 18:28

dankim444 and others added 2 commits August 19, 2024 14:51

switch model from 3.5-turbo to 4o

6847958

Merge branch 'main' into 150-inconsistent-statements

de6dfcc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Statement Normalization Pipeline using OpenAI#24

Statement Normalization Pipeline using OpenAI#24
dankim444 wants to merge 24 commits intomainfrom
150-inconsistent-statements

dankim444 commented Aug 17, 2024

Uh oh!

markwhiting commented Aug 19, 2024 •

edited

Loading

Uh oh!

github-actions bot commented Oct 3, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dankim444 commented Aug 17, 2024

Description

New files

Changes

Testing

Important note

Uh oh!

markwhiting commented Aug 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Oct 3, 2024

Translation Cost Calculation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

markwhiting commented Aug 19, 2024 •

edited

Loading