Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: fix CommaSeparatedListOutputParser to handle columns that may contain commas in it #26365

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jkyamog
Copy link

@jkyamog jkyamog commented Sep 12, 2024

  • Description:
    Currently CommaSeparatedListOutputParser can't handle strings that may contain commas within a column. It would parse any commas as the delimiter.
    Ex.
    "foo, foo2", "bar", "baz"

It will create 4 columns: "foo", "foo2", "bar", "baz"

This should be 3 columns:

"foo, foo2", "bar", "baz"

  • Dependencies:
    Added 2 additional imports, but they are built in python packages.

import csv
from io import StringIO

  • Twitter handle: @jkyamog

  • Add tests and docs:

  1. added simple unit test test_multiple_items_with_comma

Copy link

vercel bot commented Sep 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 17, 2024 9:35pm

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. Ɑ: core Related to langchain-core 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Sep 12, 2024
Copy link
Member

@efriis efriis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! This is a breaking change in certain cases. Would it be possible to only use the new csv-based parsing logic in the event some flag is set on the output parser? e.g. CommaSeparatedListOutputParser(parse_quoted_cells=True)

@jkyamog
Copy link
Author

jkyamog commented Sep 17, 2024

Hey there! This is a breaking change in certain cases. Would it be possible to only use the new csv-based parsing logic in the event some flag is set on the output parser? e.g. CommaSeparatedListOutputParser(parse_quoted_cells=True)

Yes I can definitely just make it a flag. I didn't realize it was breaking other use cases. I did run the test on this, and I added 1 additional test. So my understanding it covered the previous cases. Before adding a new flag, I would like to see if it's possible to cover for all identified use cases? Can you give me an example where it gets broken?

For you convenience here are the current tests are:

test_single_item
    text = "foo"
    expected = ["foo"]

test_multiple_items_with_spaces
    text = "foo, bar, baz"
    expected = ["foo", "bar", "baz"]

test_multiple_items
    text = "foo,bar,baz"
    expected = ["foo", "bar", "baz"]

test_multiple_items_with_comma <- new test that I added based on the previous existing one.
    text = '"foo, foo2",bar,baz'
    expected = ["foo, foo2", "bar", "baz"]

Can you tell me what cases where it fails?

text = "FAILING CASE"
expected = ["foo", "bar" .....]

@efriis
Copy link
Member

efriis commented Sep 19, 2024

test_multiple_items_with_comma_existing_behavior
text = '" is a double quote character,another double quote character is ",this cell has no double quote'
expected = ['"this is a double quote character', 'another double quote character is "', 'this cell has no double quote']

note that the behavior will be different using the new flag (then it will match your test)

@jkyamog
Copy link
Author

jkyamog commented Sep 23, 2024

test_multiple_items_with_comma_existing_behavior text = '" is a double quote character,another double quote character is ",this cell has no double quote' expected = ['"this is a double quote character', 'another double quote character is "', 'this cell has no double quote']

note that the behavior will be different using the new flag (then it will match your test)

Sorry I only looked at this right now. Adding the flag on class level CommaSeparatedListOutputParser(parse_quoted_cells=True) it gives a lot of issues of existing test/code can't seem to instantiate it. Not sure, but maybe because it is being serialized and pydantic is not aware of the attribute parse_quoted_cells?

I then tried to see if I can just add the flag on the method level.

    assert parser.parse(text,parse_quoted_cells=True) == expected
    assert add(parser.transform(t for t in text)) == expected

This works on the .parse, but fails on .transform. I am not sure what is the purpose of .transform, I guess that is how its called down the chain? Appreciate some guidance on this, my initial thoughts that this would be simple change and even keeping the original behavior but I might be missing something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: core Related to langchain-core size:S This PR changes 10-29 lines, ignoring generated files.
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

2 participants