parse_json_markdown is unable to parse json strings with nested triple backticks #5428

Closed
2 of 14 tasks
schinto opened this issue May 30, 2023 · 9 comments · May be fixed by #26255

Comments

@schinto

schinto commented May 30, 2023

System Info

Langchain version 0.0.184, python 3.9.13
Function parse_json_markdown in langchain/output_parsers/json.py fails with input text string:
```json
{
"action": "Final Answer",
"action_input": "Here's a Python script to remove backticks at the beginning and end of a string:\n\n```python\ndef remove_backticks(s):\n return s.strip('`')\n\nstring_with_backticks = '`example string`'\nresult = remove_backticks(string_with_backticks)\nprint(result)\n```\n\nThis script defines a function called `remove_backticks` that takes a string as input and returns a new string with backticks removed from the beginning and end. It then demonstrates how to use the function with an example string."
}
```

Potential cause of error:
match.group(2) in the function parse_json_markdown contains only the text up to the second occurrence of triple backticks, i.e. the opening fence of the nested code block:

{
"action": "Final Answer",
"action_input": "Here's a Python script to remove backticks at the beginning and end of a string:\n\n

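The truncation can be reproduced outside LangChain. The sketch below assumes the parser uses a non-greedy pattern with two groups; that pattern is inferred from the behavior described above (group 2 stopping at the nested fence), not copied from the library source. The sample text is a shortened stand-in for the reported input, and `ASSUMED_PATTERN`, `FENCE` and `parse_failed` are names made up for this illustration.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here so the example stays readable

# Assumed pattern (inferred from the report, not taken from the library):
# a non-greedy capture that ends at the first closing fence it meets.
ASSUMED_PATTERN = re.compile(FENCE + r"(json)?(.*?)" + FENCE, re.DOTALL)

# Shortened stand-in for the reported input: fenced JSON whose string value
# contains a nested fenced code block.
text = (
    FENCE + "json\n"
    + '{\n"action": "Final Answer",\n'
    + '"action_input": "Script:\\n\\n' + FENCE + "python\\nprint(1)\\n" + FENCE + '"\n'
    + "}\n"
    + FENCE
)

captured = ASSUMED_PATTERN.search(text).group(2)
print(repr(captured))  # ends right before the nested fence

try:
    json.loads(captured.strip(), strict=False)
    parse_failed = False
except json.JSONDecodeError as exc:
    # The captured text stops in the middle of a JSON string value.
    parse_failed = True
    print("parse failed:", exc)
```

With a non-greedy group, the first fence after the opening one ends the match, so any nested fence inside a JSON string value cuts the capture short.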
Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

Called function parse_json_markdown in langchain/output_parsers/json.py with input text string:
```json
{
"action": "Final Answer",
"action_input": "Here's a Python script to remove backticks at the beginning and end of a string:\n\n```python\ndef remove_backticks(s):\n return s.strip('`')\n\nstring_with_backticks = '`example string`'\nresult = remove_backticks(string_with_backticks)\nprint(result)\n```\n\nThis script defines a function called `remove_backticks` that takes a string as input and returns a new string with backticks removed from the beginning and end. It then demonstrates how to use the function with an example string."
}
```

Expected behavior

Function parse_json_markdown should parse and return the following JSON:
{
"action": "Final Answer",
"action_input": "Here's a Python script to remove backticks at the beginning and end of a string:\n\n```python\ndef remove_backticks(s):\n return s.strip('`')\n\nstring_with_backticks = '`example string`'\nresult = remove_backticks(string_with_backticks)\nprint(result)\n```\n\nThis script defines a function called `remove_backticks` that takes a string as input and returns a new string with backticks removed from the beginning and end. It then demonstrates how to use the function with an example string."
}

@schinto
Author

schinto commented May 31, 2023

Proposed fix:

import json
import re


def parse_json_markdown(json_string: str) -> dict:
    # Try to find JSON string within first and last triple backticks
    match = re.search(r"""```       # match first occurring triple backticks
                          (?:json)? # zero or one match of string json in non-capturing group
                          (.*)```   # greedy match to last triple backticks""", json_string, flags=re.DOTALL|re.VERBOSE)

    # If no match found, assume the entire string is a JSON string
    if match is None:
        json_str = json_string
    else:
        # If match found, use the content within the backticks
        json_str = match.group(1)

    # Strip whitespace and newlines from the start and end
    json_str = json_str.strip()

    # Parse the JSON string into a Python dictionary while allowing control characters by setting strict to False
    parsed = json.loads(json_str, strict=False)

    return parsed
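
A quick way to check the greedy approach against the kind of input from the report is sketched below. It repeats the proposed logic in compact form under a different name, `parse_json_markdown_greedy`, so it is not confused with the library function; that name, `FENCE`, and the shortened sample input are assumptions made for illustration.

```python
import json
import re

FENCE = "`" * 3  # three backticks


def parse_json_markdown_greedy(json_string: str) -> dict:
    # Greedy capture: from the first fence to the last fence in the string.
    match = re.search(FENCE + r"(?:json)?(.*)" + FENCE, json_string, flags=re.DOTALL)
    json_str = json_string if match is None else match.group(1)
    # strict=False allows control characters inside JSON strings.
    return json.loads(json_str.strip(), strict=False)


# Fenced JSON whose string value contains a nested fenced code block.
fenced = (
    FENCE + "json\n"
    + '{\n"action": "Final Answer",\n'
    + '"action_input": "Script:\\n\\n' + FENCE + "python\\nprint(1)\\n" + FENCE + '"\n'
    + "}\n"
    + FENCE
)

parsed = parse_json_markdown_greedy(fenced)
print(parsed["action"])
print(parsed["action_input"])
```

Because the capture runs to the last fence, the nested fence stays inside the JSON string value. This only holds when the whole payload is wrapped in an outer fence.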

@yassineselmi

yassineselmi commented Aug 11, 2023

@schinto It looks like the proposed fix doesn't work either.

I have this output returned by the LLM:

{
    "action": "Final Answer",
    "action_input": "Sure! Here's an example Python code to create an S3 bucket using the Boto3 library:\n\n```python\nimport boto3\n\n# Create an S3 client\ns3 = boto3.client('s3')\n\n# Create a new S3 bucket\nbucket_name = 'your-bucket-name'\ns3.create_bucket(Bucket=bucket_name)\n\n# Print the bucket creation status\nresponse = s3.list_buckets()\nfor bucket in response['Buckets']:\n    if bucket['Name'] == bucket_name:\n        print('Bucket created successfully!')\n        break\n```"
}

And the regex always starts matching at the inner triple backticks ( ```python ...), since they are the first ones in the string. Consequently, the json_str value is the Python code in this case, which is not JSON and cannot be loaded with json.loads().
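
The behavior described here can be reproduced with a shortened version of that output. The sketch below also shows one possible workaround, offered as an assumption rather than as the library's approach: try the raw text as JSON first, and only fall back to fence extraction if that fails. `parse_json_lenient` and `FENCE` are hypothetical names introduced for this example.

```python
import json
import re

FENCE = "`" * 3  # three backticks


def parse_json_lenient(text: str) -> dict:
    """Hypothetical helper: try the raw text as JSON before looking for fences."""
    stripped = text.strip()
    try:
        return json.loads(stripped, strict=False)
    except json.JSONDecodeError:
        pass  # not plain JSON, so look for an enclosing fence instead
    match = re.search(FENCE + r"(?:json)?(.*)" + FENCE, stripped, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object or fenced block found")
    return json.loads(match.group(1).strip(), strict=False)


# Unfenced JSON whose string value contains a fenced code block.
unfenced = (
    '{\n    "action": "Final Answer",\n'
    + '    "action_input": "Code:\\n\\n' + FENCE + "python\\nimport boto3\\n" + FENCE + '"\n'
    + "}"
)

# Fence-first extraction picks up the Python snippet rather than the JSON.
greedy = re.search(FENCE + r"(?:json)?(.*)" + FENCE, unfenced, flags=re.DOTALL)
print(repr(greedy.group(1)))

# Trying the raw text first handles the unfenced case.
result = parse_json_lenient(unfenced)
print(result["action"])

# A fenced payload still works through the fallback path.
fenced_result = parse_json_lenient(FENCE + "json\n" + unfenced + "\n" + FENCE)
```

This ordering covers both the fenced and the unfenced shapes shown in this thread; it does not handle JSON that is preceded or followed by free text outside a fence.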

@schinto
Author

schinto commented Aug 11, 2023

@yassineselmi the LLM output should be enclosed in triple backticks, like this:

```json
{
"action": "Final Answer",
"action_input": "text with code block ```python\nimport boto3 \n\n```"
}
```

If these are missing, then the parse_json_markdown function may need further changes.

@i-arnab

i-arnab commented Sep 7, 2023

Hi Team,

I am facing a similar issue while using GraphSparqlQAChain with an LLM on RDF graph data. The model now generates correct SPARQL queries with the correct intent, but they are enclosed in triple backticks (```). As a result, the SPARQL query execution fails and no insights are generated from the prompts.

The generated SPARQL looks like:
(triple-backticks)<sparql-query>(triple-backticks)
And the final error message looks like:

ParseException: Expected {SelectQuery | ConstructQuery | DescribeQuery | AskQuery}, found '`' (at char 0), (line:1, col:1)

Can anyone kindly help me with this?
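
One possible workaround for the fenced-query case, sketched under assumptions: strip a single surrounding fence from the generated text before it is handed to the SPARQL engine. `strip_code_fence` and `FENCE` are hypothetical names, not part of LangChain or any SPARQL library, and where such a step would hook into GraphSparqlQAChain is not shown here.

```python
import re

FENCE = "`" * 3  # three backticks

# Optional language tag on its own first line, then the body, then the
# closing fence at the very end of the text.
_FENCED = re.compile(
    r"^\s*" + FENCE + r"(?:[A-Za-z0-9_-]+[ \t]*\n)?(.*?)" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(text: str) -> str:
    """Hypothetical helper: return the body of one surrounding fence, else the text."""
    match = _FENCED.match(text)
    return match.group(1).strip() if match else text.strip()


query = "SELECT ?s WHERE { ?s ?p ?o }"

print(strip_code_fence(FENCE + query + FENCE))
print(strip_code_fence(FENCE + "sparql\n" + query + "\n" + FENCE))
print(strip_code_fence(query))
```

A first line that consists of a single word directly after the opening fence is treated as a language tag, so this sketch would drop a query keyword placed alone on that line.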


dosubot bot commented Dec 7, 2023

Hi, @schinto

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, the parse_json_markdown function in langchain's json.py is failing to parse JSON strings with nested triple backticks. There was a proposed fix, but it was pointed out that the fix did not work as expected. Additionally, another user mentioned facing a similar issue with GraphSparqlQAChain langchain llm, where the model was creating SPARQL queries enclosed in triple backticks, causing SPARQL query execution to fail.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

dosubot bot added the stale label Dec 7, 2023
dosubot bot closed this as not planned Dec 14, 2023
dosubot bot removed the stale label Dec 14, 2023
@franciscbalint

Having this issue right now, anyone has a fix for this?

@eoinm24

eoinm24 commented Feb 13, 2024

Having this issue right now, anyone has a fix for this?

Also still having this issue

@live2awesome

Still facing the issue with JsonOutputParser(pydantic_object=JobDescriptionInfoExtract); the JsonOutputParser doesn't work properly with Pydantic.

@krishnakumar18

@yassineselmi the output by the LLM should be enclosed by triple backticks like

json { "action": "Final Answer", "action_input": "text with code block python\nimport boto3 \n\n" }

If these are missing, then the parse_json_markdown function may need further changes.

Hey there, do you know how to make a LangChain agent always return its JSON response in this format, i.e. enclosed in triple backticks?

LiMingchen159 added a commit to LiMingchen159/langchain that referenced this issue Sep 10, 2024
LiMingchen159 added a commit to LiMingchen159/langchain that referenced this issue Sep 10, 2024
7 participants