TextLoader is restricted to UTF-8 file encoding format and doesn't support dynamic encoding similar to Python Version #7923

Open
@gangadharrr

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

To reproduce the bug, use the code below:

import { TextLoader } from 'langchain/document_loaders/fs/text';
import { writeFileSync } from 'fs';

async function main() {
    const filePath = "src/sample.xsd"
    const sampleText = `<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="Employee">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <xs:element name="Department" type="xs:string"/>
        <xs:element name="Salary" type="xs:decimal"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>

</xs:schema>
`
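    // Write the sample as UTF-16LE (Node does not add a byte order mark here).
    writeFileSync(filePath, sampleText, { encoding: "utf16le" })

    // TextLoader reads the file as UTF-8, so the UTF-16LE bytes are mis-decoded.
    const textLoader = new TextLoader(filePath)
    const documents = await textLoader.load();
    console.log([documents[0].pageContent])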
}
main()

Error Message and Stack Trace (if applicable)

[
'<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00 \x00e\x00n\x00c\x00o\x00d\x00i\x00n\x00g\x00=\x00"\x00U\x00T\x00F\x00-\x008\x00"\x00?\x00>\x00\n' +
'\x00<\x00x\x00s\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00x\x00s\x00=\x00"\x00h\x00t\x00t\x00p\x00:\x00/\x00/\x00w\x00w\x00w\x00.\x00w\x003\x00.\x00o\x00r\x00g\x00/\x002\x000\x000\x001\x00/\x00X\x00M\x00L\x00S\x00c\x00h\x00e\x00m\x00a\x00"\x00>\x00\n' +
'\x00\n' +
'\x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00E\x00m\x00p\x00l\x00o\x00y\x00e\x00e\x00"\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00c\x00o\x00m\x00p\x00l\x00e\x00x\x00T\x00y\x00p\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00s\x00e\x00q\x00u\x00e\x00n\x00c\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00N\x00a\x00m\x00e\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00s\x00t\x00r\x00i\x00n\x00g\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00D\x00e\x00p\x00a\x00r\x00t\x00m\x00e\x00n\x00t\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00s\x00t\x00r\x00i\x00n\x00g\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00S\x00a\x00l\x00a\x00r\x00y\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00d\x00e\x00c\x00i\x00m\x00a\x00l\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00s\x00e\x00q\x00u\x00e\x00n\x00c\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00a\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00i\x00d\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00i\x00n\x00t\x00e\x00g\x00e\x00r\x00"\x00 \x00u\x00s\x00e\x00=\x00"\x00r\x00e\x00q\x00u\x00i\x00r\x00e\x00d\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00c\x00o\x00m\x00p\x00l\x00e\x00x\x00T\x00y\x00p\x00e\x00>\x00\n' +
'\x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00>\x00\n' +
'\x00\n' +
'\x00<\x00/\x00x\x00s\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00>\x00\n' +
'\x00'
]
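
The interleaved \x00 characters in the output above are what UTF-16LE bytes look like when they are decoded as UTF-8. The sketch below reproduces the effect without TextLoader or the file system. The helper name encodeUtf16le is hypothetical and exists only for this illustration; it assumes the input contains no characters outside the Basic Multilingual Plane that need special care (surrogate pairs are simply written unit by unit).

```typescript
// Encode a string as UTF-16LE bytes by hand (no byte order mark), which is
// the byte layout that writeFileSync(..., { encoding: "utf16le" }) produces.
function encodeUtf16le(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length * 2);
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i); // one UTF-16 code unit (0..0xFFFF)
    bytes[2 * i] = unit & 0xff;      // low byte first (little-endian)
    bytes[2 * i + 1] = unit >> 8;    // high byte second
  }
  return bytes;
}

const utf16Bytes = encodeUtf16le("<?xml");

// Decoding with the wrong encoding keeps every 0x00 high byte as a NUL character.
const decodedAsUtf8 = new TextDecoder("utf-8").decode(utf16Bytes);

// Decoding with the matching encoding recovers the original text.
const decodedAsUtf16 = new TextDecoder("utf-16le").decode(utf16Bytes);

console.log(JSON.stringify(decodedAsUtf8));  // "<\u0000?\u0000x\u0000m\u0000l\u0000"
console.log(JSON.stringify(decodedAsUtf16)); // "<?xml"
```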

LangChainbugReport.zip

Description

When a file is written with UTF-16LE encoding, TextLoader still decodes it as UTF-8, so the loaded pageContent contains a \x00 character after every ASCII character, as shown in the output above. The current implementation of TextLoader supports only UTF-8 and provides no way to override or automatically detect the encoding, unlike the Python version.
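
As a rough illustration of what automatic detection could look like, here is a minimal sketch of a byte-level heuristic. It is not how the Python loader detects encodings and not part of LangChain.js; the names guessEncoding and GuessedEncoding are hypothetical. It only distinguishes UTF-8 from UTF-16, relies on a byte order mark or on the NUL-byte pattern of mostly-ASCII text, and deliberately ignores UTF-32 and legacy code pages.

```typescript
type GuessedEncoding = "utf-8" | "utf-16le" | "utf-16be";

// Hypothetical helper: guess the encoding from the first bytes of a file.
function guessEncoding(bytes: Uint8Array, sampleSize = 1024): GuessedEncoding {
  // 1. A byte order mark, when present, is the strongest signal.
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";

  // 2. Without a BOM, mostly-ASCII UTF-16 text has a 0x00 byte in every
  //    second position: odd offsets for little-endian, even for big-endian.
  const n = Math.min(bytes.length, sampleSize);
  let evenNuls = 0;
  let oddNuls = 0;
  for (let i = 0; i < n; i++) {
    if (bytes[i] === 0) {
      if (i % 2 === 0) evenNuls++;
      else oddNuls++;
    }
  }
  const pairs = Math.floor(n / 2);
  if (pairs > 0 && oddNuls > pairs / 2 && evenNuls === 0) return "utf-16le";
  if (pairs > 0 && evenNuls > pairs / 2 && oddNuls === 0) return "utf-16be";

  // 3. Otherwise fall back to UTF-8, the loader's current behavior.
  return "utf-8";
}

// The first four bytes of the sample file from this report ("<?" in UTF-16LE, no BOM).
console.log(guessEncoding(Uint8Array.of(0x3c, 0x00, 0x3f, 0x00))); // utf-16le
```

A heuristic like this can misclassify short or non-Latin input, so an explicit encoding option is still the more predictable fix.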

Python version of TextLoader: https://github.com/langchain-ai/langchain/blob/b075eab3e0af9a578af80c6e38f869419e770b5c/libs/community/langchain_community/document_loaders/text.py#L46

JavaScript version of TextLoader:

text = await readFile(this.filePathOrBlob, "utf8");
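
Until the loader accepts an encoding, one possible workaround is to read the raw bytes and decode them explicitly. The sketch below is an assumption-laden illustration, not the LangChain.js API: loadTextWithEncoding, decodeText, and the LoadedText interface are hypothetical names, and the returned plain objects only mimic a document with pageContent and a source entry in metadata.

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical stand-in for a loaded document; not the LangChain Document class.
interface LoadedText {
  pageContent: string;
  metadata: { source: string };
}

// Decode raw bytes with a caller-supplied encoding label.
// TextDecoder strips a leading byte order mark by default.
function decodeText(bytes: Uint8Array, encoding: string = "utf-8"): string {
  return new TextDecoder(encoding).decode(bytes);
}

// Read the file without an encoding argument so readFile returns raw bytes,
// then decode them with the requested encoding instead of a hard-coded "utf8".
async function loadTextWithEncoding(
  filePath: string,
  encoding: string = "utf-8",
): Promise<LoadedText[]> {
  const bytes = await readFile(filePath);
  return [{ pageContent: decodeText(bytes, encoding), metadata: { source: filePath } }];
}

// Usage with the file from the reproduction above:
//   const docs = await loadTextWithEncoding("src/sample.xsd", "utf-16le");
//   console.log([docs[0].pageContent]);
```

Keeping the decoding step in its own function makes it easy to swap in a different default or a detection step later without touching the file access.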

System Info

Node Version: v20.18.2
Platform: macOS Sonoma 14.0
Language: TypeScript

Labels: auto:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
