Description
Checked other resources
- I added a very descriptive title to this issue.
- I searched the LangChain.js documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain.js rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
Reproduce the bug using the code below:
import { TextLoader } from 'langchain/document_loaders/fs/text';
import { writeFileSync } from 'fs';

async function main() {
  const filePath = "src/sample.xsd";
  const sampleText = `<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="Employee">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <xs:element name="Department" type="xs:string"/>
        <xs:element name="Salary" type="xs:decimal"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>

</xs:schema>
`;

  // Write the sample file as UTF-16LE (no BOM is written here).
  writeFileSync(filePath, sampleText, { encoding: "utf16le" });

  // TextLoader decodes the file as UTF-8, which produces the output below.
  const textLoader = new TextLoader(filePath);
  const documents = await textLoader.load();
  console.log([documents[0].pageContent]);
}

main();
Error Message and Stack Trace (if applicable)
[
'<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00 \x00e\x00n\x00c\x00o\x00d\x00i\x00n\x00g\x00=\x00"\x00U\x00T\x00F\x00-\x008\x00"\x00?\x00>\x00\n' +
'\x00<\x00x\x00s\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00x\x00s\x00=\x00"\x00h\x00t\x00t\x00p\x00:\x00/\x00/\x00w\x00w\x00w\x00.\x00w\x003\x00.\x00o\x00r\x00g\x00/\x002\x000\x000\x001\x00/\x00X\x00M\x00L\x00S\x00c\x00h\x00e\x00m\x00a\x00"\x00>\x00\n' +
'\x00\n' +
'\x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00E\x00m\x00p\x00l\x00o\x00y\x00e\x00e\x00"\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00c\x00o\x00m\x00p\x00l\x00e\x00x\x00T\x00y\x00p\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00s\x00e\x00q\x00u\x00e\x00n\x00c\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00N\x00a\x00m\x00e\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00s\x00t\x00r\x00i\x00n\x00g\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00D\x00e\x00p\x00a\x00r\x00t\x00m\x00e\x00n\x00t\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00s\x00t\x00r\x00i\x00n\x00g\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00S\x00a\x00l\x00a\x00r\x00y\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00d\x00e\x00c\x00i\x00m\x00a\x00l\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00s\x00e\x00q\x00u\x00e\x00n\x00c\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00a\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00i\x00d\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00i\x00n\x00t\x00e\x00g\x00e\x00r\x00"\x00 \x00u\x00s\x00e\x00=\x00"\x00r\x00e\x00q\x00u\x00i\x00r\x00e\x00d\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00c\x00o\x00m\x00p\x00l\x00e\x00x\x00T\x00y\x00p\x00e\x00>\x00\n' +
'\x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00>\x00\n' +
'\x00\n' +
'\x00<\x00/\x00x\x00s\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00>\x00\n' +
'\x00'
]
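The interleaved \x00 characters are what one would expect when UTF-16LE bytes are decoded as UTF-8: every ASCII character occupies two bytes in UTF-16LE, and the second byte is 0x00. A minimal sketch that shows the effect with Node's Buffer only, without LangChain (the variable names are illustrative):

```typescript
import { Buffer } from "node:buffer";

// Encode a short ASCII string as UTF-16LE: each character becomes two bytes,
// the character code followed by 0x00.
const utf16Bytes = Buffer.from("<?xml", "utf16le");

// Decoding those bytes as UTF-8 keeps every 0x00 byte as a NUL character.
const misdecoded = utf16Bytes.toString("utf8");

// Decoding with the encoding that was used for writing restores the text.
const correctlyDecoded = utf16Bytes.toString("utf16le");

console.log(JSON.stringify(misdecoded)); // "<\u0000?\u0000x\u0000m\u0000l\u0000"
console.log(correctlyDecoded); // <?xml
```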
Description
When a file is written with the utf16le encoding, TextLoader does not load it correctly: the bytes are decoded as utf8, so pageContent contains a \x00 character after every character of the original text. The current implementation of TextLoader only supports utf8 and doesn't provide a way to override or automatically detect the encoding, unlike the Python version.
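As a possible workaround until TextLoader accepts an encoding, the file can be decoded manually before a document is built from it. The sketch below is only an illustration: `readTextWithEncoding` and `detectEncodingFromBom` are hypothetical helpers, not part of LangChain.js, and the detection only looks at a byte order mark. A file written with `writeFileSync(..., { encoding: "utf16le" })` has no BOM, so the encoding has to be passed explicitly in that case.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper types and functions, not part of LangChain.js.
type SupportedEncoding = "utf8" | "utf16le";

// Returns the encoding implied by a leading byte order mark, if there is one.
function detectEncodingFromBom(bytes: Uint8Array): SupportedEncoding | undefined {
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return "utf16le";
  }
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf8";
  }
  return undefined;
}

// Reads a file as text. An explicit encoding wins, then a BOM, then utf8.
function readTextWithEncoding(filePath: string, encoding?: SupportedEncoding): string {
  const bytes = readFileSync(filePath);
  const resolved = encoding ?? detectEncodingFromBom(bytes) ?? "utf8";
  const text = bytes.toString(resolved);
  // Buffer.toString keeps a decoded BOM (U+FEFF), so drop it if present.
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Demo: write a UTF-16LE file without a BOM and read it back.
function demo(): void {
  const dir = mkdtempSync(join(tmpdir(), "textloader-encoding-"));
  const filePath = join(dir, "sample.xsd");
  writeFileSync(filePath, '<xs:element name="Employee"/>', { encoding: "utf16le" });
  console.log(readTextWithEncoding(filePath, "utf16le"));
}

demo();
```

The decoded string could then be wrapped in a `Document` with `pageContent` set to the text and the file path stored in `metadata`, which gives roughly what TextLoader returns for a utf8 file.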
Python version of TextLoader: https://github.com/langchain-ai/langchain/blob/b075eab3e0af9a578af80c6e38f869419e770b5c/libs/community/langchain_community/document_loaders/text.py#L46
JavaScript version of TextLoader:
System Info
Node Version: v20.18.2
Platform: macOS Sonoma 14.0
Language: TypeScript