Invalid escape sequences in regex in the pipeline YAML #11093

@sarahkiener

Bug Description
When serializing a pipeline containing regex patterns to YAML using pipe.dumps() and then deserializing it with Pipeline.loads(), the operation fails due to invalid escape sequences in the generated YAML.

Steps to Reproduce

  1. Create a pipeline using the Document Cleaner example from the documentation
  2. Serialize the pipeline to YAML using pipe.dumps()
  3. Deserialize the YAML back to a pipeline using Pipeline.loads()
  4. Run the pipeline

The deserialization fails with an error about unexpected characters, e.g., #x0008, caused by YAML interpreting \b (intended as a regex word boundary) as a backspace character escape sequence.

The error is caused by pipe.dumps() generating YAML with single backslashes in regex patterns (e.g., \b, \w), which are then interpreted as escape sequences. To represent literal backslashes in regex patterns, YAML requires double backslashes (\\b, \\w).
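The SyntaxWarning in the error message further down suggests that the YAML text also passed through a regular (non-raw) Python string literal at some point. The following is a minimal sketch, using only the standard library, of how a single backslash in such a literal turns \b into the #x0008 character named in the ReaderError. The variable names are illustrative and not part of Haystack.

```python
import re

# The regex from the error output, written three ways as Python literals.
raw = r"(?u)\b\w+\b"        # raw string: backslashes are kept as-is
doubled = "(?u)\\b\\w+\\b"  # doubled backslashes: same text as the raw string
plain = "(?u)\b\\w+\b"      # regular literal: "\b" becomes U+0008 (backspace)

print(raw == doubled)    # True
print("\x08" in plain)   # True: the pattern now contains backspace characters
print("\x08" in raw)     # False

# Only the variants with real backslashes behave as a word-boundary regex.
print(re.findall(raw, "hello world"))    # ['hello', 'world']
print(re.findall(plain, "hello world"))  # []
```

If the serialized pipeline is pasted into source code for testing, a raw string or doubled backslashes keeps the pattern intact before it reaches the YAML loader.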

Proposed Solution
The pipe.dumps() method should escape each backslash in regex strings as a double backslash when generating YAML output.
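As a rough illustration of the escaping this proposal describes, the sketch below uses the standard library json module as a stand-in, on the assumption that its double-quoted strings treat \\ and \b the same way YAML double-quoted scalars do. It is not Haystack's or PyYAML's actual serialization code, and the variable names are hypothetical.

```python
import json

pattern = r"(?u)\b\w+\b"

# A serializer that escapes correctly doubles each backslash inside double quotes.
encoded = json.dumps(pattern)
print(encoded)  # "(?u)\\b\\w+\\b"

# The doubled form round-trips back to the original pattern.
assert json.loads(encoded) == pattern

# With a single backslash inside double quotes, \b is read as a backspace.
broken = '"(?u)\\b"'  # the text "(?u)\b" with one backslash
decoded = json.loads(broken)
print(decoded == "(?u)\x08")  # True: the word boundary is lost
```

The round trip succeeds only when the writer and the reader agree on the quoting style, which is the property the serialized pipeline needs for its regex parameters.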

Expected behavior
The pipeline should serialize to valid YAML and deserialize successfully without errors.

Error message
SyntaxWarning: "\w" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\w"? A raw string is also an option.
  bm25_tokenization_regex: '(?u)\b\w+\b'
Traceback (most recent call last):
  File "/haystack/core/pipeline/base.py", line 308, in loads
    deserialized_data = marshaller.unmarshal(data)
  File "/haystack/lib/python3.14/site-packages/haystack/marshal/yaml.py", line 40, in unmarshal
    return yaml.load(data_, Loader=YamlLoader)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/haystack/lib/python3.14/site-packages/yaml/__init__.py", line 79, in load
    loader = Loader(stream)
  File "/haystack/lib/python3.14/site-packages/yaml/loader.py", line 34, in __init__
    Reader.__init__(self, stream)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/haystack/lib/python3.14/site-packages/yaml/reader.py", line 74, in __init__
    self.check_printable(stream)
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/haystack/lib/python3.14/site-packages/yaml/reader.py", line 143, in check_printable
    raise ReaderError(self.name, position, ord(character),
            'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #x0008: special characters are not allowed
  in "<unicode string>", position 1121

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/haystack/lib/python3.14/site-packages/haystack/core/pipeline/base.py", line 310, in loads
    raise DeserializationError(
    ...<2 lines>...
    ) from e
haystack.core.errors.DeserializationError: Error while unmarshalling serialized pipeline data. This is usually caused by malformed or invalid syntax in the serialized representation.

Labels: P1 (High priority, add to the next sprint)
