Add source names (via new `Stream` and `SourceSpan` classes) and `.span()` combinator #83

tsani · 2025-01-19T14:33:12Z

Following from #82:

The optional filename should probably be named 'source' and be documented something like "a string representing the source document e.g. a filename". This allows for other things e.g. URLs.

I've gone ahead with the name source. That makes the most sense to me as it could be something more abstract like <stdin> or a URL as you mentioned.

I haven't thought this through, but perhaps if the source is present, then mark() should automatically add it as a 4th element in the tuple.

I opted against changing mark at all, since this would cause parsers involving it to break when parsing a data stream equipped with a source.

Or, we could have mark_source() which would have the new behaviour and leave mark() as it is. This would allow us to return something more useful like a namedtuple so that there are named fields. In this case mark() would be soft deprecated (i.e. no warnings, just docs that suggest using mark_source() instead.

This approach seemed the best to me. I created a dataclass SourceSpan to hold the start&end row&column alongside the source, adjusted the line_info and line_info_at helpers to account for the source, and introduced the method .span() as the improved version of .mark() to augment the result of the parser with a SourceSpan object.

The tricky part of the PR was that it wasn't so simple to "just thread a source name through the parsers." The parser objects themselves are completely stateless -- all the state is held within the data stream, which is just a string, list, or bytes object.

I created a class Stream to wrap the underlying data stream, making it possible to add extra fields; in this case, that's just source. Then .parse() takes a Stream instead of "raw" data. This does create a breaking change in the API, as anyone calling parse must pass a Stream as input. (Fixing the tests to account for this was super fun.)

I believe that introducing Stream is important for the future since it's common for parsers to work on data in a truly streaming fashion. The current design of parsy requires all the data to parse to be buffered upfront, so adding genuine streaming will take a lot more effort.

To eliminate the API breakage, I added a somewhat ugly isinstance check in .parse() to convert to a Stream (with no source name) when the user provides something else, so that this can be a patch release instead of a minor release.

This primarily wraps the str|bytes|list that is the data to parse, but also adds the metadata `source` to hold a filename, URL, etc. where the data is from. Introducing this class also paves the way for eventually supporting streaming input data.

Wrap the string, bytes, list into a Stream before calling parse.

tsani · 2025-01-19T15:02:03Z

Sorry about the million force-pushes to my local branch obsuring the history above. I wasn't able to get tox working locally, so I debugged using GH actions on my fork. Everything should be good to go now.

I made a couple of changes to the workflows: namely removing python 3.7 which is de jure unsupported by parsy at this point, and is unavailable in GH actions now anyway. I added runners for python 3.12 and 3.13, for which I added a kludge in setup.py since setuptools isn't part of python 3.12 onward.

spookylukey · 2025-01-22T20:49:32Z

Hi @tsani - thanks so much for your work on this.

At this point in its life, parsy is a pretty mature library, so breaking API compatibility is a really big deal, and not something that I would consider at this point for this feature. Breaking the parse() entry point especially is not something we want to do, since it affects literally every user.

However, I think there shouldn't be a need to do that.

The first thing I think we can do is add a source keyword argument to parse() and parse_partial(). Those functions can then be responsible for creating the Stream object where necessary.

Somewhat harder is that we need to keep the interface for Parser constructor and specifically the @Parsy usage, like this in the docs:

def consume(n):

    @Parser
    def consumer(stream, index):
        items = stream[index:index + n]
        if len(items) == n:
            return Result.success(index + n, items)
        else:
            return Result.failure(index, "{0} items".format(n))

    return consumer

This means that stream here needs to continue to behave as before, which means it really needs to be a str instance (or bytes, if that was passed). Mimicking the __getitem__ interface makes it work for the example above, but in general won't be enough - there will be a lot of code that relies on it being a str or bytes and uses other methods.

This is a harder constraint, but there are some ways forward:

first, could make it so that we only wrap in Stream if that parse(source=...) keyword arg is passed. This means that existing code that doesn't pass that arg runs exactly the same as before.
second, we could get a bit tricky and make our Stream object subclass str or bytes - you might been StrStream and BytesStream, do some isinstance checks, and complain if some other type of objects is passed along with source=...

Proof of concept code:

class StrStream(str):
    def __new__(cls, string, source):
        instance = super().__new__(cls, string)
        instance.source = source
        return instance


>>> s = StrStream('some text', 'myfile.txt')
>>> s
'some text'

>>> s.split(' ')
['some', 'text']

>>> s.source
'myfile.txt'

>>> isinstance(s, str)
True

My expectation is that there shouldn't be any need to change any of the existing test suite - it should run without modification, any breakage is telling us that we've possibly broken someone else's code too.

spookylukey · 2025-01-22T21:17:57Z

BTW - I have done some work on the CI etc., and switched to uv for packaging, and merged that to master, so those parts of the PR shouldn't be needed any more. You might want to start a new branch and cherry-pick what you need. Sorry for the extra work!

tsani · 2025-01-23T13:12:50Z

Hey @spookylukey, thanks for the input on this! I made a new PR with the changes here: #85

tsani changed the title ~~Feature/source span and stream~~ Add source names (via new Stream and SourceSpan classes) and .span() combinator Jan 19, 2025

tsani force-pushed the feature/source-span-and-stream branch from e64a7f5 to 6579cae Compare January 19, 2025 14:41

add Stream to represent input data

ad1f90f

This primarily wraps the str|bytes|list that is the data to parse, but also adds the metadata `source` to hold a filename, URL, etc. where the data is from. Introducing this class also paves the way for eventually supporting streaming input data.

tsani force-pushed the feature/source-span-and-stream branch from cf9c189 to 58b00dd Compare January 19, 2025 14:53

tsani added 8 commits January 19, 2025 09:55

fix all existing tests to support Stream

114e5ef

Wrap the string, bytes, list into a Stream before calling parse.

document Stream

b81c83b

allow calling .parse() with legacy streams

a475321

add .span() method

7dd0a91

document .span() and SourceSpan

1c4856f

add .span() test

d4c2586

adjust line_info tests to include sources

d384337

actions: drop unsupported py37 runner

52ac956

tsani force-pushed the feature/source-span-and-stream branch from 58b00dd to 52ac956 Compare January 19, 2025 14:56

tsani added 2 commits January 19, 2025 09:57

actions: add runners for 3.13, 3.13

f049377

make setup.py a no-op on python>=3.12

ef9b425

tsani mentioned this pull request Jan 23, 2025

Add source names (via new Stream and SourceSpan classes) and .span() combinator (take 2) #85

Open

tsani closed this Jan 23, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add source names (via new `Stream` and `SourceSpan` classes) and `.span()` combinator #83

Add source names (via new `Stream` and `SourceSpan` classes) and `.span()` combinator #83

Uh oh!

tsani commented Jan 19, 2025 •

edited

Loading

Uh oh!

tsani commented Jan 19, 2025 •

edited

Loading

Uh oh!

spookylukey commented Jan 22, 2025

Uh oh!

spookylukey commented Jan 22, 2025

Uh oh!

tsani commented Jan 23, 2025

Uh oh!

Uh oh!

Add source names (via new Stream and SourceSpan classes) and .span() combinator #83

Add source names (via new Stream and SourceSpan classes) and .span() combinator #83

Uh oh!

Conversation

tsani commented Jan 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tsani commented Jan 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

spookylukey commented Jan 22, 2025

Uh oh!

spookylukey commented Jan 22, 2025

Uh oh!

tsani commented Jan 23, 2025

Uh oh!

Uh oh!

Add source names (via new `Stream` and `SourceSpan` classes) and `.span()` combinator #83

Add source names (via new `Stream` and `SourceSpan` classes) and `.span()` combinator #83

tsani commented Jan 19, 2025 •

edited

Loading

tsani commented Jan 19, 2025 •

edited

Loading