Skip to content

Add source names (via new Stream and SourceSpan classes) and .span() combinator #83

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

tsani
Copy link

@tsani tsani commented Jan 19, 2025

Following from #82:

The optional filename should probably be named 'source' and be documented something like "a string representing the source document e.g. a filename". This allows for other things e.g. URLs.

I've gone ahead with the name source. That makes the most sense to me as it could be something more abstract like <stdin> or a URL as you mentioned.

I haven't thought this through, but perhaps if the source is present, then mark() should automatically add it as a 4th element in the tuple.

I opted against changing mark at all, since this would cause parsers involving it to break when parsing a data stream equipped with a source.

Or, we could have mark_source() which would have the new behaviour and leave mark() as it is. This would allow us to return something more useful like a namedtuple so that there are named fields. In this case mark() would be soft deprecated (i.e. no warnings, just docs that suggest using mark_source() instead.

This approach seemed the best to me. I created a dataclass SourceSpan to hold the start&end row&column alongside the source, adjusted the line_info and line_info_at helpers to account for the source, and introduced the method .span() as the improved version of .mark() to augment the result of the parser with a SourceSpan object.

The tricky part of the PR was that it wasn't so simple to "just thread a source name through the parsers." The parser objects themselves are completely stateless -- all the state is held within the data stream, which is just a string, list, or bytes object.

I created a class Stream to wrap the underlying data stream, making it possible to add extra fields; in this case, that's just source. Then .parse() takes a Stream instead of "raw" data. This does create a breaking change in the API, as anyone calling parse must pass a Stream as input. (Fixing the tests to account for this was super fun.)

I believe that introducing Stream is important for the future since it's common for parsers to work on data in a truly streaming fashion. The current design of parsy requires all the data to parse to be buffered upfront, so adding genuine streaming will take a lot more effort.

To eliminate the API breakage, I added a somewhat ugly isinstance check in .parse() to convert to a Stream (with no source name) when the user provides something else, so that this can be a patch release instead of a minor release.

@tsani tsani changed the title Feature/source span and stream Add source names (via new Stream and SourceSpan classes) and .span() combinator Jan 19, 2025
@tsani tsani force-pushed the feature/source-span-and-stream branch from e64a7f5 to 6579cae Compare January 19, 2025 14:41
This primarily wraps the str|bytes|list that is the data to parse, but
also adds the metadata `source` to hold a filename, URL, etc. where the
data is from.

Introducing this class also paves the way for eventually supporting
streaming input data.
@tsani tsani force-pushed the feature/source-span-and-stream branch from cf9c189 to 58b00dd Compare January 19, 2025 14:53
@tsani tsani force-pushed the feature/source-span-and-stream branch from 58b00dd to 52ac956 Compare January 19, 2025 14:56
@tsani
Copy link
Author

tsani commented Jan 19, 2025

Sorry about the million force-pushes to my local branch obsuring the history above. I wasn't able to get tox working locally, so I debugged using GH actions on my fork. Everything should be good to go now.

I made a couple of changes to the workflows: namely removing python 3.7 which is de jure unsupported by parsy at this point, and is unavailable in GH actions now anyway. I added runners for python 3.12 and 3.13, for which I added a kludge in setup.py since setuptools isn't part of python 3.12 onward.

@spookylukey
Copy link
Member

Hi @tsani - thanks so much for your work on this.

At this point in its life, parsy is a pretty mature library, so breaking API compatibility is a really big deal, and not something that I would consider at this point for this feature. Breaking the parse() entry point especially is not something we want to do, since it affects literally every user.

However, I think there shouldn't be a need to do that.

The first thing I think we can do is add a source keyword argument to parse() and parse_partial(). Those functions can then be responsible for creating the Stream object where necessary.

Somewhat harder is that we need to keep the interface for Parser constructor and specifically the @Parsy usage, like this in the docs:

def consume(n):

    @Parser
    def consumer(stream, index):
        items = stream[index:index + n]
        if len(items) == n:
            return Result.success(index + n, items)
        else:
            return Result.failure(index, "{0} items".format(n))

    return consumer

This means that stream here needs to continue to behave as before, which means it really needs to be a str instance (or bytes, if that was passed). Mimicking the __getitem__ interface makes it work for the example above, but in general won't be enough - there will be a lot of code that relies on it being a str or bytes and uses other methods.

This is a harder constraint, but there are some ways forward:

  • first, could make it so that we only wrap in Stream if that parse(source=...) keyword arg is passed. This means that existing code that doesn't pass that arg runs exactly the same as before.
  • second, we could get a bit tricky and make our Stream object subclass str or bytes - you might been StrStream and BytesStream, do some isinstance checks, and complain if some other type of objects is passed along with source=...

Proof of concept code:

class StrStream(str):
    def __new__(cls, string, source):
        instance = super().__new__(cls, string)
        instance.source = source
        return instance


>>> s = StrStream('some text', 'myfile.txt')
>>> s
'some text'

>>> s.split(' ')
['some', 'text']

>>> s.source
'myfile.txt'

>>> isinstance(s, str)
True

My expectation is that there shouldn't be any need to change any of the existing test suite - it should run without modification, any breakage is telling us that we've possibly broken someone else's code too.

@spookylukey
Copy link
Member

BTW - I have done some work on the CI etc., and switched to uv for packaging, and merged that to master, so those parts of the PR shouldn't be needed any more. You might want to start a new branch and cherry-pick what you need. Sorry for the extra work!

@tsani
Copy link
Author

tsani commented Jan 23, 2025

Hey @spookylukey, thanks for the input on this! I made a new PR with the changes here: #85

@tsani tsani closed this Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants