
Intermediate transform stage between read and write phases #12698

Open · bpatath opened this issue Jul 27, 2024 · 5 comments
Labels: type:proposal (a feature suggestion)

Comments

bpatath commented Jul 27, 2024

Is your feature request related to a problem? Please describe.

Yes. I am writing a Sphinx extension and here is a simplified version of my issue:

  1. I want some directive in file A to generate some things in the environment.
    .. create_thing:: foo
    .. create_thing:: bar
    
  2. I want some directive in file B to display a list of all of these things.
    .. list_things::
    
  3. I want some role in file C to be able to reference the things rendered in B.
    A link to :thing:`bar` in file B.
    

The first step is typically done in parallel in the read phase, where I can fill the environment, either directly in the directive's run method, or using special nodes and a Transform.
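For illustration, here is a minimal sketch of what step 1 looks like on my side (CreateThingDirective and env.my_things are placeholder names for this example, not a real extension):

    from sphinx.util.docutils import SphinxDirective

    class CreateThingDirective(SphinxDirective):
        required_arguments = 1

        def run(self):
            # record the thing in the environment; no nodes are emitted here
            things = getattr(self.env, "my_things", None)
            if things is None:
                things = self.env.my_things = {}
            things[self.arguments[0]] = {"docname": self.env.docname}
            return []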

The third step is typically done in parallel in the write phase, where I can just look for a target document/id in my environment and replace the role node with a reference node, either using the resolved event, or a PostTransform.
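And a rough sketch of step 3 as a PostTransform (ThingRef is the placeholder node my role emits, my_things the same example store as above):

    from docutils import nodes
    from sphinx.transforms.post_transforms import SphinxPostTransform

    class ThingRefTransform(SphinxPostTransform):
        default_priority = 800

        def run(self, **kwargs):
            things = getattr(self.env, "my_things", {})
            for node in self.document.findall(ThingRef):
                target = things.get(node["refid"], {}).get("target")
                if target is None:
                    continue  # a real extension would warn here
                # may raise NoUri on builders that do not produce URLs
                uri = self.app.builder.get_relative_uri(self.env.docname, target["doc"])
                ref = nodes.reference("", node["refid"], refuri=uri + "#" + target["id"])
                node.replace_self(ref)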

But I can't find a way to implement the second step. The reading phase is too soon, because we can't have a list of things to display in B until after A has been read. But the writing phase is too late, because we would need to store the target IDs of the things displayed in B before C is written.

Describe the solution you'd like

I initially implemented steps 2 and 3 as PostTransforms with increasing priority, as I thought that a higher-priority PostTransform would run on all doctrees before lower-priority PostTransforms were applied. Since the point is that the writing phase can be parallelized, I understand why this is not the case.

The two ideas that come to mind are:

  • Another registry of PostTransforms that are all applied before the writing phase. These could also be applied in order of (priority, then document) instead of (document, then priority). But this would mean a new name, documentation, internal changes, maintenance cost, etc.
  • Or a new event that would allow an extension to run a transform on all updated doctrees, again before the writing phase. Extensions would receive the list of all updated doctrees and would be able to run transforms in the correct order (sketched just below).
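To make the second idea concrete, usage could look something like this (the event name and signature are purely hypothetical, nothing like it exists in Sphinx today):

    def apply_intermediate_transforms(app, env, updated_doctrees):
        # would run once, between the read and write phases, over every updated doctree
        for doctree in updated_doctrees:
            # expand <ThingList> placeholders and record their targets in the env,
            # so that the normal PostTransforms can later resolve <ThingRef>
            ...

    def setup(app):
        app.connect("env-doctrees-updated", apply_intermediate_transforms)  # hypothetical event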

For the record, I also tried to define an event handler for env-check-consistency, which actually sits between the two phases. But the list of updated doctrees is not available there, as it is a local variable of builder functions, not a property of the builder. And if I try to build such a list myself and retrieve the doctrees, there is no easy API to do so. BuildEnvironment does define a get_doctree method, but it does not include access to the read cache, and for some reason my changes wouldn't persist to the writing phase anyway.

Describe alternatives you've considered
I could not see an alternative, but maybe I am missing something in how Sphinx works?

bpatath added the type:proposal label on Jul 27, 2024
AA-Turner (Member) commented:

But I can't find a way to implement the second step. The reading phase is too soon, because we can't have a list of things to display in B until after A has been read. But the writing phase is too late, because we would need to store the target IDs of the things displayed in B before C is written.

Add a placeholder node and then use a post-transform to replace the placeholder node with the list of things.

See for example, https://github.com/AA-Turner/cpython/blob/docs/audit-event/Doc/tools/extensions/audit_events.py
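Roughly this pattern (placeholder names, a generic sketch rather than the code from that file):

    from docutils import nodes
    from sphinx.util.docutils import SphinxDirective

    class thing_list(nodes.General, nodes.Element):
        """Placeholder node, swapped out by a post-transform once reading is done."""

    class ListThingsDirective(SphinxDirective):
        def run(self):
            return [thing_list("")]

    def setup(app):
        app.add_node(thing_list)
        app.add_directive("list_things", ListThingsDirective)
        # a SphinxPostTransform (not shown) later replaces thing_list with the real content
        app.add_post_transform(ThingListTransform)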

A

bpatath (Author) commented Jul 27, 2024

Hey, thanks for the prompt reply!

That part is already what I am doing, but I don't think it helps with my issue. Following my example, if I go with a placeholder and a PostTransform in the write phase, this is what I would typically have after the read phase:

env:

things: {
    foo: { ... },
    bar: { ... }
}

A.doctree: empty, directives just updated the env
B.doctree:

<ThingList>

C.doctree:

A link to <ThingRef reftype="thing" refid="foo">

Then, I would need to:

  • first run the PostTransform on B to transform <ThingList> into actual nodes, including <target>s, and also mutate the env to store each target's document and id (see the sketch after this list).
    • Note that since the <target>s are for foo and bar, and not for the list itself, I couldn't have generated them while reading B.
    • B.doctree would look like
      <paragraph>
         <target id="thing-foo">
         <target id="thing-bar">
      
    • env would look like
      things: {
          foo: { target: { doc: "B", id: "thing-foo" } },
          bar: { target: { doc: "B", id: "thing-bar"} },
      }
      
  • and then run a PostTransform on C to replace <ThingRef> with an actual <reference>, using the mutated environment to retrieve the target document and id, required to generate the URL.
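For the first bullet, the transform would look roughly like this (same placeholder names as before; a sketch, not my actual code):

    from docutils import nodes
    from sphinx.transforms.post_transforms import SphinxPostTransform

    class ThingListTransform(SphinxPostTransform):
        default_priority = 400  # lower than the <ThingRef> transform, so it runs first on a given doctree

        def run(self, **kwargs):
            things = getattr(self.env, "my_things", {})
            for node in self.document.findall(thing_list):
                replacement = []
                for name in sorted(things):
                    target_id = f"thing-{name}"
                    replacement.append(nodes.target("", "", ids=[target_id]))
                    replacement.append(nodes.paragraph(text=name))
                    # record where each thing ended up, for the <ThingRef> transform
                    things[name]["target"] = {"doc": self.env.docname, "id": target_id}
                node.replace_self(replacement)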

But the writing phase is not ordered, so if the writing of C happens before the writing of B, the environment does not yet contain the target information and the references cannot be resolved.
And even if some kind of writing order were possible, I could have a case with a list in B and a ref in C, as well as a list in C and a ref in B, meaning that no writing order would be valid.

Sphinx allows:

  • Replace <ThingList>, then replace <ThingRef>, for file B.
  • Then (or in parallel), replace <ThingList>, then replace <ThingRef>, for file C.

But not:

  • Replace <ThingList> for files B and C.
  • Then replace <ThingRef> for files B and C (possibly in parallel).

I feel like my issue arises from the fact that the target associated with an object is not necessarily in the file where the object is "defined". Most examples just use the read phase to generate a <target> node and store the target information in the env, making it directly available for the write phase.
Here, the target information cannot be generated in the read phase, but it still needs to exist before the write phase, hence my thought about a "middle" phase.

(I hope I am not being too unclear; I find this issue not that easy to describe...)

AA-Turner (Member) commented:

  • Note that since the <target>s are for foo and bar, and not for the list itself, I couldn't have generated them while reading B.

Ah, I didn't realise the targets were for items in the list.

Perhaps you could use write-started?
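i.e. something along these lines (the handler name is just an example):

    def on_write_started(app, builder):
        # runs once, after reading has finished and before any document is written
        ...

    def setup(app):
        app.connect("write-started", on_write_started)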

A

bpatath (Author) commented Jul 28, 2024

Perhaps you could use write-started?

Naming- and phase-wise, I guess this event makes sense.
Now my issue is how to actually apply the transform in the event handler. The two issues I'm facing are:

  • I don't have access to the list of updated docnames. The list is built inside Builder.build and then Builder.write. I guess the final list could be considered to be the one passed to Builder.prepare_writing? But all of the manipulations on the list of docnames are performed on local variables, so they are not accessible from the event handler.
    Even the env-get-updated event, which is used to add files to the updated set, does not have access to the initial set of updated files.
    Am I missing a way to get such a list? Or maybe this list should either be passed to the event handler, or stored as a Builder property, since the event handler does get a reference to the builder?
    Or am I supposed to use a combination of env-before-read-docs and env-purge-doc to recreate this list myself?
    Edit: env-before-read-docs gets a list, but of files to re-read. The list of files that ends up being written is a combination of re-read files, files returned by env-get-updated from all extensions, and the list of files passed to the builder, depending on how it is invoked. I don't think I can re-create this list...

  • Say I have the list of updated docnames, how would I get the actual doctrees for them? I tried to copy the behavior of BuildEnvironment.get_and_resolve_doctree, but accessing _write_doc_doctree_cache from the extension felt a bit hacky. Is there any reason it is not part of get_doctree?
    (I did try and failed to make it work. But now that I am writing this, I realize that I copied the pop call on the cache, which is probably why the changes did not persist u_u)

bpatath (Author) commented Jul 28, 2024

Update regarding the two issues:

  • I tried doing dependency management myself, in the environment. It works, but I did have to change events and go with env-updated for a very simple reason: it is emitted before the environment is pickled. The final workflow is:

    • use env-before-read-docs to get all re-read docnames
    • read docs in parallel, record everything that changed in the environment
    • use env-updated to:
      • determine which docs are out of date using dependencies stored in the env and the list of things that changed
      • for each docname in (re-read docnames + out of date)
        • use the PostTransform to replace placeholders with actual nodes/targets and store the targets in the env
        • during the PostTransform, update the dependency information in the environment
      • return the out-of-date docnames, to make sure they are re-written
    • the env is then pickled, containing all target information + dependency information

    It works, but it feels a bit hacky to use the env-updated event to perform transforms, doesn't it? (A sketch of this handler is at the end of this comment.)

    When I figured out that the environment was pickled this early, I also started wondering how Sphinx manages "write dependencies"... and it doesn't? A simple example: file A has a :ref: to a section in file B; update file B to remove the section; rebuild => only file B is written again, and file A's output still holds a reference to a now-nonexistent section.
    Is this a known issue? Is there a plan to work around this, or is it just something Sphinx is not meant to be capable of?

  • Copying the behavior of get_and_resolve_doctree (without the pop) did work, but again, using _write_doc_doctree_cache feels hacky. Any way that this could become part of get_doctree or another helper method somehow?

         from sphinx.util.docutils import LoggingReporter

         try:
             # the freshly-read doctree is kept in this (private) cache until it is written
             doctree = env._write_doc_doctree_cache[docname]
             doctree.settings.env = env
             doctree.reporter = LoggingReporter(str(env.doc2path(docname)))
         except KeyError:
             # fall back to the pickled doctree on disk
             doctree = env.get_doctree(docname)
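Putting it together, the env-updated handler described above looks roughly like this (the my_* helpers and attributes are this extension's own bookkeeping, not Sphinx APIs):

    def on_env_updated(app, env):
        outdated = my_compute_outdated(env)  # from the dependencies stored in the env
        for docname in set(env.my_reread_docnames) | set(outdated):
            doctree = my_get_doctree(env, docname)  # the try/except shown above, without the pop
            # expand <ThingList>, record targets + dependencies in the env
            my_expand_thing_lists(app, env, docname, doctree)
        # docnames returned here are considered updated and will be (re-)written
        return outdated

    def setup(app):
        app.connect("env-updated", on_env_updated)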
    
