Contributor

@franzpoeschel franzpoeschel commented Mar 5, 2021

Opening a Series with many iterations is currently time-intensive because all iterations are parsed eagerly.
This PR adds an option to parse a Series lazily. If that option is chosen, Iteration::open() must be called before an iteration is accessed. ReadIterations does that automatically.

I previously attempted parsing iterations automatically upon accessing them via the Container interface, but there were too many edge cases and it's probably better to do such things explicitly in parallel situations.
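
The explicit-open workflow described above can be illustrated with a minimal, self-contained sketch. ToyIteration, ToySeries, and the deferIterationParsing flag are hypothetical stand-ins invented for this illustration. They are not the openPMD-api classes or signatures, and the actual parsing work is replaced by setting a flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <vector>

// Toy model only: hypothetical types, NOT the openPMD-api classes.
class ToyIteration
{
public:
    // Runs the (potentially expensive) parse; repeated calls change nothing.
    ToyIteration &open()
    {
        m_parsed = true; // a real implementation would read metadata here
        return *this;
    }

    bool parsed() const { return m_parsed; }

    // Any data access would first check that parsing has happened.
    void requireParsed() const
    {
        if (!m_parsed)
            throw std::runtime_error("call open() before accessing this iteration");
    }

private:
    bool m_parsed = false;
};

class ToySeries
{
public:
    ToySeries(std::vector<std::uint64_t> const &indices, bool deferIterationParsing)
    {
        for (auto index : indices)
        {
            ToyIteration &iteration = m_iterations[index];
            if (!deferIterationParsing)
                iteration.open(); // eager default: parse everything up front
        }
        // Postcondition: eager mode parsed everything, deferred mode nothing.
        assert(parsedCount() == (deferIterationParsing ? 0u : m_iterations.size()));
    }

    ToyIteration &at(std::uint64_t index) { return m_iterations.at(index); }

    std::size_t parsedCount() const
    {
        std::size_t count = 0;
        for (auto const &entry : m_iterations)
            count += entry.second.parsed() ? 1 : 0;
        return count;
    }

private:
    std::map<std::uint64_t, ToyIteration> m_iterations;
};
```

In this toy model, constructing a series in deferred mode costs nothing per iteration, and only the iterations that are explicitly opened get parsed. Keeping the open() call explicit, rather than hiding it behind container access, mirrors the preference stated above for explicit behavior in parallel situations.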

TODO:

@ax3l ax3l self-requested a review March 9, 2021 18:34
@ax3l ax3l self-assigned this Mar 9, 2021
@ax3l ax3l added the api: new additions to the API label Mar 9, 2021
@franzpoeschel franzpoeschel force-pushed the topic-lazy-parsing branch 3 times, most recently from 051ede0 to 18cbf90 Compare March 16, 2021 12:32
@franzpoeschel franzpoeschel force-pushed the topic-lazy-parsing branch 2 times, most recently from 0b0ddaf to 9f5cbc1 Compare March 26, 2021 15:11
@franzpoeschel franzpoeschel changed the title [WIP] lazy parsing of iterations lazy parsing of iterations Mar 26, 2021
@franzpoeschel franzpoeschel force-pushed the topic-lazy-parsing branch 2 times, most recently from 33ebc35 to 2671df4 Compare March 29, 2021 12:28
@ax3l ax3l changed the title lazy parsing of iterations lazy parsing of iterations & Series Builder Pattern Mar 29, 2021
Member

@ax3l ax3l left a comment

added inline :)

private:
Iteration();

struct DeferredRead
Member

@ax3l ax3l Mar 29, 2021

Let's add a doxygen string here, just to clarify what this switch does.
Would DeferredReadAccess be more explicit? Not sure, ... maybe.

Contributor Author

If we rename this one anyway, I'd go for DeferredParseAccess to avoid confusion with deferred load_chunks

Member

@ax3l ax3l left a comment

The implementation of the lazy/deferred parsing reads well. One general comment: I would unify the naming for this in the code base. At the moment we use:

  • lazy
  • deferred read
  • deferred access
  • (not) yet accessed

all to describe the same concept, which can be confusing.

Should we skip the introduction of the term "lazy" and instead use "deferred (iteration) access" in all places?


Do we want to make this a Series constructor parameter or JSON option?


Should we introduce the builder pattern in a separate PR and/or use it more consistently throughout the code base & examples?

Contributor Author

@franzpoeschel franzpoeschel left a comment

> The implementation of the lazy/deferred parsing reads well. One general comment: I would unify the naming for this in the code base. At the moment we use:
>
> * lazy
> * deferred read
> * deferred access
> * (not) yet accessed
>
> all to describe the same concept, which can be confusing.
>
> Should we skip the introduction of the term "lazy" and instead use "deferred (iteration) access" in all places?

I've renamed things now:

parse_lazily -> defer_iteration_parsing
NotYetAccessed -> ParseAccessDeferred
Iteration::deferRead -> Iteration::deferParseAccess
DeferredRead -> DeferredParseAccess
m_deferredRead -> m_deferredParseAccess
parseLazily -> runDeferredParseAccess

> Do we want to make this a Series constructor parameter or JSON option?

Good idea, done

> Should we introduce the builder pattern in a separate PR and/or use it more consistently throughout the code base & examples?

I introduced the builder pattern because it is a better way to provide a constructor with more than one defaulted argument. Since we're now back at one defaulted argument, we don't necessarily need it anymore. I've isolated the builder pattern to a branch franzpoeschel/topic-builder now. Should I open a PR?

Also, we can definitely use it more in the examples and tests. In the code base itself probably not so much, since we seldom create a Series object ourselves.
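
For context on the trade-off: with several defaulted constructor arguments, a caller who wants to change only a later one has to spell out all earlier defaults, whereas a builder names each setting. The following is a minimal sketch of that idea. ToyAccess, ToySeriesConfig, ToySeriesBuilder, and all member names are hypothetical and invented here; this is not the code on the franzpoeschel/topic-builder branch.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical names throughout; this is an illustration, not openPMD-api code.
enum class ToyAccess
{
    ReadOnly,
    ReadWrite,
    Create
};

// The settings a constructor would otherwise take as (defaulted) arguments.
struct ToySeriesConfig
{
    std::string filepath;
    ToyAccess access = ToyAccess::ReadOnly;
    std::string options = "{}";
    bool deferIterationParsing = false;
};

class ToySeriesBuilder
{
public:
    ToySeriesBuilder &filepath(std::string path)
    {
        m_config.filepath = std::move(path);
        return *this;
    }

    ToySeriesBuilder &access(ToyAccess value)
    {
        m_config.access = value;
        return *this;
    }

    ToySeriesBuilder &options(std::string json)
    {
        m_config.options = std::move(json);
        return *this;
    }

    ToySeriesBuilder &deferIterationParsing(bool value)
    {
        m_config.deferIterationParsing = value;
        return *this;
    }

    // Returns the collected settings; everything not set keeps its default.
    ToySeriesConfig build() const
    {
        assert(!m_config.filepath.empty() && "a file path is required");
        return m_config;
    }

private:
    ToySeriesConfig m_config;
};
```

A caller could then write `ToySeriesBuilder().filepath("data_%T.h5").deferIterationParsing(true).build()` and leave the access mode and options at their defaults without repeating them.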

@franzpoeschel franzpoeschel changed the title lazy parsing of iterations & Series Builder Pattern lazy parsing of iterations ~~& Series Builder Pattern~~ Mar 30, 2021
@franzpoeschel franzpoeschel changed the title lazy parsing of iterations ~~& Series Builder Pattern~~ lazy parsing of iterations (& Series Builder Pattern – update: removed from this PR) Mar 30, 2021
std::unique_ptr< ParsedInput > parseInput(std::string);
// use a template in order not to expose nlohmann_json to users
template< typename JSON >
void parseJsonOptions( JSON const & );
Member

Since you only call this in the .cpp file, you can define this helper function in Series.cpp only, inside a namespace {} (as you already do), so that it is not exposed as a function or symbol at all :)

Contributor Author

@franzpoeschel franzpoeschel Apr 1, 2021

It's not that easy because that function accesses private members. But we could theoretically make all members of SeriesData public and inherit from it privately in SeriesInternal, so that the data struct can be passed to non-friend functions for manipulation?
Edit: Pushed a commit that does just that.
Edit: Not so easy either, this breaks dynamic casting. I've made it a public inheritance now, meaning that SeriesInternal has all members public now. Series still hides them.
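
The arrangement discussed in this thread can be sketched roughly as follows, as it might appear in a single .cpp file. ToyData, ToyInternal, ToyHandle, applyOption, and canDowncast are hypothetical names standing in for the roles of SeriesData, SeriesInternal, Series, and the JSON helper; the member layout is an assumption made for illustration and is not taken from the PR.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Plain data holder with public members (plays the role of SeriesData).
struct ToyData
{
    bool deferParsing = false;
    virtual ~ToyData() = default; // polymorphic, so dynamic_cast is possible
};

namespace
{
// File-local helper: needs no friend declaration and exports no symbol.
void applyOption(ToyData &data, std::string const &key, bool value)
{
    if (key == "defer_iteration_parsing")
        data.deferParsing = value;
}
} // namespace

// Public inheritance keeps the base-to-derived dynamic_cast working.
// With private inheritance, the cast in canDowncast() below would yield nullptr.
class ToyInternal : public ToyData
{
};

bool canDowncast(ToyData &data)
{
    return dynamic_cast<ToyInternal *>(&data) != nullptr;
}

// User-facing handle: the data members stay hidden behind this interface.
class ToyHandle
{
public:
    ToyHandle() : m_internal(std::make_shared<ToyInternal>()) {}

    void setOption(std::string const &key, bool value)
    {
        assert(m_internal != nullptr);
        applyOption(*m_internal, key, value);
    }

    bool defersParsing() const { return m_internal->deferParsing; }

private:
    std::shared_ptr<ToyInternal> m_internal;
};
```

The internal class exposes all data members in this sketch, but user code only ever sees the handle, which corresponds to the remark above that Series still hides them.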

uint64_t iteration{};

// support for std::tie
operator std::tuple< bool &, int &, uint64_t & >()
Member

Clever :)
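
For readers unfamiliar with the trick: a conversion operator to a tuple of references lets a struct be assigned directly to std::tie(...). A minimal sketch follows. Only the iteration member and the operator's signature appear in the snippet under review; the struct name Match, the other member names, and both helper functions are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical result type; member names other than `iteration` are assumed.
struct Match
{
    bool isContained{};
    int padding{};
    std::uint64_t iteration{};

    // support for std::tie: expose the members as a tuple of references
    operator std::tuple<bool &, int &, std::uint64_t &>()
    {
        return std::tuple<bool &, int &, std::uint64_t &>{
            isContained, padding, iteration};
    }
};

// Hypothetical helper: parses an all-digit string such as "000100".
Match parseIterationIndex(std::string const &digits)
{
    Match result;
    if (digits.empty())
        return result;
    for (char c : digits)
    {
        if (c < '0' || c > '9')
            return Match{}; // not a number: report "not contained"
        result.iteration =
            result.iteration * 10 + static_cast<std::uint64_t>(c - '0');
    }
    result.isContained = true;
    result.padding = static_cast<int>(digits.size());
    return result;
}

// Usage: unpack the struct into separate variables with std::tie.
std::uint64_t indexOrZero(std::string const &digits)
{
    bool isContained = false;
    int padding = 0;
    std::uint64_t iteration = 0;
    std::tie(isContained, padding, iteration) = parseIterationIndex(digits);
    assert(!isContained || padding == static_cast<int>(digits.size()));
    return isContained ? iteration : 0;
}
```

Compared with returning a std::tuple directly, this keeps named members inside the function that fills the result while still allowing callers to unpack it in one line.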

First attempt at an implementation of lazy parsing

todo:
* file-based layout
* iterator access

Basic file-based deferred parsing

Remove changes to Container class

Read eagerly by default

Read iteration upon Iteration::open

Expose to frontend and add some tests

Use Builder Pattern for Series
franzpoeschel and others added 12 commits April 1, 2021 12:25
In that case, we must not attempt to piece back together the filename by
filebased methods, but instead directly use the filename specified by
the user.
parse_lazily -> defer_iteration_parsing
NotYetAccessed -> ParseAccessDeferred
Iteration::deferRead -> Iteration::deferParseAccess
DeferredRead -> DeferredParseAccess
m_deferredRead -> m_deferredParseAccess
parseLazily -> runDeferredParseAccess
This makes all members of SeriesData public, but Series still hides
them.
@ax3l ax3l changed the title lazy parsing of iterations (& Series Builder Pattern – update: removed from this PR) lazy parsing of iterations Apr 5, 2021
Member

@ax3l ax3l left a comment

Thank you, this is great!
