Implement parsed file caching to speed things up on larger codebases #2055

@somewhatabstract

Description

Currently, checksync parses the files fresh on each run. On large codebases, this is pretty slow. There is the --fromCache mode, but it takes a complete parse set and doesn't track changes to the parsed files; it's just a snapshot of the parsed state for a given run.

By contrast, tools like Babel cache the processed result of each file and reuse it when they can. If a file is deemed to have changed, its cache entry is updated; otherwise, the cached entry is used.

Cache format

This requires us to parse files in a mostly configuration-agnostic manner. The settings that decide which files we parse and which we ignore would still need to be honored, but anything that only affects the errors we output should not affect the cache format.
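As a rough illustration of that separation, here is a minimal sketch. Every name in it (`CachedTag`, `CachedFile`, `ReportOptions`, `findMismatches`, and the `ignoredTargets` option) is hypothetical and not taken from checksync. The point is only that the cached data records raw facts about each tag, while options that shape the reported errors are applied after the cache is read.

```typescript
// All names here are hypothetical; none are taken from checksync's code.

// What the cache stores per tag: raw facts found while parsing.
interface CachedTag {
    id: string;
    line: number;
    recordedChecksum: string; // checksum written in the tag
    actualChecksum: string; // checksum computed from the tagged content
    target: string; // file the tag points at
}

interface CachedFile {
    tags: CachedTag[];
}

// Options that only shape the reported errors (assumed for illustration).
// They are applied after the cache is read, so they never change its format.
interface ReportOptions {
    ignoredTargets: string[];
}

// Returns the tags whose checksums disagree, skipping ignored targets.
function findMismatches(file: CachedFile, options: ReportOptions): CachedTag[] {
    return file.tags.filter(
        (tag) =>
            tag.recordedChecksum !== tag.actualChecksum &&
            options.ignoredTargets.indexOf(tag.target) === -1,
    );
}
```

With a shape like this, two runs with different reporting options could share the same cache entry and still produce different error output.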

Cache invalidation

The simplest approach would be to compare each file's modification time with the one recorded for its cache entry. However, if that isn't reliable, a one-way hash of the file contents could be used instead. It would slow down the first run, but as long as the hash calculation plus reuse of cached files is faster than a full parse, it would still be a win.

So, checksync would look in the cache for a parsed state of a given file, and if it is there, and it is considered "up-to-date", it would use that instead of re-parsing that file.
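One way to picture this lookup, as a sketch rather than a proposed implementation, is to check the modification time first and fall back to a content hash only when the times differ. `CacheEntry`, `FileSnapshot`, `getParsed`, and the injected `hash` and `parse` functions are all hypothetical names chosen for illustration; combining both checks is an assumption, not something decided above.

```typescript
// All names here are hypothetical; none are taken from checksync's code.

interface CacheEntry<T> {
    mtimeMs: number; // modification time recorded when the file was parsed
    contentHash: string; // one-way hash of the contents at that time
    parsed: T; // the cached parse result
}

interface FileSnapshot {
    mtimeMs: number;
    readContents: () => string; // only called when the cheap check fails
}

function getParsed<T>(
    cache: Map<string, CacheEntry<T>>,
    filePath: string,
    file: FileSnapshot,
    hash: (contents: string) => string,
    parse: (contents: string) => T,
): { parsed: T; fromCache: boolean } {
    const entry = cache.get(filePath);

    // Cheap check: same modification time means the entry is up to date.
    if (entry !== undefined && entry.mtimeMs === file.mtimeMs) {
        return { parsed: entry.parsed, fromCache: true };
    }

    const contents = file.readContents();
    const contentHash = hash(contents);

    // Fallback: the time changed but the contents did not, so reuse the
    // entry and record the new modification time.
    if (entry !== undefined && entry.contentHash === contentHash) {
        cache.set(filePath, { ...entry, mtimeMs: file.mtimeMs });
        return { parsed: entry.parsed, fromCache: true };
    }

    // Cache miss or stale entry: parse and store the fresh result.
    const parsed = parse(contents);
    cache.set(filePath, { mtimeMs: file.mtimeMs, contentHash, parsed });
    return { parsed, fromCache: false };
}
```

In Node.js, the `hash` argument could wrap `crypto.createHash("sha256")`, and `mtimeMs` could come from `fs.statSync`. They are injected here so the lookup logic stays independent of the file system.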

Other considerations

#887 and cross-repo "local" tags

Having this on-disk cache approach opens the door to making #887 a reality by creating a snapshot of the parsed state of a repo that some other checksync run can reference when validating its own tags.

--fromCache and --outputCache

These options should be deleted when implementing this, since the on-disk cache would remove the need for them.

Forcing cache clearing

There should be a mechanism for explicitly ignoring or clearing the cache. This could be a --clearCache arg and/or an --ignoreCache arg, or a --cache=clear / --cache=ignore style pattern.
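To make the --cache=clear / --cache=ignore pattern concrete, here is a minimal sketch of how such an argument might be read. The `CacheMode` type, the `parseCacheMode` function, and the default `"use"` value are assumptions made for illustration; only `clear` and `ignore` come from the suggestion above.

```typescript
// Hypothetical sketch; "use" as the default mode is an assumption.
type CacheMode = "use" | "clear" | "ignore";

// Reads the cache mode from raw command-line arguments.
// If --cache=... is given more than once, the last occurrence wins.
function parseCacheMode(args: readonly string[]): CacheMode {
    const prefix = "--cache=";
    let mode: CacheMode = "use";
    for (const arg of args) {
        if (arg.indexOf(prefix) !== 0) {
            continue; // not a --cache=... argument
        }
        const value = arg.slice(prefix.length);
        switch (value) {
            case "use":
            case "clear":
            case "ignore":
                mode = value;
                break;
            default:
                throw new Error(`Unknown cache mode: ${value}`);
        }
    }
    return mode;
}
```

A single option with named values keeps the modes mutually exclusive, which two separate flags would not guarantee.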

Error reporting

This could open the door to more expressive errors. For example, some errors may involve more than one line of code. Currently, we only report the line of the tag with the error, but we may want to reference the first tag in a batch of tags, as well as the tag with the error, as in this case. This is technically possible now, but the implementation and architecture don't facilitate it. Any refactoring and redesign done to support a cache could make this easier.

Metadata

Labels

enhancement: New feature or request
