
Skip indexing unchanged files#210

Closed
jenny-codes wants to merge 1 commit into graphite-base/210 from jennyshih/index-new-and-updated

@jenny-codes jenny-codes commented Oct 15, 2025

For #88

See the full work at main.../jennyshih/save-updated-graph

This is a different implementation from #173. The previous version followed this flow:

indexing starts 
-> 1. collect documents from workspace
      1-1 read content from file system
-> 2. *calculate diff against db state*
-> 3. *delete stale entries from db* 
-> 4. index individual files
-> 5. build global graph in memory 
-> 6. *save graph to db*

This flow has a few issues:

  1. The staleness check requires reading every document's source from the filesystem before parallel indexing starts (step 4), which could become a bottleneck down the road.
  2. It attempts to save all entries to the database, including unchanged ones, which is wasteful and a bit awkward.

The new attempt works as follows:

-> 1. collect documents from workspace 
-> 2. index individual files
         2-1 read content from file system
         2-2 calculate diff
         2-3 exit early if same entry exists in db
-> 3. build global graph in memory (merging local index into global)
         3-1 when clearing duplicate uri data from the global index, clear it from the db too (not yet implemented)
         3-2 before adding the new data to memory, write it to the db (not yet implemented)
-> 4. delete the deleted entries from the db (not yet implemented)
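
Step 2 of the new flow can be sketched roughly as below. This is only an illustration, not the code in this PR: the names `content_hash`, `index_file` and `index_workspace` are hypothetical, a SHA-256 digest is assumed as the change check, and a thread pool stands in for the parallel indexing workers.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

# (uri, content hash, extracted tokens); a stand-in for a real local index.
IndexResult = Tuple[str, str, List[str]]


def content_hash(source: bytes) -> str:
    # Assumption: a SHA-256 digest is used to detect changed content.
    return hashlib.sha256(source).hexdigest()


def index_file(
    uri: str,
    read: Callable[[str], bytes],
    known_hashes: Dict[str, str],
) -> Optional[IndexResult]:
    source = read(uri)                   # 2-1 read content inside the worker
    digest = content_hash(source)        # 2-2 calculate diff
    if known_hashes.get(uri) == digest:  # 2-3 exit early if the db has the same entry
        return None
    # Placeholder for the real indexing work: split the source into tokens.
    return (uri, digest, source.decode().split())


def index_workspace(
    uris: List[str],
    read: Callable[[str], bytes],
    known_hashes: Dict[str, str],
) -> List[IndexResult]:
    # known_hashes is loaded from the db once, before the workers start.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda uri: index_file(uri, read, known_hashes), uris)
        return [result for result in results if result is not None]
```

Because the file read happens inside `index_file`, it runs in the workers rather than as a serial step before them, which is the point of moving the staleness check into step 2.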

So for our four kinds of uris:

  1. new uris: they will be saved individually during parallel indexing
  2. updated uris: their data will be deleted from the db and re-inserted during parallel indexing
  3. existing uris: they will be neither re-indexed nor saved to the db again.
  4. deleted uris: they will be cleaned up when we finish the parallel indexing of the present uris.
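
The four categories above can be expressed as a comparison between the workspace and the db state. The sketch below is illustrative only: `UriState` and `classify` are hypothetical names, and both sides are assumed to be maps from uri to a content hash.

```python
from enum import Enum
from typing import Dict


class UriState(Enum):
    NEW = "new"
    UPDATED = "updated"
    EXISTING = "existing"
    DELETED = "deleted"


def classify(workspace: Dict[str, str], db: Dict[str, str]) -> Dict[str, UriState]:
    """Map each uri to its state, given uri -> content hash for both sides."""
    states: Dict[str, UriState] = {}
    for uri, digest in workspace.items():
        if uri not in db:
            states[uri] = UriState.NEW        # saved during parallel indexing
        elif db[uri] != digest:
            states[uri] = UriState.UPDATED    # deleted from the db and re-inserted
        else:
            states[uri] = UriState.EXISTING   # skipped entirely
    for uri in db.keys() - workspace.keys():
        states[uri] = UriState.DELETED        # cleaned up after indexing finishes
    return states
```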

This PR implements the part that looks up the document entries in the db and uses them to tell the indexing worker to skip a file when it has not changed.

PRs to come

  1. Save the new/updated entries to db after indexing
  2. Delete the entries in db that are no longer in the filesystem
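
As a rough idea of what the second follow-up could look like, here is a sketch of the cleanup step. It is not taken from this stack: SQLite, the `documents(uri, content_hash)` table and the function name `delete_stale_documents` are all assumptions made for illustration.

```python
import sqlite3
from typing import Iterable, List


def delete_stale_documents(conn: sqlite3.Connection, present_uris: Iterable[str]) -> List[str]:
    """Delete db rows whose uri is no longer present in the workspace.

    Assumes a hypothetical table: documents(uri TEXT PRIMARY KEY, content_hash TEXT).
    Returns the uris that were removed, sorted for deterministic output.
    """
    stored = {row[0] for row in conn.execute("SELECT uri FROM documents")}
    stale = sorted(stored - set(present_uris))
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "DELETE FROM documents WHERE uri = ?",
            [(uri,) for uri in stale],
        )
    return stale
```

Running this after parallel indexing finishes matches the ordering described for deleted uris: the set of present uris is already known at that point, so only one extra query against the db is needed.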

Benchmark

When we run the indexer twice on the same repo, skipping the unchanged files reduces the db save time to roughly 0 on the second run, but it does not meaningfully change the indexing time (2.439s before, 2.300s after).

Before

Query statistics

Total declarations:         684127
With documentation:         71130 (10.4%)
Without documentation:      612997 (89.6%)
Total documentation size:   211400 bytes
Multi-definition names:     14691 (2.1%)

Definition breakdown:
  Method               315306
  InstanceVariable     178723
  Module               175442
  Class                100527
  Constant              60735
  AttrReader            31333
  AttrAccessor          11458
  AttrWriter              370
  GlobalVariable           36
  ClassVariable            11

Timing breakdown
  Initialization      0.001s (  0.0%)
  Listing             1.440s ( 11.6%)
  Indexing            2.439s ( 19.7%)
  Querying            0.223s (  1.8%)
  Database            8.276s ( 66.9%)
  Cleanup             0.000s (  0.0%)
  Total:             12.380s

Maximum RSS: 657997824 bytes (627.52 MB)

Indexed 85386 files
Found 684127 names
Found 873941 definitions
Found 85386 URIs

After

Query statistics

Total declarations:         1
With documentation:         0 (0.0%)
Without documentation:      1 (100.0%)
Total documentation size:   0 bytes
Multi-definition names:     0 (0.0%)

Definition breakdown:

Timing breakdown
  Initialization      0.002s (  0.0%)
  Listing             3.318s ( 58.9%)
  Indexing            2.300s ( 40.9%) 
  Querying            0.000s (  0.0%)
  Database            0.009s (  0.2%) <- this is for the querying
  Cleanup             0.000s (  0.0%)
  Total:              5.630s

Maximum RSS: 65519616 bytes (62.48 MB)

Indexed 0 files
Found 1 names
Found 0 definitions
Found 0 URIs
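
To make the comparison easier to read, the two timing breakdowns above can be diffed phase by phase. The numbers below are copied from the output above; the variable names are just for illustration.

```python
# Timings in seconds, copied from the "Before" and "After" breakdowns.
before = {
    "Initialization": 0.001,
    "Listing": 1.440,
    "Indexing": 2.439,
    "Querying": 0.223,
    "Database": 8.276,
    "Cleanup": 0.000,
}
after = {
    "Initialization": 0.002,
    "Listing": 3.318,
    "Indexing": 2.300,
    "Querying": 0.000,
    "Database": 0.009,
    "Cleanup": 0.000,
}
total_before, total_after = 12.380, 5.630  # reported totals

# Per-phase change (negative means the phase got faster).
deltas = {phase: round(after[phase] - before[phase], 3) for phase in before}
saved = round(total_before - total_after, 3)
saved_pct = round(100 * saved / total_before, 1)

for phase, delta in deltas.items():
    print(f"{phase:<16}{delta:+.3f}s")
print(f"Total saved: {saved:.3f}s ({saved_pct}%)")
```

The total drops by 6.750s (54.5%). The Database phase alone drops by 8.267s, but Listing grows by 1.878s in the second run, which is why the overall saving is smaller than the db saving.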


jenny-codes commented Oct 15, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@jenny-codes jenny-codes requested a review from a team October 15, 2025 20:59
@jenny-codes jenny-codes force-pushed the jennyshih/content-hash-struct branch from 3bc257e to f6036b8 Compare October 15, 2025 21:05
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from 85d09ee to 6b0ba47 Compare October 15, 2025 21:05
@jenny-codes jenny-codes force-pushed the jennyshih/content-hash-struct branch from f6036b8 to a0fa5a7 Compare October 15, 2025 21:11
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch 2 times, most recently from 3d94630 to 270cfeb Compare October 17, 2025 14:12
@jenny-codes jenny-codes force-pushed the jennyshih/content-hash-struct branch 4 times, most recently from 0bf5aa2 to 6559a40 Compare October 17, 2025 16:03
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from 270cfeb to 06357d1 Compare October 17, 2025 16:04
@jenny-codes jenny-codes force-pushed the jennyshih/content-hash-struct branch from 6559a40 to 810f92f Compare October 17, 2025 17:00
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from 06357d1 to 47005ee Compare October 17, 2025 17:00
@jenny-codes jenny-codes force-pushed the jennyshih/content-hash-struct branch from 810f92f to e91ac9e Compare October 17, 2025 17:32
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch 2 times, most recently from ccfc0c2 to 4a6a724 Compare October 17, 2025 19:00
@jenny-codes jenny-codes force-pushed the jennyshih/content-hash-struct branch from e91ac9e to bf9dfb0 Compare October 17, 2025 20:08
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch 2 times, most recently from 0e4e1e0 to 6bb1350 Compare October 17, 2025 20:11
@jenny-codes jenny-codes force-pushed the jennyshih/content-hash-struct branch from bf9dfb0 to 51b2896 Compare October 17, 2025 20:11
@jenny-codes jenny-codes changed the title Sync with db during local file indexing Skip indexing unchanged files Oct 17, 2025
@jenny-codes jenny-codes marked this pull request as ready for review October 17, 2025 20:20
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from 6bb1350 to 043ee18 Compare October 17, 2025 20:24
@jenny-codes jenny-codes force-pushed the jennyshih/content-hash-struct branch from 51b2896 to 5c5c994 Compare October 20, 2025 17:33
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from 043ee18 to a487d33 Compare October 20, 2025 17:33
@jenny-codes jenny-codes marked this pull request as draft October 20, 2025 18:34
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from 2d7510b to 7688c0a Compare October 29, 2025 23:18
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from 7688c0a to 9f8eefa Compare October 30, 2025 15:02
@jenny-codes jenny-codes requested a review from Morriar October 30, 2025 15:03
@jenny-codes jenny-codes changed the base branch from main to graphite-base/210 October 30, 2025 15:47
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from 9f8eefa to fbfd830 Compare October 30, 2025 15:47
@jenny-codes jenny-codes changed the base branch from graphite-base/210 to jennyshih/create-index-result October 30, 2025 15:48
@jenny-codes jenny-codes force-pushed the jennyshih/create-index-result branch from 63950f8 to 943e408 Compare October 30, 2025 16:23
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from fbfd830 to 1a0b597 Compare October 30, 2025 16:23
@jenny-codes jenny-codes force-pushed the jennyshih/create-index-result branch from 943e408 to 5957ec7 Compare October 30, 2025 21:25
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch 2 times, most recently from 0134e3c to e1541ac Compare October 31, 2025 14:24
@jenny-codes jenny-codes force-pushed the jennyshih/create-index-result branch from 5957ec7 to 4fd17e1 Compare October 31, 2025 14:24
@jenny-codes jenny-codes changed the base branch from jennyshih/create-index-result to graphite-base/210 October 31, 2025 16:25
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from e1541ac to 5089b00 Compare October 31, 2025 16:25
@jenny-codes jenny-codes changed the base branch from graphite-base/210 to jennyshih/multiple-error October 31, 2025 16:26
@jenny-codes jenny-codes force-pushed the jennyshih/multiple-error branch from 10bef90 to d55b7d1 Compare October 31, 2025 17:14
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from 5089b00 to 00c882a Compare October 31, 2025 17:14
@jenny-codes jenny-codes force-pushed the jennyshih/multiple-error branch 2 times, most recently from 548ce91 to 87b5c3a Compare October 31, 2025 17:25
@jenny-codes jenny-codes force-pushed the jennyshih/index-new-and-updated branch from 00c882a to 4bcd1de Compare October 31, 2025 17:25
@jenny-codes jenny-codes changed the base branch from jennyshih/multiple-error to graphite-base/210 October 31, 2025 19:49

3 participants