pdpv0: mark specific pieceref for indexing instead of all refs with same piece_cid #1017
Merged
ZenGround0 merged 7 commits into filecoin-project:pdpv0 on Feb 18, 2026
…h same piece_cid

EnableIndexingForPiecesInTx was matching by piece_cid, marking ALL pdp_piecerefs rows that share a CID as needs_indexing=TRUE. Since the HarmonyTask scheduler creates one indexing task per row, a piece uploaded N times generates N tasks on every AddPieces call regardless of whether the content is already indexed.

The specific ref ID for the current AddPieces call is already available in subPieceInfoMap[cid].PDPPieceRefID at both call sites. This commit threads that ID through to EnableIndexingForPiecesInTx instead of the piece CID, reducing task volume by ~98% on affected nodes.
ZenGround0
approved these changes
Feb 18, 2026
Collaborator
ZenGround0
left a comment
Great find, what a time to be alive with AI models outputting stuff this useful!
All the comments are nice to have but you can merge as is. I can also take over if it's too much of a pain. But since you're using Claude, you can probably just give it the URL of this PR and it will respond correctly.
Clarify the comment about the function's parameters and behavior.
Removed redundant comment about duplicate uploads.
Contributor
Author
Thanks for your feedback and suggestions, @ZenGround0! Incredibly useful information for me! 🙏 I have made the changes you suggested. Would be very grateful for a quick re-review when you have a chance.
ZenGround0
approved these changes
Feb 18, 2026
Summary
`EnableIndexingForPiecesInTx` marks all `pdp_piecerefs` rows sharing a `piece_cid` as `needs_indexing = TRUE`, rather than only the specific ref used in the current `AddPieces` call. Since the HarmonyTask scheduler creates one indexing task per row, a piece uploaded N times generates N tasks on every `AddPieces` call, regardless of whether the content is already indexed.

Observed Impact (mainnet, as of 2026-02-17)
The problem compounds over time: every new upload creates another ref row, so each subsequent `AddPieces` call triggers yet more tasks. Task count grows without bound as refs accumulate.

Root Cause
`EnableIndexingForPiecesInTx` (`pdp/indexing.go`) matches by `piece_cid`, marking all N refs when only the one specific ref used in this add should be marked.

When the N−1 redundant tasks run, they exit immediately via `CheckHasPiece` (`task_pdp_indexing.go:73`) because the content is already indexed, but each still consumes a scheduler slot.

The specific ref ID for the current `AddPieces` call is already available in `subPieceInfoMap[cid].PDPPieceRefID` at both call sites; it is the same value already written to `pdp_data_set_piece_adds` by `insertPieceAdds`. This fix threads that ID through to `EnableIndexingForPiecesInTx` instead of the CID.

Client-Side Evidence
The issue was surfaced by a single client uploading 4 static web app files (the OnSui ENS/SuiNS resolver by HAPPYS1NGH, source: github.com/HAPPYS1NGH/suins-ens-gateway) to dataset 64 at high frequency. All 4 files contain identical content across uploads and resolve to the same IPFS CIDs:

| Piece CID | IPFS CID |
| --- | --- |
| `baga6ea4seaqe7od536xh4nmoh3jvznidlfrh5n76uvl5webnjsw6iesst3ungci` | `bafybeihpn2wuwqddd6gmn42bx7ugmr7jcsjo7bjcopeujbj32iemoz2rye` |
| `baga6ea4seaqc6cbmpqplutnbwoagjkwarcid52ii63vgy4mxoz43by7sq3wbikq` | `bafybeicmewaqj4zwv52p35aoxpjsleufcbsvnsnctclk3sk7eepdjagvdq` |
| `baga6ea4seaqpjn6dlpcelwwrrctqkphvnd7pyfhy6hzw3v3kidc2qjy5c7ggiaa` | `bafybeihg5ue37cqldodnmzzqhnkxlnnqsrpe5wdmbzudbi6psgtixknp7m` |
| `baga6ea4seaqfdki5ebwjk3bemm5b4otrcfa3gmxp7dnnwnjasi6pwomwyhxbcpy` | `bafybeiehmlrjepxiy4vnrhfu7hzihdvhggn54tugglogi2dosdmyn7kovm` |

All 4 files have been continuously uploaded since 2026-02-08. The client's deployment tooling re-uploads the same content approximately hourly and issues a fresh `AddPieces` call to the dataset roughly every 7 minutes. This is not itself incorrect behaviour; the server-side bug is what turns each `AddPieces` call into hundreds of redundant tasks.

These 4 pieces alone account for 179,098 indexing tasks per day (over 98% of the node's total indexing load) from content that is fully indexed after the very first task per piece.
Why This Is Safe
- Internal consistency: `PDPPieceRefID` in `subPieceInfoMap` is the exact same ref ID already written to `pdp_data_set_piece_adds.pdp_pieceref` by `insertPieceAdds`. The fix does not change what gets indexed, only which row triggers it.
- One flag is sufficient: `CheckHasPiece` (early exit in `task_pdp_indexing.go:73`) is keyed by `piece_cid`, not ref ID. Once any ref's task indexes the content, all refs for that `piece_cid` benefit; subsequent tasks exit immediately.
- Only one setter: `needs_indexing = TRUE` is set in exactly one place in the codebase (`indexing.go:136`). There are exactly two callers of `EnableIndexingForPiecesInTx`, both updated in this PR.
- IPNI unaffected: `needs_ipni` is set per-ref by the indexing task after actual indexing work completes, not by this function.

Expected Outcome After Fix
~98% reduction in daily indexing task volume. After this fix, accumulated ref count is irrelevant: task count equals the number of `AddPieces` calls only, regardless of how many times a piece was previously uploaded.

Optional: Immediate Database Relief for Affected Operators
Orphaned refs (`data_set_refcount = 0`, `needs_indexing = FALSE`) can be safely removed. They have no entries in `pdp_data_set_pieces`, no pending adds in `pdp_data_set_piece_adds`, and their content is already indexed.