pdpv0: mark specific pieceref for indexing instead of all refs with same piece_cid#1017

Merged
ZenGround0 merged 7 commits into filecoin-project:pdpv0 from TippyFlitsUK:pdpv0 on Feb 18, 2026
@TippyFlitsUK TippyFlitsUK commented Feb 17, 2026

Summary

EnableIndexingForPiecesInTx marks all pdp_piecerefs rows sharing a piece_cid as needs_indexing = TRUE, rather than only the specific ref used in the current AddPieces call. Since the HarmonyTask scheduler creates one indexing task per row, a piece uploaded N times generates N tasks on every AddPieces call — regardless of whether the content is already indexed.

Observed Impact (mainnet, as of 2026-02-17)

| Metric | Value |
| --- | --- |
| Indexing tasks per day | ~160,000 |
| Expected without bug (1 task per add) | ~3,500 |
| Multiplication factor | ~46× |
| Share attributable to 4 high-duplication pieces | 98.4% |

The problem compounds over time: every new upload creates another ref row, so each subsequent AddPieces call triggers yet more tasks. Task count grows without bound as refs accumulate.
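To make the compounding concrete, here is a minimal sketch of the growth pattern. It assumes, purely for illustration, that each upload is followed by exactly one AddPieces call; the function names are hypothetical and do not exist in the codebase.

```go
package main

import "fmt"

// tasksForAdd returns how many indexing tasks one AddPieces call creates
// for a piece that currently has refCount rows in pdp_piecerefs.
func tasksForAdd(refCount int, markAllRefs bool) int {
	if markAllRefs {
		return refCount // pre-fix: every ref sharing the piece_cid is flagged
	}
	return 1 // post-fix: only the ref used by this add is flagged
}

// cumulativeTasks sums the tasks created over a number of upload rounds,
// assuming (hypothetically) one AddPieces call after each upload.
func cumulativeTasks(uploads int, markAllRefs bool) int {
	total := 0
	for refs := 1; refs <= uploads; refs++ {
		total += tasksForAdd(refs, markAllRefs)
	}
	return total
}

func main() {
	for _, n := range []int{1, 10, 100, 226} {
		fmt.Printf("uploads=%d pre-fix=%d post-fix=%d\n",
			n, cumulativeTasks(n, true), cumulativeTasks(n, false))
	}
}
```

Under this assumption the pre-fix total grows quadratically (n(n+1)/2), while the post-fix total grows linearly with the number of adds.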


Root Cause

EnableIndexingForPiecesInTx (pdp/indexing.go) matches by piece_cid, marking all N refs when only the one specific ref used in this add should be marked.

When N−1 redundant tasks run, they exit immediately via CheckHasPiece (task_pdp_indexing.go:73) because the content is already indexed — but each still consumes a scheduler slot.

The specific ref ID for the current AddPieces call is already available in subPieceInfoMap[cid].PDPPieceRefID at both call sites — the same value already written to pdp_data_set_piece_adds by insertPieceAdds. This fix threads that ID through to EnableIndexingForPiecesInTx instead of the CID.
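The difference between the two matching strategies can be sketched with an in-memory model. This is not the code in pdp/indexing.go: the struct and helper names below are hypothetical stand-ins for pdp_piecerefs rows and for the update the function performs.

```go
package main

import "fmt"

// pieceRef is a simplified stand-in for one pdp_piecerefs row.
type pieceRef struct {
	ID            int64
	PieceCID      string
	NeedsIndexing bool
}

// newRefs builds n refs with IDs 1..n that all share the same piece CID.
func newRefs(n int, cid string) []pieceRef {
	refs := make([]pieceRef, n)
	for i := range refs {
		refs[i] = pieceRef{ID: int64(i + 1), PieceCID: cid}
	}
	return refs
}

// markByPieceCID models the pre-fix behaviour: flag every ref sharing the CID.
// It returns the number of rows flagged (one indexing task per flagged row).
func markByPieceCID(refs []pieceRef, cid string) int {
	n := 0
	for i := range refs {
		if refs[i].PieceCID == cid {
			refs[i].NeedsIndexing = true
			n++
		}
	}
	return n
}

// markByRefID models the post-fix behaviour: flag only the ref used by this add.
func markByRefID(refs []pieceRef, id int64) int {
	n := 0
	for i := range refs {
		if refs[i].ID == id {
			refs[i].NeedsIndexing = true
			n++
		}
	}
	return n
}

func main() {
	const cid = "example-piece-cid" // placeholder, not a real CommP
	fmt.Println("pre-fix rows flagged: ", markByPieceCID(newRefs(226, cid), cid))
	fmt.Println("post-fix rows flagged:", markByRefID(newRefs(226, cid), 226))
}
```

With 226 refs for one piece, the CID match flags all 226 rows while the ref ID match flags exactly one.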

Note on existing comment (indexing.go:132–133): The comment acknowledged that duplicate pieces might be re-marked and noted the task handles this "smoothly". This is true for occasional duplicates but fails at scale.


Client-Side Evidence

The issue was surfaced by a single client uploading 4 static web app files (the OnSui ENS/SuiNS resolver by HAPPYS1NGH, source: github.com/HAPPYS1NGH/suins-ens-gateway) to dataset 64 at high frequency. All 4 files contain identical content across uploads and resolve to the same IPFS CIDs:

| CommP (piece CID) | IPFS CID | Raw Size | Refs | Adds/Day | Tasks/Day (bug) | Tasks/Day (fixed) |
| --- | --- | --- | --- | --- | --- | --- |
| baga6ea4seaqe7od536xh4nmoh3jvznidlfrh5n76uvl5webnjsw6iesst3ungci | bafybeihpn2wuwqddd6gmn42bx7ugmr7jcsjo7bjcopeujbj32iemoz2rye | 300,290 b | 226 | 221 | 49,946 | 221 |
| baga6ea4seaqc6cbmpqplutnbwoagjkwarcid52ii63vgy4mxoz43by7sq3wbikq | bafybeicmewaqj4zwv52p35aoxpjsleufcbsvnsnctclk3sk7eepdjagvdq | 300,451 b | 226 | 218 | 49,268 | 218 |
| baga6ea4seaqpjn6dlpcelwwrrctqkphvnd7pyfhy6hzw3v3kidc2qjy5c7ggiaa | bafybeihg5ue37cqldodnmzzqhnkxlnnqsrpe5wdmbzudbi6psgtixknp7m | 298,112 b | 224 | 212 | 47,488 | 212 |
| baga6ea4seaqfdki5ebwjk3bemm5b4otrcfa3gmxp7dnnwnjasi6pwomwyhxbcpy | bafybeiehmlrjepxiy4vnrhfu7hzihdvhggn54tugglogi2dosdmyn7kovm | 297,956 b | 182 | 178 | 32,396 | 178 |
| Total | | | | | 179,098 | 829 |
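The Tasks/Day columns follow from the Refs and Adds/Day columns: with the bug, each add creates one task per existing ref (Refs × Adds/Day); with the fix, each add creates one task. The sketch below reproduces the totals from the table; the type and function names are illustrative only.

```go
package main

import "fmt"

// pieceStats holds the Refs and Adds/Day figures for one row of the table.
type pieceStats struct {
	Refs       int
	AddsPerDay int
}

// observed lists the four high-duplication pieces in table order.
var observed = []pieceStats{
	{Refs: 226, AddsPerDay: 221},
	{Refs: 226, AddsPerDay: 218},
	{Refs: 224, AddsPerDay: 212},
	{Refs: 182, AddsPerDay: 178},
}

// tasksPerDay returns the daily task totals with the bug (one task per ref
// per add) and with the fix (one task per add).
func tasksPerDay(rows []pieceStats) (bug, fixed int) {
	for _, r := range rows {
		bug += r.Refs * r.AddsPerDay
		fixed += r.AddsPerDay
	}
	return bug, fixed
}

func main() {
	bug, fixed := tasksPerDay(observed)
	fmt.Println("tasks/day (bug):  ", bug)   // 179098
	fmt.Println("tasks/day (fixed):", fixed) // 829
}
```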

All 4 files have been continuously uploaded since 2026-02-08. The client's deployment tooling re-uploads the same content approximately hourly and issues a fresh AddPieces call to the dataset roughly every 7 minutes. This is not itself incorrect behaviour — the server-side bug is what turns each AddPieces call into hundreds of redundant tasks.

By this estimate (Refs × Adds/Day per piece), these 4 pieces alone account for 179,098 indexing tasks per day, over 98% of the node's total indexing load, from content that is fully indexed after the very first task per piece.


Why This Is Safe

  • Internal consistency: PDPPieceRefID in subPieceInfoMap is the exact same ref ID already written to pdp_data_set_piece_adds.pdp_pieceref by insertPieceAdds. The fix does not change what gets indexed — only which row triggers it.

  • One flag is sufficient: CheckHasPiece (early-exit in task_pdp_indexing.go:73) is keyed by piece_cid, not ref ID. Once any ref's task indexes the content, all refs for that piece_cid benefit — subsequent tasks exit immediately.

  • Only one setter: needs_indexing = TRUE is set in exactly one place in the codebase (indexing.go:136). There are exactly two callers of EnableIndexingForPiecesInTx — both updated in this PR.

  • IPNI unaffected: needs_ipni is set per-ref by the indexing task after actual indexing work completes, not by this function.
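The "one flag is sufficient" argument can be illustrated with a small model of the early exit. This is a sketch under assumptions: contentIndex and runIndexingTask are hypothetical stand-ins, not the real CheckHasPiece or task implementation; the only property carried over from the description above is that the check is keyed by piece_cid.

```go
package main

import "fmt"

// contentIndex is a hypothetical stand-in for whatever CheckHasPiece consults.
// It is keyed by piece CID, not by ref ID.
type contentIndex map[string]bool

// runIndexingTask models one indexing task. It reports whether the task did
// real indexing work (true) or exited early because the piece was already
// indexed (false).
func runIndexingTask(idx contentIndex, pieceCID string) bool {
	if idx[pieceCID] {
		return false // early exit: content already indexed under this CID
	}
	idx[pieceCID] = true // stand-in for the actual indexing work
	return true
}

// runTasks runs n tasks for the same piece CID and counts the outcomes.
func runTasks(idx contentIndex, pieceCID string, n int) (worked, skipped int) {
	for i := 0; i < n; i++ {
		if runIndexingTask(idx, pieceCID) {
			worked++
		} else {
			skipped++
		}
	}
	return worked, skipped
}

func main() {
	idx := contentIndex{}
	worked, skipped := runTasks(idx, "example-piece-cid", 226)
	fmt.Printf("worked=%d skipped=%d\n", worked, skipped)
}
```

In this model only the first task does work; the remaining tasks are pure scheduler overhead, which is why flagging a single ref loses nothing.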


Expected Outcome After Fix

~98% reduction in daily indexing task volume. After this fix, accumulated ref count is irrelevant — task count equals the number of AddPieces calls only, regardless of how many times a piece was previously uploaded.
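As a rough check of the ~98% figure, the following sketch computes the reduction from the approximate values in the Observed Impact table (~160,000 tasks per day observed, ~3,500 expected). The inputs are the rounded numbers quoted above, so the result is only indicative.

```go
package main

import "fmt"

// reduction returns the fractional drop in task volume going from
// `before` tasks per day to `after` tasks per day.
func reduction(before, after float64) float64 {
	return 1 - after/before
}

func main() {
	const observed = 160000.0 // approximate daily tasks with the bug
	const expected = 3500.0   // approximate daily tasks at one task per add

	fmt.Printf("multiplication factor: %.1fx\n", observed/expected)
	fmt.Printf("reduction: %.1f%%\n", 100*reduction(observed, expected))
}
```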


Optional: Immediate Database Relief for Affected Operators

Orphaned refs (data_set_refcount = 0, needs_indexing = FALSE) can be safely removed. They have no entries in pdp_data_set_pieces, no pending adds in pdp_data_set_piece_adds, and their content is already indexed.

-- Verify candidates before running
SELECT piece_cid, COUNT(*) AS deletable_refs
FROM pdp_piecerefs
WHERE data_set_refcount = 0
  AND needs_indexing = FALSE
GROUP BY piece_cid
HAVING COUNT(*) > 1
ORDER BY deletable_refs DESC
LIMIT 20;

-- Remove orphaned refs
DELETE FROM pdp_piecerefs
WHERE data_set_refcount = 0
  AND needs_indexing = FALSE;

…h same piece_cid

EnableIndexingForPiecesInTx was matching by piece_cid, marking ALL pdp_piecerefs
rows that share a CID as needs_indexing=TRUE. Since the HarmonyTask scheduler
creates one indexing task per row, a piece uploaded N times generates N tasks on
every AddPieces call regardless of whether the content is already indexed.

The specific ref ID for the current AddPieces call is already available in
subPieceInfoMap[cid].PDPPieceRefID at both call sites. This commit threads that
ID through to EnableIndexingForPiecesInTx instead of the piece CID, reducing
task volume by ~98% on affected nodes.
@TippyFlitsUK TippyFlitsUK changed the title from "fix(pdp): mark specific pieceref for indexing instead of all refs with same piece_cid" to "pdpv0: mark specific pieceref for indexing instead of all refs with same piece_cid" Feb 17, 2026
@github-actions github-actions bot added the team/fs-wg Items being worked on or tracked by the "FS Working Group". See FilOzone/github-mgmt #10 label Feb 17, 2026
@ZenGround0 ZenGround0 left a comment

Great find, what a time to be alive with AI models outputting stuff this useful!
All the comments are nice-to-haves, but you can merge as is. I can also take over if it's too much of a pain. But since you're using Claude, you can probably just give it the URL to this PR and it will respond correctly.

@rjan90 rjan90 added this to the M4.1: mainnet ready milestone Feb 18, 2026
@rjan90 rjan90 linked an issue Feb 18, 2026 that may be closed by this pull request
@rjan90 rjan90 added this to FOC Feb 18, 2026
@rjan90 rjan90 moved this to 📌 Triage in FOC Feb 18, 2026
@github-project-automation github-project-automation bot moved this from 📌 Triage to ✔️ Approved by reviewer in FOC Feb 18, 2026
@TippyFlitsUK
Thanks for your feedback and suggestions, @ZenGround0! Incredibly useful information for me! 🙏

I have made the changes you suggested. Would be very grateful for a quick re-review when you have a chance.

@ZenGround0 ZenGround0 merged commit cce1732 into filecoin-project:pdpv0 Feb 18, 2026
15 checks passed
@github-project-automation github-project-automation bot moved this from ✔️ Approved by reviewer to 🎉 Done in FOC Feb 18, 2026
@TippyFlitsUK TippyFlitsUK deleted the pdpv0 branch February 18, 2026 17:52
Development

Successfully merging this pull request may close these issues.

pdpv0: investigate large indexing backlog anecdote
