# Walking Massive Graphs with BloomFilter-Based Tracking

When working with massive graphs—such as web link graphs, social networks, or distributed dependency graphs—the cost of tracking visited nodes can dominate the memory budget. A traditional `Set<T>` may simply not fit all nodes in memory.

In scenarios where the graph is discovered lazily (e.g. via RPC or streaming edge discovery), a Bloom filter offers an appealing trade-off: sublinear memory in exchange for a controlled false-positive rate.

`Walker` supports a custom node tracker, so this can be achieved easily with a Guava `BloomFilter`:
| 8 | + |
| 9 | +```java |
| 10 | +BloomFilter visited = ...; |
| 11 | +Walker<WebPage> walker = |
| 12 | + Walker.inGraph( |
| 13 | + webPage -> webPage.exploreLinks(), |
| 14 | + page -> visited.mightContain(page.url())); |
| 15 | +``` |

A Bloom filter never reports false negatives, so an infinite loop is impossible; its false positives, however, may cause some nodes to be discarded with low probability.
| 19 | + |
| 20 | +## The Problem with False Positives |
| 21 | + |
| 22 | +Bloom filters occasionally return "seen" for a node that was never actually visited. |
| 23 | +In the context of graph traversal, this can cause a issue: |
| 24 | +**entire subgraphs may become permanently unreachable** if the only path to them is mistakenly pruned. |
| 25 | + |
| 26 | +This is especially problematic during cold start, when the Bloom filter is relatively sparse but already |
| 27 | +begins to emit false positives as nodes accumulate. |
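
Guava's `BloomFilter` exposes `expectedFpp()`, which makes this accumulation effect easy to observe. Below is an illustrative sketch (the 1000-node sizing and 1% target rate are assumptions, not values from this article):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

/** Illustrative: the expected false-positive rate climbs as nodes accumulate. */
class FppDemo {
  static double fppAfter(int insertions) {
    // Sized for 1000 insertions at a 1% target false-positive rate.
    BloomFilter<String> filter =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1000, 0.01);
    for (int i = 0; i < insertions; i++) {
      filter.put("node-" + i);
    }
    return filter.expectedFpp();
  }
}
```

An empty filter reports an expected false-positive probability of zero; as insertions approach the sizing target, the rate approaches the configured 1%.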

Can we mitigate it?

## Strategy: Probabilistically Trust Bloom

To mitigate this, we adopt a hybrid approach:

- Use a **Bloom filter** as the primary visited structure.
- Augment it with a **`Set<T>`** that tracks a limited number of nodes during cold start.
- Define a **minimum threshold** (`minConfirmSize`) before trusting the Bloom filter alone.

### Lifecycle

1. **Cold Start Phase**:
   - During the first N visits (`confirmed.size() < minConfirmSize`),
     every node is added to both the `confirmed` set and the Bloom filter.
   - The Bloom filter is not yet trusted.

2. **Steady-State Phase**:
   - On visiting a node:
     - If Bloom says it hasn't seen the node, visit it.
     - If Bloom says it has seen the node, roll a die to decide whether to trust Bloom:
       - If we trust Bloom, prune the node.
       - Otherwise, visit it.

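The lifecycle above can be sketched as a tracker predicate. This is an illustrative sketch, not the library's API: `HybridTracker`, `minConfirmSize`, and `trustProbability` are hypothetical names, and string node keys are assumed:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical hybrid tracker: exact Set during cold start, Bloom filter afterwards. */
final class HybridTracker implements Predicate<String> {
  private final BloomFilter<String> bloom;
  private final Set<String> confirmed = new HashSet<>();
  private final int minConfirmSize;
  private final double trustProbability;
  private final Random random;

  HybridTracker(int expectedNodes, int minConfirmSize, double trustProbability, Random random) {
    this.bloom =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), expectedNodes);
    this.minConfirmSize = minConfirmSize;
    this.trustProbability = trustProbability;
    this.random = random;
  }

  /** Returns true if {@code node} should be visited. */
  @Override public boolean test(String node) {
    if (confirmed.size() < minConfirmSize) {
      // Cold start: the exact set is authoritative; Bloom is only being warmed up.
      bloom.put(node);
      return confirmed.add(node);
    }
    if (bloom.put(node)) {
      return true;  // bits changed: definitely not seen before
    }
    // Bloom claims "seen": trust it (prune) only with the configured probability.
    return random.nextDouble() >= trustProbability;
  }
}
```

It could then be plugged in as `Walker.inGraph(webPage -> webPage.exploreLinks(), page -> tracker.test(page.url()))`, under the same tracker contract as the earlier example.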

By configuring the trust probability (say 50%), we achieve a few things:

1. A frequently reached node is unlikely to be pruned by a false positive, since every extra encounter is another chance to visit it.
2. Even if a rarely reached node is falsely pruned in this run, it is not lost forever: the next run of the program gets a fresh chance.
3. The lower we set the trust probability, the more redundant re-processing we must tolerate in exchange for this safety.
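
To make the first point concrete: if each of `k` encounters with a node is independently a Bloom false positive and we trust Bloom with probability `p` each time, the node stays unvisited only with probability `p^k`. A tiny hypothetical helper (names are mine, not from the article):

```java
/** Illustrative: probability a node hit by k false positives is never visited. */
class PruneOdds {
  static double neverVisited(double trustProbability, int encounters) {
    // The node stays unvisited only if every independent roll trusts Bloom.
    return Math.pow(trustProbability, encounters);
  }
}
```

At 50% trust, a node encountered three times survives unvisited with probability 0.125, and the odds vanish quickly as encounters grow.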

In massive, lazily discovered graphs, this strategy balances correctness with scalability—and gives every path a second chance to succeed.