Commit 0ec9c29 (walking_massive_graph), parent 532fb4f. 1 file changed: +59, -0 lines.

# Walking Massive Graphs with BloomFilter-Based Tracking

When working with massive graphs—such as web link graphs, social networks, or distributed dependency graphs—the cost of tracking visited nodes can dominate the memory budget. A traditional `Set<T>` may simply not be able to hold all nodes in memory.

In scenarios where the graph is discovered lazily (e.g. via RPC or streaming edge discovery), a Bloom filter offers an appealing trade-off: sublinear memory in exchange for a controlled false positive rate.


`Walker` supports a custom node tracker, so this can be achieved easily with Guava's `BloomFilter` (the tracker behaves like `Set::add`: it returns true the first time a node is seen, telling the walker to traverse it):

```java
BloomFilter<String> visited = ...;
Walker<WebPage> walker =
    Walker.inGraph(
        webPage -> webPage.exploreLinks(),
        page -> {
          String url = page.url();
          if (visited.mightContain(url)) return false;  // probably seen: prune
          visited.put(url);  // record it so future encounters are recognized
          return true;       // first time seen: traverse
        });
```

A Bloom filter never produces false negatives, so an infinite loop is impossible; its false positives, however, may cause some nodes to be discarded with low probability.

## The Problem with False Positives

Bloom filters occasionally return "seen" for a node that was never actually visited. In the context of graph traversal, this causes a serious issue: **entire subgraphs may become permanently unreachable** if the only path to them is mistakenly pruned.

This is especially problematic during cold start: the Bloom filter is still relatively sparse, yet as nodes accumulate it already begins to emit occasional false positives, and an early false positive can cut off a large part of the graph.

Can we mitigate this?

## Strategy: Probabilistically Trust Bloom

To mitigate this, we adopt a hybrid approach:

- Use a **Bloom filter** as the primary visited structure.
- Augment it with a **`Set<T>`** that tracks a limited number of nodes during cold start.
- Define a **minimum threshold** (`minConfirmSize`) that must be reached before trusting the Bloom filter alone.

### Lifecycle
40+
41+
1. **Cold Start Phase**:
42+
- During the first N visits (`confirmed.size() < minConfirmSize`),
43+
every node is added to both the `confirmed` set and the Bloom filter.
44+
- The Bloom filter is not yet trusted.
45+
46+
2. **Steady-State Phase**:
47+
- On visiting a node:
48+
- If Bloom says it hasn't seen the node, visit it.
49+
- If Bloom says it's seen the node, roll a dice to determine if we should trust Bloom.
50+
- If we trust Bloom, prune the node
51+
- Otherwise visit it
52+
53+
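The lifecycle above can be sketched as a self-contained tracker. This is a minimal illustration, not the library's API: it substitutes a tiny `BitSet`-based Bloom filter (two hash functions, fixed size) for Guava's `BloomFilter`, and the class and parameter names (`HybridTracker`, `distrustProbability`) are made up for the example. Its `test` method follows the `Set::add` convention: it returns true when the node should be visited.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical hybrid tracker: exact Set during cold start, Bloom filter afterwards.
class HybridTracker<N> {
  private static final int BITS = 1 << 20;        // stand-in Bloom filter capacity
  private final BitSet bloom = new BitSet(BITS);
  private final Set<N> confirmed = new HashSet<>();
  private final int minConfirmSize;
  private final double distrustProbability;       // chance to visit despite "seen"
  private final Random random = new Random();

  HybridTracker(int minConfirmSize, double distrustProbability) {
    this.minConfirmSize = minConfirmSize;
    this.distrustProbability = distrustProbability;
  }

  /** Returns true if {@code node} should be visited (tracked for the first time). */
  boolean test(N node) {
    if (confirmed.size() < minConfirmSize) {      // cold start: exact tracking
      boolean firstVisit = confirmed.add(node);
      if (firstVisit) put(node);                  // keep the Bloom filter warm
      return firstVisit;
    }
    if (!mightContain(node)) {                    // definitely unseen: visit
      put(node);
      return true;
    }
    if (random.nextDouble() < distrustProbability) {
      put(node);                                  // roll says distrust: visit anyway
      return true;
    }
    return false;                                 // trust Bloom: prune
  }

  private void put(N node) {
    for (int h : hashes(node)) bloom.set(h);
  }

  private boolean mightContain(N node) {
    for (int h : hashes(node)) {
      if (!bloom.get(h)) return false;            // any clear bit: definitely unseen
    }
    return true;
  }

  private int[] hashes(N node) {                  // two cheap hash functions
    int h = node.hashCode();
    return new int[] {Math.floorMod(h, BITS), Math.floorMod(h * 31 + 17, BITS)};
  }
}
```

A tracker like this could then plug into the earlier traversal as the tracker predicate, e.g. `Walker.inGraph(page -> page.exploreLinks(), tracker::test)`.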
By configuring the probability of distrusting Bloom's "seen" answer (say 50%), we achieve a few things:

1. A frequently reached node is unlikely to be falsely pruned on every encounter.
2. Even if a rarely visited node is falsely pruned this time, it is not lost forever: the next run of the program gets another chance to reach it.
3. The higher we set this probability, the more redundant processing we must tolerate in exchange.

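Point 1 can be quantified. Assuming independent rolls with distrust probability p, a false-positive node that is reached k times stays pruned only if every roll trusts Bloom, which happens with probability (1 - p)^k. A tiny sketch (class name hypothetical):

```java
class PruneOdds {
  // Probability that a false-positive node is pruned on all k encounters,
  // assuming each encounter independently re-visits with probability p.
  static double prunedEveryTime(int encounters, double distrustProbability) {
    return Math.pow(1 - distrustProbability, encounters);
  }

  public static void main(String[] args) {
    // At p = 0.5, five encounters leave only a ~3% chance the node is never visited.
    System.out.println(prunedEveryTime(5, 0.5));  // prints 0.03125
  }
}
```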
In massive, lazily-discovered graphs, this strategy balances correctness with scalability—and gives every path a second chance to succeed.
