fix: name community splits from their members and dedupe community names #603
Open
SHudici wants to merge 1 commit into
Splitting an oversized community produced shards named "<parent>-sub<N>" with a hardcoded cohesion of 0.0. On a mid-size production graph, 14 of the 35 largest communities were opaque "services-load-subN" entries indistinguishable from each other, all reporting zero cohesion. Separately, nothing enforced name uniqueness: the same graph carried three distinct communities all named "services-job", and get_community resolves by name match, so two of the three were unreachable by name.

- _split_oversized now names each shard from its own members via _generate_community_name (falling back to "<parent>-<id>" only when members yield nothing) and computes real cohesion for every shard in one _compute_cohesion_batch pass over the full edge set.
- detect_communities runs a new _dedupe_community_names pass: within a collision group the largest community keeps the bare name; each other member gets its most distinctive keyword as a suffix (skipping keywords already in the name and candidates that would collide with any existing name), with a deterministic numeric fallback.
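The dedupe rules described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `dedupe_names`, the dict fields (`id`, `name`, `size`, `keywords`), the assumption that `keywords` is already ordered most distinctive first, the tie-break on `id`, and the sample data are all hypothetical.

```python
from collections import defaultdict


def dedupe_names(communities):
    """Make community names unique (hypothetical sketch, mutates in place).

    Each community is a dict with "id", "name", "size", and "keywords",
    where "keywords" is assumed to be ordered most distinctive first.
    """
    taken = {c["name"] for c in communities}
    groups = defaultdict(list)
    for c in communities:
        groups[c["name"]].append(c)

    for name in sorted(groups):  # sorted for a deterministic result
        group = groups[name]
        if len(group) < 2:
            continue  # unique names are left untouched
        # Largest community keeps the bare name; ties are broken by id.
        group.sort(key=lambda c: (-c["size"], c["id"]))
        name_parts = set(name.split("-"))
        for c in group[1:]:
            new_name = None
            for kw in c["keywords"]:
                if kw in name_parts:
                    continue  # keyword is already part of the name
                candidate = f"{name}-{kw}"
                if candidate not in taken:
                    new_name = candidate
                    break
            if new_name is None:
                # Numeric fallback: first free "<name>-<n>".
                n = 2
                while f"{name}-{n}" in taken:
                    n += 1
                new_name = f"{name}-{n}"
            c["name"] = new_name
            taken.add(new_name)
    return communities


# Made-up data: three communities collide on "services-job".
communities = [
    {"id": 1, "name": "services-job", "size": 40, "keywords": ["job", "queue"]},
    {"id": 2, "name": "services-job", "size": 25, "keywords": ["job", "retry"]},
    {"id": 3, "name": "services-job", "size": 10, "keywords": ["services", "retry"]},
    {"id": 4, "name": "services-load", "size": 5, "keywords": ["load"]},
]
names = [c["name"] for c in dedupe_names(communities)]
print(names)
# ['services-job', 'services-job-retry', 'services-job-2', 'services-load']
```

In this example the second community skips "job" because it is already in the name and takes "retry"; the third skips "services", finds "services-job-retry" already taken, and falls back to a numeric suffix. Every community then resolves to a distinct name.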
Problem
Two related defects make community output hard to trust on real graphs:
- _split_oversized names every shard "<parent>-sub<N>" and hardcodes "cohesion": 0.0. On a mid-size production graph (~7k nodes, 103 communities), 26 of the 103 communities were indistinguishable services-load-subN entries, all reporting zero cohesion, so the numbers looked like detection failures, not results.
- Nothing enforces name uniqueness. The same graph had services-normalize ×4, services-job ×3, services-fake ×3, services-sql ×2, and services-testclientlazyinit ×2 (14 communities sharing 5 names). get_community resolves by name match, so 9 of those 14 were unreachable by name.

Fix
- _split_oversized names each shard from its own members via the existing _generate_community_name (falling back to "<parent>-<id>" only when members yield nothing), and computes real cohesion for all shards in one _compute_cohesion_batch pass over the full edge set, the same cost profile as the top-level detectors.
- detect_communities runs a new _dedupe_community_names pass after splitting: within a collision group the largest community keeps the bare name; every other member is suffixed with its most distinctive keyword (skipping keywords already in the name and candidates that would collide with any existing name), with a deterministic numeric fallback.

Measured effect (same production repo)
Re-ran community detection on the same graph (103 communities before and after — the partition is untouched, only naming and cohesion change):
[Table: values not preserved; surviving row labels are "-subN placeholder names" and "get_community by name".]

The 26 former services-load-subN shards now carry member-derived names with real cohesion, e.g. services-upload (85 members, 0.29), services-rows (74, 0.42), lx-raw (35, 0.27), iag-market (8, 0.33), afklm-parse (7, 0.41).

Testing
New tests: a dumbbell-graph oversized community splits into member-named shards (no -subN), each with exact expected cohesion (15/16) and parent lineage preserved; dedup keeps the bare name on the largest community, suffixes the rest by keyword, skips keywords already in the name, avoids colliding with existing community names, falls back to a numeric suffix, and leaves unique names untouched. Full suite passes.

Composes with #600 (community naming quality) but does not require it; cut independently from main.
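As a reading aid for the 15/16 figure in the tests, here is a minimal sketch of a single-pass cohesion computation. It rests on two assumptions that this description does not state: that cohesion is internal edges divided by internal plus boundary edges, and that the dumbbell is two 6-node cliques joined by one bridge edge (chosen because it reproduces 15/16). The names `cohesion_batch` and `dumbbell` are hypothetical and are not the PR's `_compute_cohesion_batch`.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def cohesion_batch(edges, membership):
    """Cohesion of every community in one pass over the edge list.

    Assumed definition: internal / (internal + boundary), where a boundary
    edge has exactly one endpoint inside the community.
    """
    internal = Counter()
    boundary = Counter()
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu == cv:
            internal[cu] += 1
        else:
            boundary[cu] += 1
            boundary[cv] += 1

    scores = {}
    for cid in set(membership.values()):
        total = internal[cid] + boundary[cid]
        scores[cid] = Fraction(internal[cid], total) if total else Fraction(0)
    return scores


def dumbbell(n):
    """Two n-node cliques joined by a single bridge edge (assumed shape)."""
    left = [f"a{i}" for i in range(n)]
    right = [f"b{i}" for i in range(n)]
    edges = list(combinations(left, 2)) + list(combinations(right, 2))
    edges.append((left[0], right[0]))  # the bridge
    membership = {node: "left" for node in left}
    membership.update({node: "right" for node in right})
    return edges, membership


edges, membership = dumbbell(6)
scores = cohesion_batch(edges, membership)
print(scores["left"], scores["right"])  # 15/16 15/16
```

Under these assumptions each 6-node clique has 15 internal edges and one bridge edge crossing its boundary, which gives 15/16 per shard. Counting per edge keeps the cost linear in the number of edges regardless of how many shards there are.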
🤖 Generated with Claude Code