
Phase 3: Optimize Duplicate Detection (1.25-1.76× Speedup) #45

Open
dshkol wants to merge 2 commits into master from perf/single-pass-dedup



dshkol commented Nov 11, 2025

Summary

This PR optimizes duplicate detection in the GIS conversion functions by replacing a double-scan approach with a single-pass algorithm, giving a 1.25-1.76× speedup depending on input size and duplicate ratio.

Problem

The previous implementation scanned input twice to detect and remove duplicates:

if (anyDuplicated(gh) > 0L) {        # Scan 1: Check for duplicates
  idx = which(duplicated(gh))        # Scan 2: Find duplicate indices
  gh = gh[-idx]
}

This double scan is wasteful: a single duplicated() call already identifies every duplicate, so the anyDuplicated() pre-check repeats work whenever duplicates are present.
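
To make the overlap concrete, here is a small illustration of what each base R call returns. The geohash strings are made-up sample values, not data from the package.

```r
# Illustrative sample only: five geohashes, two of which repeat earlier entries.
gh = c("9q8yy", "9q8yz", "9q8yy", "dr5ru", "9q8yz")

# anyDuplicated() returns the index of the first duplicate (0 if there is none).
first_dup = anyDuplicated(gh)

# duplicated() returns one logical per element, TRUE for every repeat.
dup_flags = duplicated(gh)

# which() converts the flags into the integer indices used by gh[-idx].
dup_idx = which(dup_flags)

print(first_dup)
print(dup_flags)
print(dup_idx)
```

Everything `first_dup` reports can be read off `dup_flags`, which is why the pre-check can be dropped.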

Solution

Single-pass approach:

dup_idx = duplicated(gh)             # Single scan: Find duplicates
if (any(dup_idx)) {
  gh = gh[!dup_idx]
}
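
As a quick sanity check, the two snippets can be wrapped in functions and compared on a few edge cases. This is a minimal sketch: `dedup_old` and `dedup_new` are hypothetical wrappers written for this comparison, not functions from the package, and the inputs are made-up sample values.

```r
# Hypothetical wrappers around the old and new snippets, for comparison only.
dedup_old = function(gh) {
  if (anyDuplicated(gh) > 0L) {
    idx = which(duplicated(gh))
    gh = gh[-idx]
  }
  gh
}

dedup_new = function(gh) {
  dup_idx = duplicated(gh)
  if (any(dup_idx)) {
    gh = gh[!dup_idx]
  }
  gh
}

# Edge cases: some duplicates, none, all identical, and empty input.
cases = list(
  with_dups = c("9q8yy", "9q8yz", "9q8yy", "dr5ru", "9q8yz"),
  no_dups   = c("9q8yy", "9q8yz", "dr5ru"),
  all_same  = rep("9q8yy", 4L),
  empty     = character(0L)
)

# TRUE for a case when both versions return exactly the same vector.
same = vapply(cases, function(x) identical(dedup_old(x), dedup_new(x)), logical(1L))
print(same)
```

One difference worth noting: the old form depends on its guard, because `gh[-idx]` with an empty `idx` returns a zero-length vector, whereas `gh[!dup_idx]` leaves `gh` unchanged when nothing is flagged.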

Applied to three functions:

  • gh_to_sp
  • gh_to_spdf.default
  • gh_to_spdf.data.frame

Performance Results

Benchmarks vary two input characteristics:

  • Input sizes: 100, 1K, 10K, 100K geohashes
  • Duplicate ratios: 0%, 10%, 50%, 90%
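
A benchmark over this grid could be set up roughly as follows. This is a dependency-free sketch using base `system.time()`, not the script in benchmarks/dedup_benchmark.R; `make_input`, `old_way`, `new_way`, and `elapsed` are hypothetical helpers, and the generated strings are stand-in IDs rather than valid geohashes.

```r
set.seed(1)

# Hypothetical helper: build n strings of which a fraction `ratio` are repeats.
make_input = function(n, ratio) {
  n_unique = max(1L, as.integer(round(n * (1 - ratio))))
  pool = sprintf("id%07d", seq_len(n_unique))
  c(pool, sample(pool, n - n_unique, replace = TRUE))
}

# The two approaches under comparison.
old_way = function(gh) if (anyDuplicated(gh) > 0L) gh[-which(duplicated(gh))] else gh
new_way = function(gh) { d = duplicated(gh); if (any(d)) gh[!d] else gh }

# Total elapsed seconds for `reps` calls; coarse compared with microbenchmark.
elapsed = function(f, x, reps = 10L) {
  system.time(for (i in seq_len(reps)) f(x))[["elapsed"]]
}

# One row per (input size, duplicate ratio) combination.
grid = expand.grid(n = c(100L, 1000L, 10000L, 100000L),
                   ratio = c(0, 0.1, 0.5, 0.9))
grid$old_sec = NA_real_
grid$new_sec = NA_real_

for (k in seq_len(nrow(grid))) {
  x = make_input(grid$n[k], grid$ratio[k])
  grid$old_sec[k] = elapsed(old_way, x)
  grid$new_sec[k] = elapsed(new_way, x)
}
print(grid)
```

`system.time()` has limited resolution, so the small-input rows may read as zero; a dedicated timing package is the better choice for the sub-millisecond cases.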

Summary Statistics

  • Median speedup: 1.44×
  • Mean speedup: 1.48×
  • Best case: 1.76× (small inputs, n=100)
  • Large inputs (n=100K): 1.25-1.48×

By Duplicate Ratio

| Duplicates | Median Speedup |
|------------|----------------|
| 0%         | 1.38×          |
| 10%        | 1.36×          |
| 50%        | 1.44×          |
| 90%        | 1.51×          |

Speedup generally rises with the duplicate ratio, from 1.36-1.38× at 0-10% duplicates to 1.51× at 90% (more work saved).

Testing

✅ All 48 GIS tests pass
✅ No breaking changes to API or behavior
✅ Same output, same warnings, just faster
✅ Benchmark script and plots included

Visualizations

The PR includes:

  • benchmarks/dedup_speedup.png - Speedup factor across test cases
  • benchmarks/dedup_absolute.png - Absolute performance comparison
  • benchmarks/dedup_benchmark.R - Reproducible benchmark script

Changes

  • R/gis_tools.R: Updated duplicate detection in 3 functions
  • NEWS.md: Documented performance improvement
  • benchmarks/: Added microbenchmark analysis

Verification

# Run benchmarks yourself
source("benchmarks/dedup_benchmark.R")

# Run tests
devtools::test(filter = "gis")

Series Progress

  • ✅ Phase 1: Quick fixes and code quality (#43)
  • ✅ Phase 2: gh_covering optimization - 2-3× speedup (#44)
  • ✅ Phase 3: Duplicate detection optimization (this PR)
  • 🔜 Phase 4: CI/CD modernization (GitHub Actions)

🤖 Generated with Claude Code

Replaced double-scan approach (anyDuplicated + which(duplicated))
with single-pass (duplicated only) in gh_to_sp, gh_to_spdf.default,
and gh_to_spdf.data.frame.

Performance improvements:
- Small inputs (n=100): 1.68-1.76× faster
- Medium inputs (n=1k-10k): 1.37-1.55× faster
- Large inputs (n=100k): 1.25-1.48× faster

Median speedup: 1.44× across all input sizes and duplicate ratios
Better performance with higher duplicate ratios (1.51× at 90% duplicates)

All existing tests pass - no breaking changes to API or behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>