Phase 3: Optimize Duplicate Detection (1.25-1.76× Speedup) by dshkol · Pull Request #45 · MichaelChirico/geohashTools

dshkol · 2025-11-11T05:32:37Z

Summary

This PR optimizes duplicate detection in GIS conversion functions by replacing a double-scan approach with a single-pass algorithm, achieving 1.25-1.76× speedup depending on input characteristics.

Problem

The previous implementation scanned input twice to detect and remove duplicates:

if (anyDuplicated(gh) > 0L) {        # Scan 1: Check for duplicates
  idx = which(duplicated(gh))        # Scan 2: Find duplicate indices
  gh = gh[-idx]
}

This double-scan is wasteful since duplicated() already identifies all duplicates.

Solution

Single-pass approach:

dup_idx = duplicated(gh)             # Single scan: Find duplicates
if (any(dup_idx)) {
  gh = gh[!dup_idx]
}
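A minimal sketch of the equivalence between the two approaches, wrapped in standalone functions. The names `dedup_double_scan` and `dedup_single_pass` and the sample geohashes are illustrative assumptions, not code from the package:

```r
# Old approach: anyDuplicated() as a guard, then duplicated() again.
dedup_double_scan = function(gh) {
  if (anyDuplicated(gh) > 0L) {
    idx = which(duplicated(gh))
    gh = gh[-idx]
  }
  gh
}

# New approach: one duplicated() call, reused for both the check and the subset.
dedup_single_pass = function(gh) {
  dup_idx = duplicated(gh)
  if (any(dup_idx)) {
    gh = gh[!dup_idx]
  }
  gh
}

# Illustrative input: two of the five geohashes are repeats.
gh = c("9q8yy", "9q8yz", "9q8yy", "dr5ru", "9q8yz")
stopifnot(identical(dedup_double_scan(gh), dedup_single_pass(gh)))
dedup_single_pass(gh)
# [1] "9q8yy" "9q8yz" "dr5ru"
```

In the old form the guard is required for correctness, because `gh[-integer(0)]` returns an empty vector rather than `gh`. With a logical mask, `gh[!dup_idx]` is safe even when nothing is duplicated, so the `if` only skips an unnecessary copy.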

Applied to three functions:

gh_to_sp
gh_to_spdf.default
gh_to_spdf.data.frame

Performance Results

Comprehensive benchmarks across varying:

Input sizes: 100, 1K, 10K, 100K geohashes
Duplicate ratios: 0%, 10%, 50%, 90%
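A rough sketch of how such a grid could be timed with base R only. This is not the contents of benchmarks/dedup_benchmark.R; the helpers `make_input` and `time_total`, the synthetic strings, and the repetition counts are all assumptions made for illustration, and timings will vary by machine:

```r
# The two approaches under comparison (illustrative wrappers).
dedup_double_scan = function(gh) {
  if (anyDuplicated(gh) > 0L) {
    idx = which(duplicated(gh))
    gh = gh[-idx]
  }
  gh
}
dedup_single_pass = function(gh) {
  dup_idx = duplicated(gh)
  if (any(dup_idx)) gh = gh[!dup_idx]
  gh
}

# Hypothetical helper: n strings, of which a dup_ratio share are repeats.
# The strings are synthetic placeholders, not valid geohashes.
make_input = function(n, dup_ratio) {
  n_unique = max(1L, as.integer(round(n * (1 - dup_ratio))))
  pool = sprintf("gh%07d", seq_len(n_unique))
  c(pool, sample(pool, n - n_unique, replace = TRUE))
}

# Hypothetical helper: total elapsed seconds for `reps` calls of f(x).
time_total = function(f, x, reps) {
  system.time(for (i in seq_len(reps)) f(x))[["elapsed"]]
}

set.seed(1)
grid = expand.grid(
  n = c(100, 1e3, 1e4, 1e5),
  dup_ratio = c(0, 0.1, 0.5, 0.9)
)
grid$old = NA_real_
grid$new = NA_real_

for (i in seq_len(nrow(grid))) {
  x = make_input(grid$n[i], grid$dup_ratio[i])
  # More repetitions for small inputs so elapsed time is measurable.
  reps = max(10L, as.integer(1e6 / grid$n[i]))
  grid$old[i] = time_total(dedup_double_scan, x, reps)
  grid$new[i] = time_total(dedup_single_pass, x, reps)
}

grid$speedup = grid$old / grid$new
print(grid)
```

`system.time` has coarse resolution, so each case loops many times. A dedicated timing package would give tighter estimates for the smallest inputs.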

Summary Statistics

Median speedup: 1.44×
Mean speedup: 1.48×
Best case: 1.76× (small inputs)
Large inputs: 1.25-1.48×

By Duplicate Ratio

Duplicates	Median Speedup
0%	1.38×
10%	1.36×
50%	1.44×
90%	1.51×

Speedup tends to grow with the duplicate ratio, from about 1.36-1.38× at 0-10% duplicates to 1.51× at 90% (more work saved).

Testing

✅ All 48 GIS tests pass
✅ No breaking changes to API or behavior
✅ Same output, same warnings, just faster
✅ Benchmark script and plots included

Visualizations

The PR includes:

benchmarks/dedup_speedup.png - Speedup factor across test cases
benchmarks/dedup_absolute.png - Absolute performance comparison
benchmarks/dedup_benchmark.R - Reproducible benchmark script

Changes

R/gis_tools.R: Updated duplicate detection in 3 functions
NEWS.md: Documented performance improvement
benchmarks/: Added microbenchmark analysis

Verification

# Run benchmarks yourself
source("benchmarks/dedup_benchmark.R")

# Run tests
devtools::test(filter = "gis")

Series Progress

✅ Phase 1: Quick fixes and code quality (#43)
✅ Phase 2: gh_covering optimization - 2-3× speedup (#44)
✅ Phase 3: Duplicate detection optimization (this PR)
🔜 Phase 4: CI/CD modernization (GitHub Actions)

🤖 Generated with Claude Code

Replaced double-scan approach (anyDuplicated + which(duplicated)) with single-pass (duplicated only) in gh_to_sp, gh_to_spdf.default, and gh_to_spdf.data.frame.

Performance improvements:
- Small inputs (n=100): 1.68-1.76× faster
- Medium inputs (n=1k-10k): 1.37-1.55× faster
- Large inputs (n=100k): 1.25-1.48× faster

Median speedup: 1.44× across all input sizes and duplicate ratios
Better performance with higher duplicate ratios (1.51× at 90% duplicates)

All existing tests pass - no breaking changes to API or behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

dshkol mentioned this pull request Nov 11, 2025

Phase 4: Migrate from Travis CI to GitHub Actions #46

Open

Add benchmarks to .Rbuildignore

0bca7b5

dshkol wants to merge 2 commits into master from perf/single-pass-dedup
