fix(pypi): apply full PEP 503 normalization to package names #35
The old regex only replaced underscores with hyphens, but PEP 503 requires collapsing any run of [-_.] into a single hyphen. Packages like `my.package` or `my--package` were not normalized correctly, which could cause cache key mismatches and failed API lookups. Fixed in both purl.ts (PURL parsing) and pypi.ts (registry calls + dependency name output from parsePEP508).
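A minimal sketch of the difference, assuming hypothetical helper names (the PR's actual function names in `purl.ts` and `pypi.ts` are not shown here). Lowercasing is applied in both versions so that only the separator handling differs; whether the old code lowercased is an assumption.

```typescript
// Old behavior as described above: only underscores become hyphens.
// (Hypothetical name; lowercasing here is an assumption.)
function normalizeUnderscoresOnly(name: string): string {
  return name.replace(/_/g, "-").toLowerCase();
}

// PEP 503 normalization: collapse any run of "-", "_" or "." into a single
// hyphen, then lowercase. Mirrors re.sub(r"[-_.]+", "-", name).lower().
function normalizePep503(name: string): string {
  return name.replace(/[-_.]+/g, "-").toLowerCase();
}

// Spellings that the old rule leaves distinct but PEP 503 treats as equal.
const samples: string[] = ["my.package", "my--package", "my_-_package"];
for (const sample of samples) {
  console.log(sample, normalizeUnderscoresOnly(sample), normalizePep503(sample));
}
```

With the underscore-only rule, `my.package` and `my--package` pass through unchanged, which is the mismatch this PR addresses.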
No actionable comments were generated in the recent review.
Walkthrough

Core and registry layers now normalize PyPI package names per PEP 503 by collapsing any sequence of hyphens, underscores, or dots into a single hyphen. Previously only underscores were normalized. Tests verify the new behavior covers edge cases like consecutive separators.
Sequence Diagram

This PR updates PyPI package normalization to collapse any run of dash, underscore, or dot into a single dash. The diagram shows how both PURL parsing and PyPI dependency handling now produce the same canonical package names, preventing duplicate keys and mismatched lookups.

sequenceDiagram
participant Client
participant PURLParser
participant PyPIRegistry
Client->>PURLParser: Parse PyPI purl name
PURLParser->>PURLParser: Normalize with PEP 503 separator collapse
PURLParser-->>Client: Return canonical package name
Client->>PyPIRegistry: Build PyPI identifier from package name
PyPIRegistry->>PyPIRegistry: Normalize with same PEP 503 rule
PyPIRegistry-->>Client: Return canonical project URL and purl
Client->>PyPIRegistry: Parse requires dist dependency entry
PyPIRegistry->>PyPIRegistry: Normalize parsed dependency name
PyPIRegistry-->>Client: Return canonical dependency metadata
Generated by CodeAnt AI
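The agreement between the PURL parser and the registry shown in the sequence diagram can be sketched roughly as follows. The function names are hypothetical, and the PURL handling is deliberately simplified to `pkg:pypi/<name>` with an optional version, qualifiers, or subpath; it is not the project's actual parser.

```typescript
// PEP 503 normalization shared by both layers.
function normalizePep503(name: string): string {
  return name.replace(/[-_.]+/g, "-").toLowerCase();
}

// Hypothetical, simplified PURL handling: extract and normalize the name
// from "pkg:pypi/<name>[@version][?qualifiers][#subpath]".
function parsePypiPurlName(purl: string): string | null {
  const match = /^pkg:pypi\/([^@?#]+)/.exec(purl);
  const raw = match?.[1];
  if (raw === undefined) return null;
  return normalizePep503(decodeURIComponent(raw));
}

// Hypothetical registry helper: build the canonical project URL.
function pypiProjectUrl(name: string): string {
  return `https://pypi.org/project/${normalizePep503(name)}/`;
}

console.log(parsePypiPurlName("pkg:pypi/zope.interface@5.0"));
console.log(pypiProjectUrl("Zope_Interface"));
```

Because both helpers call the same normalization, a name taken from a PURL and a name passed directly to the registry end up with the same canonical form.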
No issues found across 3 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Requires human review: Changes core package name normalization logic for PyPI, which may impact data consistency and requires human verification of its impact on existing indexed data and lookups.
Architecture diagram
sequenceDiagram
participant C as Client/Scanner
participant P as PURL Parser
participant R as PyPI Registry
participant DB as Cache/Storage
Note over C,DB: Request Flow for PyPI Package "zope.interface"
C->>P: parsePURL("pkg:pypi/zope.interface")
P->>P: CHANGED: Apply full PEP 503 normalization<br/>(collapse runs of [-._] to single hyphen)
P-->>C: ParsedPURL { name: "zope-interface" }
C->>R: fetchMetadata("zope-interface")
R->>R: normalizeName("zope-interface")
R->>DB: Check cache for "zope-interface"
alt Cache Miss
R->>R: Fetch package JSON from PyPI
loop For each dependency in requires_dist
R->>R: NEW: normalizeName(depName) per PEP 503
end
R->>DB: Store metadata with normalized dependency names
end
DB-->>R: Metadata object
R-->>C: Normalized Package + Dependencies
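The cache-consistency point in the diagram above can be illustrated with a small sketch. The `pypi:` key prefix and the helper name are assumptions for illustration only; the project's real cache key format is not shown in this PR.

```typescript
// PEP 503 normalization: collapse runs of "-", "_", "." and lowercase.
function normalizePep503(name: string): string {
  return name.replace(/[-_.]+/g, "-").toLowerCase();
}

// Hypothetical helper: count how many distinct cache keys a list of
// spellings produces under a given normalization function.
function distinctCacheKeys(
  names: string[],
  normalize: (name: string) => string,
): number {
  const keys = new Set<string>(names.map((n) => `pypi:${normalize(n)}`));
  return keys.size;
}

const spellings: string[] = ["zope.interface", "zope-interface", "Zope_Interface"];

// Underscore-only rule (the previous behavior, lowercasing assumed).
console.log(distinctCacheKeys(spellings, (n) => n.replace(/_/g, "-").toLowerCase()));
// Full PEP 503 rule.
console.log(distinctCacheKeys(spellings, normalizePep503));
```

Under the underscore-only rule the dotted spelling keeps its own key, so the same package can occupy two cache entries; with the full rule all three spellings share one key.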
Package name normalization only replaced underscores with hyphens, but PEP 503 specifies collapsing any run of `[-_.]` into a single hyphen. A package like `zope.interface` or `my--package` would pass through with dots and double hyphens intact, which breaks cache key consistency and can cause duplicate entries for the same package.

The regex in both `purl.ts` and `pypi.ts` now matches the reference implementation from PEP 503: `re.sub(r"[-_.]+", "-", name).lower()`. Also normalized dependency names coming out of `parsePEP508`, so `requires_dist` entries like `zope.interface>=5.0` produce `zope-interface` instead of the raw dotted form.

Test plan

`my_-_package`) all collapse to single hyphen
`tsc --noEmit` clean
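The dependency-name handling described above can be sketched as follows. This is not the project's `parsePEP508`; `dependencyName` is a hypothetical, simplified extractor that only reads the leading name of a `requires_dist` entry and ignores extras, version specifiers, and markers.

```typescript
// PEP 503 normalization: collapse runs of "-", "_", "." and lowercase.
function normalizePep503(name: string): string {
  return name.replace(/[-_.]+/g, "-").toLowerCase();
}

// Hypothetical simplified extractor: take the leading PEP 508 name
// (starts and ends with a letter or digit; may contain ".", "_", "-")
// and return it in normalized form, or null if no name is present.
function dependencyName(requirement: string): string | null {
  const match = /^\s*([A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?)/.exec(requirement);
  const raw = match?.[1];
  return raw === undefined ? null : normalizePep503(raw);
}

console.log(dependencyName("zope.interface>=5.0"));
console.log(dependencyName("my_-_package[extra]; python_version < '3.8'"));
```

Normalizing at this point means dependency names stored from `requires_dist` use the same form as names that came in through PURL parsing.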