Description
Working with regex syntax is error-prone, and it should not be necessary for most simple routes.
For example, the common pattern '[^/]+' for a path segment should be abstracted away to improve readability and make new routes easier to add.
We could re-use some ideas from Django's recent URL routing system, which now replaces the old regex system: https://docs.djangoproject.com/en/3.1/topics/http/urls/#url-dispatcher
This system abstracts the regex complexity into "converters". For example, r'^articles/(?P<year>[0-9]{4})/$'
becomes articles/<yyyy:year>/ (with a custom "yyyy" converter registered, as in the Django documentation).
Using a current url2purl example:
pattern = r"https?://raw.githubusercontent.com/(?P<namespace>[^/]+)/(?P<name>[^/]+)/(?P<version>[^/]+)/(?P<subpath>.*)$"
Could become:
route = "https://raw.githubusercontent.com/<str:namespace>/<str:name>/<str:version>/<path:subpath>"
Much easier to write and to read.
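To illustrate the idea without any Django dependency, here is a minimal sketch of how such a route could be translated into a regex. The `CONVERTERS` table and the `route_to_regex` helper are hypothetical names made up for this example, not part of url2purl or Django, and only the `str` and `path` converters are covered.

```python
import re

# Hypothetical converter table: converter name -> regex fragment.
CONVERTERS = {
    "str": r"[^/]+",  # a single path segment
    "path": r".+",    # the rest of the URL, slashes included
}

# Matches placeholders such as <str:namespace> or <path:subpath>.
PLACEHOLDER = re.compile(r"<(?P<converter>[a-z]+):(?P<name>[A-Za-z_][A-Za-z0-9_]*)>")


def route_to_regex(route):
    """Translate a route with <converter:name> placeholders into a regex string."""
    parts = []
    position = 0
    for match in PLACEHOLDER.finditer(route):
        # Escape the literal text that sits between two placeholders.
        parts.append(re.escape(route[position:match.start()]))
        # An unknown converter name raises KeyError here.
        fragment = CONVERTERS[match["converter"]]
        parts.append(f"(?P<{match['name']}>{fragment})")
        position = match.end()
    # Escape whatever literal text follows the last placeholder.
    parts.append(re.escape(route[position:]))
    return "^" + "".join(parts) + "$"


route = "https://raw.githubusercontent.com/<str:namespace>/<str:name>/<str:version>/<path:subpath>"
pattern = route_to_regex(route)
url = "https://raw.githubusercontent.com/LeZuse/flex-sdk/master/frameworks/projects/mx/src/mx/containers/accordionClasses/AccordionHeader.as"
fields = re.match(pattern, url).groupdict()
print(fields["namespace"], fields["name"], fields["version"])
```

Adding a new kind of segment would then only mean adding one entry to the converter table, while each route stays free of regex syntax.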
Playing around with Django's private _route_to_regex helper:
import re
from django.urls.resolvers import _route_to_regex
route = "https://raw.githubusercontent.com/<str:namespace>/<str:name>/<str:version>/<path:subpath>"
pattern = _route_to_regex(route, is_endpoint=True)[0]
# -> "^https\\:\\/\\/raw\\.githubusercontent\\.com\\/(?P<namespace>[^/]+)\\/(?P<name>[^/]+)\\/(?P<version>[^/]+)\\/(?P<subpath>.+)$"
url = "https://raw.githubusercontent.com/LeZuse/flex-sdk/master/frameworks/projects/mx/src/mx/containers/accordionClasses/AccordionHeader.as"
re.compile(pattern, re.VERBOSE).match(url).groupdict()
# -> {'namespace': 'LeZuse', 'name': 'flex-sdk', 'version': 'master', 'subpath': 'frameworks/projects/mx/src/mx/containers/accordionClasses/AccordionHeader.as'}
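One behavioural difference is worth checking before migrating routes: the path converter yields '.+' for the subpath, while the hand-written pattern used '.*'. The sketch below uses only the standard `re` module. The second pattern is an equivalent form of the generated one with the redundant escapes for ':' and '/' dropped, and the URL with an empty subpath is an assumption chosen for illustration.

```python
import re

# Hand-written pattern currently used for this host.
old_pattern = r"https?://raw.githubusercontent.com/(?P<namespace>[^/]+)/(?P<name>[^/]+)/(?P<version>[^/]+)/(?P<subpath>.*)$"
# Equivalent of the pattern generated from the route.
new_pattern = r"^https://raw\.githubusercontent\.com/(?P<namespace>[^/]+)/(?P<name>[^/]+)/(?P<version>[^/]+)/(?P<subpath>.+)$"

# Illustrative URL that ends right after the version segment.
url_without_subpath = "https://raw.githubusercontent.com/LeZuse/flex-sdk/master/"

old_match = re.match(old_pattern, url_without_subpath)
new_match = re.match(new_pattern, url_without_subpath)

print(old_match is not None)  # the '.*' subpath accepts an empty string
print(new_match is not None)  # the '.+' subpath needs at least one character
```

If empty subpaths need to keep matching, a custom converter with a '.*' regex would be one option.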
We could add custom converters for the specific needs of purl: https://docs.djangoproject.com/en/3.1/topics/http/urls/#registering-custom-path-converters
Some parts, such as the (http|https) scheme, will need dedicated support, since the scheme and domain sections of a URL are not part of the Django system:
from django.urls.resolvers import _route_to_regex
from django.urls.converters import register_converter
from django.urls.converters import StringConverter
class ProtocolConverter(StringConverter):
    regex = '(http|https|ftp)'
register_converter(ProtocolConverter, 'protocol')
route = "<protocol:protocol>://raw.githubusercontent.com/<str:namespace>/<str:name>/<str:version>/<path:subpath>"
_route_to_regex(route, is_endpoint=True)[0]
# -> '^(?P<protocol>(http|https|ftp))\\:\\/\\/raw\\.githubusercontent\\.com\\/(?P<namespace>[^/]+)\\/(?P<name>[^/]+)\\/(?P<version>[^/]+)\\/(?P<subpath>.+)$'
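As a quick sanity check of the generated pattern, the sketch below runs it against a few schemes using only the standard `re` module. The pattern is an equivalent form of the output above with the redundant escapes dropped, and the "git" scheme is an arbitrary choice for a scheme that should be rejected.

```python
import re

# Equivalent of the generated pattern, split over two lines for readability.
pattern = re.compile(
    r"^(?P<protocol>(http|https|ftp))://raw\.githubusercontent\.com/"
    r"(?P<namespace>[^/]+)/(?P<name>[^/]+)/(?P<version>[^/]+)/(?P<subpath>.+)$"
)

tail = "raw.githubusercontent.com/LeZuse/flex-sdk/master/frameworks/projects/mx/src/mx/containers/accordionClasses/AccordionHeader.as"

# Map each scheme to the captured protocol, or None when the URL is rejected.
results = {}
for scheme in ("http", "https", "ftp", "git"):
    match = pattern.match(f"{scheme}://{tail}")
    results[scheme] = match["protocol"] if match else None
print(results)

# The parentheses inside the converter regex add one unnamed group
# on top of the five named ones.
group_count = len(pattern.match(f"https://{tail}").groups())
print(group_count)
```

Since the converter regex is already wrapped in a named group, writing it as 'http|https|ftp' without the inner parentheses would likely match the same URLs and avoid the extra unnamed group.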