Description
Problem
I think the package will be incomplete until we find a way to express groups of characters. Here's a challenge to express email pattern matching in rx:
Challenges
First of all, I don't know of a way to express a single "word" character (alnum + _). We used rx_word to denote \\w+, and perhaps it should have been rx_word_char() %>% rx_one_or_more().
rx_char <- function(.data = NULL, value = NULL) {
  if (missing(value))
    return(paste0(.data, "\\w"))
  paste0(.data, sanitize(value))
}

I also extended rx_count to cases of ranges of input:
rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    n[is.na(n)] <- ""
    return(paste0(.data, "{", n[1], ",", n[length(n)], "}"))
  }
  paste0(.data, "{", n, "}")
}

We also don't have a way to express word boundaries (\\b), and it might be useful to denote them. We shall call this function rx_word_edge (defined below as rx_word_start, with rx_word_end as an alias):
rx_word_start <- function(.data = NULL) {
  paste0(.data, "\\b")
}
rx_word_end <- rx_word_start

Finally, our biggest problem is that there's no way to express groups of characters other than through rx_any_of(). But if we pass other rx expressions to it, values will be sanitized twice, meaning that we will get four backslashes before each symbol instead of two.
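To make the double-sanitization problem concrete, here is a minimal base R sketch. `sanitize_sketch()` is a hypothetical stand-in for the package's internal `sanitize()`, so the exact number of backslashes produced by the real function may differ; the point is only that escaping an already escaped fragment changes what the pattern matches.

```r
# Hypothetical stand-in for the package's internal sanitize(); the real
# function may escape a different set of characters.
sanitize_sketch <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

once  <- sanitize_sketch(".")   # regex \.   : a literal dot
twice <- sanitize_sketch(once)  # regex \\\. : a literal backslash, then a dot

cat(once, twice, sep = "\n")

# The singly escaped fragment still matches a plain dot ...
grepl(paste0("a", once, "b"), "a.b", perl = TRUE)   # TRUE
# ... while the doubly escaped one now requires a backslash in the text.
grepl(paste0("a", twice, "b"), "a.b", perl = TRUE)  # FALSE
```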
# this function is exactly like rx_any_of() but without sanitization
rx_group <- function(.data = NULL, value) {
  paste0(.data, "[", value, "]")
}

Solution
Here's what it looks like when we put all pieces together:
x <- rx_word_start() %>%
  rx_group(
    rx() %>%
      rx_char() %>%
      rx_char(".%+-")
  ) %>%
  rx_one_or_more() %>%
  rx_char("@") %>%
  rx_group(
    rx() %>%
      rx_char() %>%
      rx_char(".-")
  ) %>%
  rx_one_or_more() %>%
  rx_char(".") %>%
  rx_alpha() %>%
  rx_count(2:6) %>%
  rx_word_end()
x
#> [1] "\\b[\\w\\.%\\+-]+@[\\w\\.-]+\\.[[:alpha:]]{2,6}\\b"
txt <- "This text contains email [email protected] and [email protected]. The latter is no longer valid."
regmatches(txt, gregexpr(x, txt, perl = TRUE))
#> [[1]]
#> [1] "[email protected]" "[email protected]"
stringr::str_extract_all(txt, x)
#> [[1]]
#> [1] "[email protected]" "[email protected]"

The code works, but I don't like it:
- Constructor rx looks redundant (I believe there's a way to get rid of it entirely using a specialized class, see below).
- It is not very clear what rx_one_or_more() is referring to. I wonder if all functions should have a rep argument with default option one and options some/any, in addition to what rx_count does today.
- Should rx_char() without arguments be called rx_wordchar?
- Should rx_char() with arguments be called rx_literal() or rx_plain?
- We should be very explicit about sanitization of arguments, to the extent that we should just mention: "input will be sanitized".
- rx_group is an artificial construct, a duplicate of rx_any_of but without sanitization. Here I see a couple of solutions.
  a. Allow "nested pipes" (as I have done above). Create an S3 class and this way detect when the value argument is not a plain character but an rx_string. Input of this class does not need to be sanitized, because it was sanitized at creation.
  b. Do not allow "nested pipes". Instead, define rx_any_of() to take ... and allow multiple arguments mixing functions and characters. Then the hypothetical pipe would look like this:
rx_word_edge() %>%
  rx_any_of(rx_wordchar(), ".%+-", rep = "some") %>%
  rx_literal("@") %>%
  rx_any_of(rx_wordchar(), ".-", rep = "some") %>%
  rx_literal(".") %>%
  rx_alpha(rep = 2:6) %>%
  rx_word_edge()

It's a lot to digest, but somehow everything relates to one particular problem. Happy to split the issue once we identify the parts worth tackling.
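For discussion, here is a rough base R sketch of how options (a) and (b) might fit together: fragments built by rx functions carry an S3 class and skip sanitization, while plain strings are escaped exactly once. Everything here is an assumption for illustration only: `sanitize_sketch()` stands in for the package's internal `sanitize()`, and `as_rx_string()`, `escape_once()`, and the `rep` values are hypothetical names, not existing API.

```r
# Hypothetical stand-in for the package's internal sanitize().
sanitize_sketch <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Tag a string as an already built regex fragment (hypothetical class name).
as_rx_string <- function(x) structure(x, class = c("rx_string", "character"))

# Escape plain character input only; rx_string input passes through untouched.
escape_once <- function(x) {
  if (inherits(x, "rx_string")) unclass(x) else sanitize_sketch(x)
}

# Hypothetical rx_wordchar(): appends \w and tags the result.
rx_wordchar <- function(.data = NULL) as_rx_string(paste0(.data, "\\w"))

# Option (b): rx_any_of() takes `...`, mixing rx fragments and literal strings.
rx_any_of <- function(.data = NULL, ..., rep = c("one", "some", "any")) {
  rep <- match.arg(rep)
  parts <- vapply(list(...), escape_once, character(1))
  suffix <- c(one = "", some = "+", any = "*")[[rep]]
  as_rx_string(paste0(.data, "[", paste(parts, collapse = ""), "]", suffix))
}

local_part <- rx_any_of(NULL, rx_wordchar(), ".%+-", rep = "some")
cat(unclass(local_part), "\n")  # [\w\.%\+-]+

# Quick check against a made-up local part of an address.
grepl(paste0("^", local_part, "$"), "first.last+tag", perl = TRUE)  # TRUE
```

Because the result of `rx_any_of()` is itself tagged as `rx_string`, it could be nested inside another call without being escaped again, which is the property option (a) is after.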
