separate_wider_regex with lookahead

2g32fytz  posted on 2023-06-27  in Other

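I have a data frame of sporting events (no assumptions about whitespace or the number of words) with an optional year that can be formatted in several different ways.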
tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221"))
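How can I use tidyr::separate_wider_regex to split event_optional_year into two columns, event and year? In this example, I want event to have the optional year removed, and year to equal NA, 12, 2016, and 2020/2021, respectively.
I tried using a positive lookahead in the regex: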

tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221")) |>
    tidyr::separate_wider_regex(
      "event_optional_year",
      c(
        event = ".*(?=(?:\\d.*\\d$)?)",
        year = "\\d.*\\d$"
      ),
      too_few = "align_start"
    )

But this gives:

  event                 year 
  <chr>                 <chr>
1 "World Championships" NA   
2 "Summer Olympics "    12   
3 "Olympics 20"         16   
4 "Olympics 2020/2"     21

Question: which regular expression would give me the desired result?

cx6n0qe3


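Unnamed patterns in separate_wider_regex() simplify this a bit: they are matched but do not produce an output column. event = ".*" is greedy, so it matches everything up to the last place where "\\s+(?=\\d)" can match, that is, one or more whitespace characters followed by a digit (assuming the year part starts with a digit). This handles whitespace inside event, but it assumes there is no whitespace inside year (more precisely, no whitespace followed by a digit).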
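As a quick illustration of the caveat about whitespace in the year, here is a minimal base R sketch. It is not part of the tidyr solution: the input string "Olympics 2020 / 2021" is made up for the example, and the three pattern pieces are simply glued into a single PCRE pattern with two capture groups.

```r
# Hypothetical input whose year part contains whitespace.
x <- "Olympics 2020 / 2021"

# event, separator and year combined into one pattern;
# perl = TRUE is needed because the default engine has no lookahead.
hit <- regmatches(x, regexec("^(.*)\\s+(?=\\d)(.*)$", x, perl = TRUE))[[1]]

# The greedy first group backtracks only as far as the LAST
# whitespace-before-digit, so part of the year ends up in event.
event <- hit[2]  # "Olympics 2020 /"
year  <- hit[3]  # "2021"
```

If years like this can occur in the data, the separator pattern would need to be made more specific.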

library(dplyr)
library(tidyr)
tibble(event_optional_year = c("World Championships", 
                               "Summer Olympics 12", 
                               "Olympics 2016", 
                               "Olympics 2020/221")) %>% 
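  separate_wider_regex(event_optional_year, 
                       c(event = ".*",          # greedy: everything before the separator
                         "\\s+(?=\\d)",         # unnamed, so dropped: whitespace before a digit
                         year = ".*$"),         # the rest of the string
                       too_few = "align_start") # no year: keep event, pad year with NA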
#> # A tibble: 4 × 2
#>   event               year    
#>   <chr>               <chr>   
#> 1 World Championships <NA>    
#> 2 Summer Olympics     12      
#> 3 Olympics            2016    
#> 4 Olympics            2020/221

Created on 2023-06-25 with reprex v2.0.2
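For checking the pattern without tidyr, the same split can be sketched in base R. This is only an approximation of what separate_wider_regex() does: the combined pattern and the manual fallback for rows without a year (standing in for too_few = "align_start") are assumptions made for this example.

```r
x <- c("World Championships", "Summer Olympics 12",
       "Olympics 2016", "Olympics 2020/221")

# event, separator and year as one PCRE pattern with two capture groups.
pattern <- "^(.*)\\s+(?=\\d)(.*)$"

# Each element is c(full match, event, year), or character(0) if no match.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Rows without a year do not match at all: keep the whole string as event
# and use NA for year.
parts <- t(mapply(function(s, hit) {
  if (length(hit) == 3) hit[2:3] else c(s, NA_character_)
}, x, m, USE.NAMES = FALSE))

result <- data.frame(event = parts[, 1], year = parts[, 2],
                     stringsAsFactors = FALSE)
print(result)
```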
