R regex: split a string on a multi-character delimiter, keeping the delimiter

mnemlml8  posted on 2023-01-21  in Other

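I am trying to parse a fairly complex string in R. It has to be split on multi-character delimiters, keeping part of each delimiter before the split point and part after it.
In words:

  • I have one long string made up of multiple entries. Each entry starts with a number of varying length, followed by "\t".
  • Each entry contains several paragraphs, which I also want to split. A paragraph ending follows this pattern: character, period, character (with no space).
  • I want to split off each entry, keeping the entry number at the start of the entry.
  • I want to split off each paragraph, keeping the period at the end of the preceding paragraph.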
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# desired output
[[1]] "1\tThis is a sentence. This is still part of the first paragraph."
[[2]] "This is now the second paragraph."
[[3]] "10\tThis is sentence number 1 of the tenth entry."
[[4]] "This is the second sentence now. Still the second paragraph."

I have found some answers here, but I have not been able to extend them to multi-character delimiters.

zsohkypk #1

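Here is one possible approach.
In a regex, \w is a word character: it matches a letter, a digit, or an underscore. (\\w\\.)(\\w) looks for a "." sitting directly between two word characters, and the parentheses divide the match into two groups that can be referenced later. "\\1###\\2" is the replacement pattern, where \1 and \2 refer to those two groups. So it inserts a dummy delimiter, ###, at every place where a split should happen; we can then split on ### without removing any of the original content.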

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
# The `_` placeholder for the native pipe requires R >= 4.2.
input |> gsub("(\\w\\.)(\\w)", "\\1###\\2", x = _) |>
         strsplit("###", fixed = TRUE) |> unlist()
#> [1] "1\tThis is a sentence. This is still part of the first paragraph."
#> [2] "This is now the second paragraph."                                
#> [3] "10\tThis is sentence number 1 of the tenth entry."                
#> [4] "This is the second sentence now. Still the second paragraph."

Created on 2023-01-21 with reprex v2.0.2
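The placeholder trick only works if ### never occurs in the text itself. A minimal sketch of a wrapper that checks for this first is shown below; the function name `split_keep_period` and the `marker` argument are hypothetical and not part of the answer above, and the shortened `input` is made up for illustration.

```r
# Hypothetical helper, not from the original answer.
split_keep_period <- function(x, marker = "###") {
  # Assumption: the marker must not already occur in the text,
  # otherwise the final split would also cut at those places.
  stopifnot(!any(grepl(marker, x, fixed = TRUE)))
  # Insert the marker between "<word char>." and the next word character.
  marked <- gsub("(\\w\\.)(\\w)", paste0("\\1", marker, "\\2"), x)
  # Split on the marker only, so no original character is lost.
  unlist(strsplit(marked, marker, fixed = TRUE))
}

# Shortened example string, made up for this sketch.
input <- "1\tFirst. Still first.Second paragraph.10\tTenth entry."
parts <- split_keep_period(input)
print(parts)
```

Because gsub matches do not overlap, a one-character paragraph such as the "b." in "a.b.c" would be missed by this approach; that does not happen in the question's data.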

bhmjp9jg #2

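Use strsplit, but with a lookbehind around a capture group. The pattern matches the zero-width position right after a period that is directly followed by a word character, so the split consumes no characters. Lookarounds require perl=TRUE.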

strsplit(input, '(?<=(\\.(?=\\w)))', perl=TRUE) |> unlist()
# [1] "1\tThis is a sentence. This is still part of the first paragraph."
# [2] "This is now the second paragraph."                                
# [3] "10\tThis is sentence number 1 of the tenth entry."                
# [4] "This is the second sentence now. Still the second paragraph."
Data:
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
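Both answers return a flat character vector. If the paragraphs should also be grouped under the entry they belong to, one possible follow-up is sketched below. It is an assumption on my part rather than something either answer proposes: a piece is treated as the start of a new entry when it begins with digits followed by a tab, and the variable names are made up for illustration.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Zero-width split after a period that is directly followed by a word character.
pieces <- unlist(strsplit(input, "(?<=\\.(?=\\w))", perl = TRUE))

# Assumption: a piece opens a new entry when it starts with digits and a tab.
starts_entry <- grepl("^[0-9]+\t", pieces)

# cumsum() turns the TRUE/FALSE flags into a running entry index,
# and split() collects the pieces that share an index.
entries <- unname(split(pieces, cumsum(starts_entry)))
print(entries)
```

The result is a list with one character vector per entry, each starting with the piece that carries the entry number.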
