regex R: Split string by multi-character delimiter and keep the delimiter

Posted by mnemlml8 on 2023-01-21 in Other


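I am trying to parse a fairly complex string in R. It needs to be split on a multi-character pattern, keeping parts of the delimiter on either side of the split.
In words:

I have a long string made up of multiple entries. Each entry starts with a number of varying length, followed by "\t".
Each entry contains multiple paragraphs, which I also want to split. A paragraph ending follows this pattern: character, period, character (with no space).
I want to split the entries, keeping the entry number at the start of each entry.
I want to split the paragraphs, keeping the period at the end of the preceding paragraph.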

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# desired output
[[1]] "1\tThis is a sentence. This is still part of the first paragraph."
[[2]] "This is now the second paragraph."
[[3]] "10\tThis is sentence number 1 of the tenth entry."
[[4]] "This is the second sentence now. Still the second paragraph."

I have found some answers here, but I have not been able to extend them to a multi-character delimiter.

regex

Source: https://stackoverflow.com/questions/75189811/r-split-string-by-multi-character-delimiter-and-keep-the-delimiter

2 answers

zsohkypk1#

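Here is one possible approach.
In a regex, \w is a word character: it matches a letter, a digit, or an underscore. (\\w\\.)(\\w) looks for a "." sitting directly between two word characters, and the parentheses split the match into two groups that can be referenced. "\\1###\\2" is the replacement pattern, where \1 and \2 refer to those groups. So it inserts a dummy delimiter at each place where a split should happen, and we can then split on ### without deleting any of the original content (assuming ### does not already occur in the text).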

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
# the x = _ placeholder for the native pipe requires R >= 4.2
input |> gsub("(\\w\\.)(\\w)", "\\1###\\2", x = _) |> 
         strsplit("###", fixed = TRUE) |> unlist()
#> [1] "1\tThis is a sentence. This is still part of the first paragraph."
#> [2] "This is now the second paragraph."                                
#> [3] "10\tThis is sentence number 1 of the tenth entry."                
#> [4] "This is the second sentence now. Still the second paragraph."

Created on 2023-01-21 with reprex v2.0.2
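For reuse, the marker-and-split idea above can be wrapped in a small function. This is only a sketch: the name `split_paragraphs` and its `marker` argument are made up for illustration, and it assumes a single input string in which the marker does not already occur.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Hypothetical helper (not part of any package): marker-and-split
split_paragraphs <- function(x, marker = "###") {
  # Assumption: the marker must not already appear in the text,
  # otherwise the final split would cut at the wrong places
  stopifnot(!grepl(marker, x, fixed = TRUE))
  # Insert the marker between "<word char>." and the next word char
  marked <- gsub("(\\w\\.)(\\w)", paste0("\\1", marker, "\\2"), x)
  # Split on the literal marker; the original characters are untouched
  strsplit(marked, marker, fixed = TRUE)[[1]]
}

parts <- split_paragraphs(input)
print(parts)
```

Pasting the pieces back together with `paste(parts, collapse = "")` should reproduce the input, which is a quick way to check that nothing was dropped.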


bhmjp9jg2#

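Using strsplit, but with a positive lookbehind around a capture group (which in turn contains a lookahead), so the split point is zero-width and no characters are removed.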

strsplit(input, '(?<=(\\.(?=\\w)))', perl=TRUE) |> unlist()
# [1] "1\tThis is a sentence. This is still part of the first paragraph."
# [2] "This is now the second paragraph."                                
# [3] "10\tThis is sentence number 1 of the tenth entry."                
# [4] "This is the second sentence now. Still the second paragraph."

Data:

input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."
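A variant of the same idea, offered as a sketch rather than the answerer's exact pattern: put the whole condition into lookarounds so that the split point is zero-width. The pattern below additionally requires a word character before the period, as described in the question. The names `pattern`, `parts` and `short` are chosen only for illustration.

```r
input <- "1\tThis is a sentence. This is still part of the first paragraph.This is now the second paragraph.10\tThis is sentence number 1 of the tenth entry.This is the second sentence now. Still the second paragraph."

# Zero-width split point: preceded by "<word char>." and followed by a word char
pattern <- "(?<=\\w\\.)(?=\\w)"

# perl = TRUE is needed because lookarounds are a PCRE feature
parts <- strsplit(input, pattern, perl = TRUE)[[1]]
print(parts)

# Back-to-back one-character paragraphs
short <- strsplit("a.b.c", pattern, perl = TRUE)[[1]]
print(short)
```

Because lookarounds do not consume the neighbouring characters, a string such as "a.b.c" is split at every period, whereas a consuming pattern like (\\w\\.)(\\w) uses up the "b" in its first match and misses the second split. Note that all of these patterns will also split inside decimal numbers such as 3.14, so the text should be checked for those first.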
