My problem is as follows:
data_example <-
c("Creditshelf Aktiengesellschaft / Key word(s): Forecast/Development of Sales\n\ncreditshelf Aktiengesellschaft",
"Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
"tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares",
"init innovation in traffic systems SE / Key word(s): Contract/Incoming Orders\n\ninit innovation in traffic systems SEs")
strings_to_extract <-
c("Key word(s): Word1/Word2",
"Key word(s): Word1/Word2 Word3",
"Key word(s): Word1 Word2 Word3",
"Key word(s): Word1/Word2/Word3",
"Key word(s): Number Word1/Word2",
"Key word(s): Number Word1 Word2",
"Key word(s): Word1 Number Word2")
The words are always separated by either a space or a "/". My attempt looks like this:
str_extract(data, "Key word[[:punct:]]{1}s[[:punct:]]{2} [[:alpha:]]{1,}|Key word[[:punct:]]{1}s[[:punct:]]{2} [[:alpha:]]{1,}[[:punct:]]{1,}[[:alpha:]]{1,}Key word[[:punct:]]{1}s[[:punct:]]{2} [[:alpha:]]{1,}[[:punct:]]{1,}[[:alpha:]]{1,}[[:punct:]]{1,}[[:alpha:]]{1,}")
It captures a good part of what I need, but I think it is far too complicated. Can anyone suggest a better way to do this?
3 Answers
Answer #1 (mkh04yzy)
You can use the following pattern with str_extract:
Key word\(s\):\s*\w+(?:\W+\w+){1,2}
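As a minimal sketch of how this pattern could be applied to the question's data with stringr: the names pattern_words and pattern_strict, and the [ /] variant itself, are illustrative assumptions and not part of the original answer. Because \W also matches line breaks, the match for an entry with only two keywords can run into the next line, so the variant allows only a space or "/" as separator, as described in the question.

```r
library(stringr)

data_example <- c(
  "Creditshelf Aktiengesellschaft / Key word(s): Forecast/Development of Sales\n\ncreditshelf Aktiengesellschaft",
  "Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
  "tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares",
  "init innovation in traffic systems SE / Key word(s): Contract/Incoming Orders\n\ninit innovation in traffic systems SEs"
)

# Pattern from the answer; backslashes are doubled inside an R string.
pattern_words <- "Key word\\(s\\):\\s*\\w+(?:\\W+\\w+){1,2}"
res_words <- str_extract(data_example, pattern_words)

# Assumed variant: only a space or "/" may separate the words,
# so the match cannot cross the "\n\n" that follows the keywords.
pattern_strict <- "Key word\\(s\\):\\s*\\w+(?:[ /]\\w+){1,2}"
res_strict <- str_extract(data_example, pattern_strict)

print(res_words)
print(res_strict)
```

Both patterns stop after at most three words, so "Forecast/Development of Sales" is cut to "Forecast/Development of". They differ only on the third entry, where the \W+ version continues past the line break into "tonies".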
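See the regex demo.
Key word\(s\): - the literal text Key word(s):
\s* - zero or more whitespace characters
\w+ - one or more word characters
(?:\W+\w+){1,2} - one or two sequences of one or more non-word characters followed by one or more word characters.
Answer #2 (j9per5c4)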
Your example data also lends itself to a different approach, because the keywords always end with \n. In that case you can do the following:
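As a sketch, assuming the pattern is simply Key word\\(s\\): followed directly by .+(?=\\n), the approach could look like this on a shortened copy of the question's data (data_short and res_line are illustrative names):

```r
library(stringr)

# Three entries from the question's data.
data_short <- c(
  "Creditshelf Aktiengesellschaft / Key word(s): Forecast/Development of Sales\n\ncreditshelf Aktiengesellschaft",
  "Swiss Life Holding AG / Key word(s): 9 Month figures\n\nSwiss Life increases fee income by 13%",
  "tonies SE / Key word(s): Capital Increase\n\ntonies SE: tonies successfully places 12,000,000 new class A shares"
)

# ".+" stops at the end of the line because "." does not match "\n" by default;
# the lookahead (?=\\n) requires a line break right after the match.
res_line <- str_extract(data_short, "Key word\\(s\\):.+(?=\\n)")
print(res_line)
```

Unlike a pattern limited to three words, this keeps everything up to the line break, so the first entry yields all of "Key word(s): Forecast/Development of Sales".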
Key word\\(s\\): matches the literal text Key word(s):
.+(?=\\n) matches all characters (.+) that are followed by \n
Note the double escaping (\\) required in R.
Answer #3 (qnzebej0)
If you do not want to include the phrase "Key word(s):" in the result, you can do the following:
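One way to do that, sketched here with a lookbehind, is to require the label before the match without making it part of the result. The lookbehind is an assumption rather than the answerer's confirmed code, and a capture group with str_match would also work; data_short and keywords are illustrative names.

```r
library(stringr)

# Two entries from the question's data.
data_short <- c(
  "Creditshelf Aktiengesellschaft / Key word(s): Forecast/Development of Sales\n\ncreditshelf Aktiengesellschaft",
  "init innovation in traffic systems SE / Key word(s): Contract/Incoming Orders\n\ninit innovation in traffic systems SEs"
)

# (?<=...) is a lookbehind: the label "Key word(s): " must precede the match
# but is not returned; (?=\\n) again ends the match at the line break.
keywords <- str_extract(data_short, "(?<=Key word\\(s\\): ).+(?=\\n)")
print(keywords)
```

The lookbehind includes the single space after the colon, so the extracted keywords start directly with the first word.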