How can I filter sentences out of a text file using a specific word list, in R or Python?

qyyhg6bp, posted on 2023-01-18 in Python

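I am having trouble correctly filtering sentences from EDGAR S-1 financial disclosures in RStudio using a specific list of terms.
Sample text from an S-1 filing: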

"We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.

In addition, any significant failure of our computer networks could disrupt our on-campus operations.

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems.

As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations.

As a result, our revenues and profitability may be materially adversely affected."

An example list of terms could be something like the following vector:

terms_list = c("institutions", "disaster", "error",...)

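The goal is to edit and overwrite the current text file so that every sentence that does not contain any of the specified words or terms (such as those above) is removed.
After filtering and overwriting, the text should look like this: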

"We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks. 

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students. 

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures. 

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems. "
Answer #1 (hgqdbh6s)

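If your data is one long string, in R you can:
1. Split the string with stringr::str_split
2. Combine the search terms into a single pattern with paste and find the matching sentences with grep
3. Recombine the string
An example using your data, read in as: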

strng <- "We run the online operations of our institutions on different platforms, which are in various stages of development. 

The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. 

Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.

In addition, any significant failure of our computer networks could disrupt our on-campus operations.

Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.

Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.

The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems.

As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations.

As a result, our revenues and profitability may be materially adversely affected."

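Here each sentence is separated by \n\n (a blank line), so we can split the string on that pattern. If your real data uses a different delimiter (for example a period), just substitute it.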

# Split on blank lines (two consecutive newlines)
strngSplit <- stringr::str_split(strng, "\n\n")[[1]]

# [1] "We run the online operations of our institutions on different platforms, which are in various stages of development. "                                                                                                                                                               
# [2] "The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. "                                                                                                                      
# [3] "Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks."                                                                                           
# [4] "In addition, any significant failure of our computer networks could disrupt our on-campus operations."                                                                                                                                                                               
# [5] "Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students."                                                                                                     
# [6] "Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures."                        
# [7] "The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."
# [8] "As a result of any of these events, we may not be able to conduct normal business operations and may be required to incur significant expenses in order to resume normal business operations."                                                                                       
# [9] "As a result, our revenues and profitability may be materially adversely affected."
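
Since the question also asks about Python, the same splitting step can be sketched there. This is not part of the original answer: `split_sentences` and the shortened sample text are hypothetical, and the period-based fallback is a naive assumption that will mis-split abbreviations such as "U.S.".

```python
import re

def split_sentences(text, on_blank_lines=True):
    """Split text into sentences, either on blank lines or after ., ! or ?."""
    # Blank-line mode tolerates stray whitespace on the separating line.
    pattern = r"\n\s*\n" if on_blank_lines else r"(?<=[.!?])\s+"
    # Strip each piece and drop empty ones.
    return [s.strip() for s in re.split(pattern, text) if s.strip()]

# Hypothetical, shortened version of the S-1 excerpt from the question.
text = (
    "We run the online operations of our institutions on different platforms. \n\n"
    "In addition, any significant failure could disrupt our on-campus operations.\n\n"
    "As a result, our revenues and profitability may be materially adversely affected."
)

sentences = split_sentences(text)
print(len(sentences))  # 3
```

Unlike the R output above, this sketch strips trailing spaces from each sentence; drop the `.strip()` calls if the original whitespace has to be preserved.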

Define the search terms:

terms_list <- c("institutions", "disaster", "error")

Find the sentences that contain any of the search terms:

idx <- grep(paste0(terms_list, collapse = "|"), strngSplit)
# [1] 1 2 3 5 6 7
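
A rough Python counterpart of the `grep(paste0(...))` step, for illustration only. Two things here are my assumptions rather than the answer's: the terms are passed through `re.escape` in case a term contains regex metacharacters, and a second, whole-word pattern is shown because plain alternation also matches substrings ("disaster" matches "disasters"). The sample sentences are shortened and hypothetical.

```python
import re

terms_list = ["institutions", "disaster", "error"]

sentences = [
    "Any computer system error or failure may result in unavailability.",
    "In addition, any significant failure could disrupt our on-campus operations.",
    "Events beyond our control include natural disasters.",
]

# Substring match, analogous to grep("institutions|disaster|error", x) in R.
alternation = "|".join(re.escape(t) for t in terms_list)
substring_pattern = re.compile(alternation)

# Whole-word variant: \b requires a word boundary on both sides of the term.
whole_word_pattern = re.compile(r"\b(?:" + alternation + r")\b")

# 1-based positions, to mirror what grep() returns in R.
idx = [i for i, s in enumerate(sentences, start=1) if substring_pattern.search(s)]
idx_whole = [i for i, s in enumerate(sentences, start=1) if whole_word_pattern.search(s)]

print(idx)        # [1, 3]
print(idx_whole)  # [1]
```

Which variant is appropriate depends on whether plurals and other inflected forms should count as hits; the expected output in the question is consistent with substring matching.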

You can keep the matching sentences as a vector (one sentence per element), or merge them back into a single paragraph, as follows:

strngVec <- strngSplit[idx]
# [1] "We run the online operations of our institutions on different platforms, which are in various stages of development. "                                                                                                                                                               
# [2] "The performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. "                                                                                                                      
# [3] "Any computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks."                                                                                           
# [4] "Individual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students."                                                                                                     
# [5] "Additionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures."                        
# [6] "The disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."

# or

strngParagraph <- paste(strngSplit[idx], collapse = "\n\n")
#[1] "We run the online operations of our institutions on different platforms, which are in various stages of development. \n\nThe performance and reliability of these online operations are critical to the reputation of our institutions and our ability to attract and retain students. \n\nAny computer system error or failure, or a sudden and significant increase in traffic on our institutions' computer networks may result in the unavailability of these computer networks.\n\nIndividual, sustained or repeated occurrences could significantly damage the reputation of our institutions' operations and result in a loss of potential or existing students.\n\nAdditionally, the computer systems and operations of our institutions are vulnerable to interruption or malfunction due to events beyond our control, including natural disasters and other catastrophic events and network and telecommunications failures.\n\nThe disaster recovery plans and backup systems that we have in place may not be effective in addressing a natural disaster or catastrophic event that results in the destruction or disruption of any of our critical business or information technology and infrastructure systems."
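
The question also asks to overwrite the text file, which the R steps above stop short of. A minimal Python sketch of the full read, filter and overwrite cycle follows. `filter_file`, the file name `s1_excerpt.txt` and the demo text are all hypothetical, and the sketch assumes sentences are separated by a blank line, as in the example.

```python
import re
import tempfile
from pathlib import Path

def filter_file(path, terms, sep="\n\n"):
    """Keep only the blocks of `path` that mention any of `terms`, then overwrite the file."""
    path = Path(path)
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    blocks = path.read_text(encoding="utf-8").split(sep)
    kept = [b for b in blocks if pattern.search(b)]
    # Overwrites the original file in place; work on a copy if you need the source.
    path.write_text(sep.join(kept), encoding="utf-8")
    return len(blocks), len(kept)

# Demo on a temporary file so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "s1_excerpt.txt"
    demo.write_text(
        "Our institutions run online. \n\n"
        "Any failure could disrupt operations.\n\n"
        "The disaster recovery plans may not be effective.",
        encoding="utf-8",
    )
    total, kept = filter_file(demo, ["institutions", "disaster", "error"])
    result = demo.read_text(encoding="utf-8")

print(total, kept)  # 3 2
```

Because the file is rewritten in place, it is worth running this against a backup first when processing real filings.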
