Importing text and document variables from an XML file with the readtext package

7ajki6be  posted on 2023-03-20  in Other

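I'm trying to import texts from an XML file with the readtext package and then create and explore a corpus with quanteda. After reading the help pages I know how to import the text, but I'd like to know whether docvars can be created from node attributes in the XML file.
Consider the following XML file: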

<corpus>
  <text author="Bill" date="1928-05-27">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed non risus. Suspendisse lectus tortor, dignissim sit amet, adipiscing nec, ultricies sed, dolor. Cras elementum ultrices diam. Maecenas ligula massa, varius a, semper congue, euismod non, mi. Proin porttitor, orci nec nonummy molestie, enim est eleifend mi, non fermentum diam nisl sit amet erat. Duis semper. Duis arcu massa, scelerisque vitae, consequat in, pretium a, enim. Pellentesque congue. Ut in risus volutpat libero pharetra tempor. Cras vestibulum bibendum augue.
  </text>
</corpus>

The content of the text node can be imported as the text field with an XPath expression:

library(readtext)
texts <- readtext("file.xml", text_field = ".//text", encoding = "utf-8", verbosity = 3)

But I don't know whether the node attributes (here, author and date) can be imported as docvars.
If so, help getting there would be much appreciated!

jgovgodb  #1

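readtext() itself does not seem to support this, but assuming each file contains only one corpus (the code below reads just the first <text> node via xml_find_first()), you can extract the attributes with xml2 and then add them to the readtext object: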

library(readtext)
library(xml2)
library(dplyr)

## for a single file:
texts <- readtext("file.xml", text_field = ".//text", encoding = "utf-8", verbosity = 3)
#> Reading texts from file.xml
#> , using glob pattern
#>  ... reading (xml) file: file.xml
#>  ... read 1 document
read_xml("file.xml") %>% 
  xml_find_first(".//text") %>% 
  xml_attrs() %>% 
  as.list() %>% 
  bind_cols(texts, .)
#> readtext object consisting of 1 document and 2 docvars.
#> # Description: df [1 × 4]
#>   doc_id   text                 author date      
#>   <chr>    <chr>                <chr>  <chr>     
#> 1 file.xml "\"\nLorem ips\"..." Bill   1928-05-27
## for a list of files:
library(purrr)
list.files(pattern = "file.*\\.xml") %>% 
  map(\(x) 
      bind_cols(
        readtext(x, text_field = ".//text", encoding = "utf-8"),
        read_xml(x) %>%  xml_find_first(".//text") %>%  xml_attrs() %>%  as.list())
      ) %>% 
  list_rbind()
#> readtext object consisting of 2 documents and 2 docvars.
#> # Description: df [2 × 4]
#>   doc_id    text                 author date      
#>   <chr>     <chr>                <chr>  <chr>     
#> 1 file.xml  "\"\nLorem ips\"..." Bill   1928-05-27
#> 2 file2.xml "\"\nSed non r\"..." Gill   1998-05-27

Created on 2023-03-16 with reprex v2.0.2
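
If a file holds more than one <text> node, xml_find_first() only picks up the first of them. The following is a minimal sketch of one way around that, using xml2 alone rather than readtext, to collect every node and its attributes into a data frame. The sample file, the doc_id naming scheme and the variable names are assumptions made for illustration and are not part of the original answer:

```r
library(xml2)

# Hypothetical sample file with two <text> nodes, written to a temporary path
xml_path <- tempfile(fileext = ".xml")
writeLines(c(
  '<corpus>',
  '  <text author="Bill" date="1928-05-27">Lorem ipsum dolor sit amet.</text>',
  '  <text author="Gill" date="1998-05-27">Sed non risus.</text>',
  '</corpus>'
), xml_path)

# Select every <text> node instead of only the first one
nodes <- xml_find_all(read_xml(xml_path), ".//text")

# One row per node: the text content plus its attributes as columns
docs <- data.frame(
  doc_id = paste0(basename(xml_path), ".", seq_along(nodes)),  # assumed naming scheme
  text   = trimws(xml_text(nodes)),
  author = xml_attr(nodes, "author"),
  date   = as.Date(xml_attr(nodes, "date")),  # assumes ISO dates (YYYY-MM-DD)
  stringsAsFactors = FALSE
)
```

A data frame shaped like this should be usable with quanteda::corpus(docs), which by default reads the doc_id and text columns and keeps the remaining columns (author, date) as docvars.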
