R: Scraping data from a table on a website using XPath

xjreopfe, posted on 2023-02-01 in Other

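I am trying to get the data from the table with all of the colors on the following website: https://azdeq.gov/aq/ytd?year=2022&pollutant=pm25&location=pinal&type=conc#mtop
Here is what I did:
1. Inspected the element and found the table
2. Copied the table's XPath: //*[@id="node-5748"]/div/div/div/div/div[5]
3. Spent more time on this simple code than I would like
4. The table comes back empty... using CSS selectors gives the same result
5. I have used other methods that reach some of the data, but the blanks don't show up, which throws things off.
Any help would be appreciated.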

library(rvest)

# Scrape the table from the website
table <- read_html("https://azdeq.gov/aq/ytd?year=2022&pollutant=pm25&location=pinal&type=conc#mtop") %>%
  html_nodes(xpath='//*[@id="node-5748"]/div/div/div/div/div[5]') %>%
  html_table()
Answer 1 (azpvetkf)

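The problem is that the data isn't stored in an actual HTML table but in a bunch of div tags, so html_table() can't seem to parse it. You can do some of the processing yourself.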
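One way to check this diagnosis is to count the `<table>` elements in the parsed document and compare that with the number of row-like `div` elements. The sketch below does not touch the live page: it builds a small hypothetical fragment with `minimal_html()`, borrowing the class names from the selectors used in this answer. The nesting of the fragment is an assumption about how the page is laid out, not a copy of it.

```r
library(rvest)

# Hypothetical fragment that mimics a div-based grid. The class names come
# from the answer's selectors; the structure itself is an assumption.
page <- minimal_html('
  <div class="divPollYTD">
    <div class="divPollRowYTD"><div>Day</div><div>Jan</div><div>Feb</div></div>
    <div class="divPollRowYTD"><div>1</div><div>7.1</div><div>11.2</div></div>
  </div>
')

# Count real <table> elements in the document
n_tables <- length(html_elements(page, "table"))

# Count the div elements that play the role of table rows
n_rows <- length(html_elements(page, "div.divPollRowYTD"))

n_tables
n_rows
```

If the same two counts on the real page show no `<table>` elements but many row divs, the extraction has to work on the div elements directly.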

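library(rvest)

# Download and parse the page
page <- read_html("https://azdeq.gov/aq/ytd?year=2022&pollutant=pm25&location=pinal&type=conc#mtop")
# Take the second div.divPollYTD block
block <- html_elements(page, "div.divPollYTD")[[2]]

# One character vector per row, one element per cell div
lapply(block %>% html_elements(".divPollRowYTD"), function(row)
  row %>% html_elements("div") %>% html_text()
) |>
  # Stack the rows into a matrix; the `_` placeholder needs R >= 4.2
  do.call("rbind", args = _)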
#       [,1]  [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]   [,10]  [,11]  [,12]  [,13] 
#  [1,] "Day" "Jan"  "Feb"  "Mar"  "Apr"  "May"  "Jun"  "Jul"  "Aug"  "Sep"  "Oct"  "Nov"  "Dec" 
#  [2,] "1"   "7.1"  "11.2" "6.9"  "5.7"  "13"   "20.3" "8.7"  "3.4"  "13.2" "3"    "8.4"  "7"   
#  [3,] "2"   "6.5"  "10.3" "15.1" "5.6"  "14.7" "18.9" "13.2" "3.7"  "15.2" "4.9"  "13.5" "8.2" 
#  [4,] "3"   "6.2"  "11"   "10.9" "5.3"  "12.4" "14.7" "7.6"  "5.1"  "3.5"  "57.8" "7.7"  "7.1" 
#  [5,] "4"   "8.3"  "11.7" "6.7"  "6.7"  "7.4"  "11.2" "10.5" "2.2"  "10.5" "4.9"  "6.9"  "3.7" 
#  [6,] "5"   "13.6" "7.1"  "9.4"  "6.8"  "16"   "8.9"  "7.2"  "4"    "19.5" "6.6"  "9.5"  "3.4" 
#   etc...

This returns a character matrix, but you can coerce it to a data.frame or whatever else you need.
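As a minimal sketch of that last step, the header row can be promoted to column names and the remaining cells converted to numbers. The matrix `m` below is a stand-in for the scraped result: it retypes the first few rows and columns of the output shown above so that the example runs without network access. The names `m` and `df` are illustrative only.

```r
# Stand-in for the scraped character matrix (first rows and columns only)
m <- rbind(
  c("Day", "Jan", "Feb",  "Mar"),
  c("1",   "7.1", "11.2", "6.9"),
  c("2",   "6.5", "10.3", "15.1"),
  c("3",   "6.2", "11",   "10.9")
)

# Drop the header row, then use it for the column names
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]

# Every cell is still text at this point; convert each column to numeric
df[] <- lapply(df, as.numeric)

str(df)
```

Assigning with `df[] <-` keeps the data.frame structure while replacing each column, which avoids rebuilding the object after the conversion.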
