R: How to write a correct request (httr::GET or httr::POST)

Posted by ikfrs5lh on 2022-12-25 in Other

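I want to use R to extract the links listed under "Information" on a web page. The data is public and scraping is not prohibited.
With an empty search on https://fsca.swissmedic.ch/mep/#/ followed by "export results", I get a CSV. However, this CSV does not include what I need (the links under "Information"). I thought I could use the unique identifiers in this CSV (e.g., Vk_20220224_16) to open the individual pages programmatically (e.g., https://fsca.swissmedic.ch/mep/#/?q=Vk_20220224_16) and then extract the links (with a function using html_attr("href") etc.).
Unfortunately, I cannot get the content of the individual pages. When I use httr::GET(url), I get an error (400 Bad Request).
I assume I get this error because my request does not include all the parameters the server needs. Is there a way to check which parameters are required so that the server understands my request?
Example: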

#library
library(httr)

# read html
html <- GET("https://fsca.swissmedic.ch/mep/#/?q=Vk_20220224_16")
html
#> Response [https://fsca.swissmedic.ch/mep/#/?q=Vk_20220224_16]
#>   Date: 2022-12-22 23:13
#>   Status: 400
#>   Content-Type: text/html; charset=iso-8859-1
#>   Size: 347 B
#> <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
#> <html><head>
#> <title>400 Bad Request</title>
#> </head><body>
#> <h1>Bad Request</h1>
#> <p>Your browser sent a request that this server could not understand.<br />
#> </p>
#> <p>Additionally, a 400 Bad Request
#> error was encountered while trying to use an ErrorDocument to handle the requ...
#> </body></html>

Created on 2022-12-23 with reprex v2.0.2
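
A note on the URL used above: everything after the `#` is a URL fragment, which a browser normally handles on the client side and does not send to the server as a query string. The base R sketch below splits the URL to show that `q=Vk_20220224_16` sits inside the fragment. The helper name `split_fragment` is made up for illustration, and this observation alone may not fully explain the 400 status.

```r
# Hypothetical helper: split a URL into the part before '#'
# and the fragment that follows it.
split_fragment <- function(url) {
  pos <- regexpr("#", url, fixed = TRUE)
  if (pos == -1) {
    # No '#' present, so there is no fragment
    return(list(request_part = url, fragment = NA_character_))
  }
  list(
    request_part = substr(url, 1, pos - 1),
    fragment     = substr(url, pos + 1, nchar(url))
  )
}

parts <- split_fragment("https://fsca.swissmedic.ch/mep/#/?q=Vk_20220224_16")
parts$request_part  # "https://fsca.swissmedic.ch/mep/"
parts$fragment      # "/?q=Vk_20220224_16"
```

Since the search term never reaches the server as an ordinary query parameter, the page content probably has to come from a separate request that the page's JavaScript makes.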

Update

I just learned that I can inspect the required parameters with Firefox.

So I tried httr::POST, but I still cannot get the page content/table; I only get "Loading...".

#library
library(httr)
library(jsonlite)
library(rvest)

# set parameter
body <- list(
  queryTerm="Vk_20220224_16",
  fromDate="",
  toDate="")

# POST
res <- POST(
       "https://fsca.swissmedic.ch/",
       body = jsonlite::toJSON(body),
       encode = "form",
       verbose()
       )

# get results
read_html(res)
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n<mep-app>Loading...</mep-app><script type="text/javascript" src=" ...

Created on 2022-12-23 with reprex v2.0.2
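
Two details of the POST attempt above may be worth checking: the body is serialized with toJSON() but sent with encode = "form", and the request goes to the site root rather than to a specific endpoint. The sketch below only looks at the serialization step, using the same field names as the attempt. It assumes jsonlite is installed; whether the server accepts exactly these fields is not confirmed here.

```r
library(jsonlite)

# Same fields as in the POST attempt above
body <- list(
  queryTerm = "Vk_20220224_16",
  fromDate  = "",
  toDate    = ""
)

# Default behaviour: length-1 vectors are written as JSON arrays
json_default <- toJSON(body)

# auto_unbox = TRUE writes length-1 vectors as JSON scalars instead
json_unboxed <- toJSON(body, auto_unbox = TRUE)

json_default
json_unboxed
```

With httr, a string like `json_unboxed` would typically be passed as `body = json_unboxed` together with `content_type_json()`, so that the server is told the payload is JSON rather than a form.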

Answer #1 (gudnpqoy)

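How to make the request with httr2. The page is rendered by JavaScript, which is why read_html() only returns "Loading..."; the data itself comes from the site's JSON API, which can be queried directly: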

library(httr2)
library(tidyverse)

# Send the search filters as a JSON body to the API endpoint
"https://fsca.swissmedic.ch/mep/api/publications/search?pageNumber=0&sortingProperty=PUBLICATION_DATE&direction=DESC" %>%
  request() %>%
  req_body_json(
    list(
      fromDate = "2022-12-04",
      toDate = "2022-12-20",
      queryTerm = NULL,
      onlyUpdates = "false"
    )
  ) %>%
  req_perform() %>%
  # Parse the JSON response; the results are in the "content" element
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("content") %>%
  as_tibble()

# A tibble: 37 × 9
   publikationsDatum swissmedicRef  hersteller                status status…¹ begru…² devices freig…³ docum…⁴
   <chr>             <chr>          <chr>                     <chr>  <chr>    <chr>   <list>  <lgl>   <list> 
 1 2022-12-07        Vk_20221202_03 Medtronic CoreValve LLC   UPDATE 2022-12… "Added… <df>    TRUE    <df>   
 2 2022-12-20        Vk_20221216_12 Biocartis NV              UPDATE 2022-12… "Added… <df>    TRUE    <df>   
 3 2022-12-20        Vk_20221219_01 Siemens Healthcare GmbH   FIRST  2022-12… ""      <df>    TRUE    <df>   
 4 2022-12-20        Vk_20221216_19 Medicvent AB              FIRST  2022-12… ""      <df>    TRUE    <df>   
 5 2022-12-20        Vk_20221213_25 Macopharma                FIRST  2022-12… ""      <df>    TRUE    <df>   
 6 2022-12-20        Vk_20221208_26 Spiegelberg GmbH & Co. KG FIRST  2022-12… ""      <df>    TRUE    <df>   
 7 2022-12-06        Vk_20221201_21 Fujifilm Corporation      UPDATE 2022-12… "Rewor… <df>    TRUE    <df>   
 8 2022-12-20        Vk_20221216_15 Maquet Critical Care AB   FIRST  2022-12… ""      <df>    TRUE    <df>   
 9 2022-12-20        Vk_20221216_17 Siemens Healthcare GmbH   FIRST  2022-12… ""      <df>    TRUE    <df>   
10 2022-12-20        Vk_20221215_03 custo med GmbH            FIRST  2022-12… ""      <df>    TRUE    <df>   
# … with 27 more rows, and abbreviated variable names ¹​statusDatum, ²​begruendung, ³​freigeschaltet,
#   ⁴​documents
# ℹ Use `print(n = ...)` to see more rows

Using a search term

"https://fsca.swissmedic.ch/mep/api/publications/search?pageNumber=0&sortingProperty=PUBLICATION_DATE&direction=DESC" %>%
  request() %>%
  req_body_json(list(
    fromDate = NULL,
    toDate = NULL,
    queryTerm = "Vk_20220224_16",
    onlyUpdates = "false"
  )) %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("content") %>%
  as_tibble() %>%
  unnest(everything()) 

# A tibble: 3 × 16
  publikatio…¹ swiss…² herst…³ status statu…⁴ begru…⁵ hande…⁶ sn    lot   swVer…⁷ model besch…⁸ freig…⁹ title
  <chr>        <chr>   <chr>   <chr>  <chr>   <chr>   <chr>   <chr> <chr> <chr>   <chr> <chr>   <lgl>   <chr>
1 2022-03-07   Vk_202… Siemen… FIRST  2022-0… ""      Artis … ""    ""    ""      ""    MD: St… TRUE    DE-1 
2 2022-03-07   Vk_202… Siemen… FIRST  2022-0… ""      Artis Q ""    ""    ""      ""    MD: St… TRUE    FR-1 
3 2022-03-07   Vk_202… Siemen… FIRST  2022-0… ""      Artis … ""    ""    ""      ""    MD: St… TRUE    IT-1 
# … with 2 more variables: language <chr>, version <chr>, and abbreviated variable names ¹​publikationsDatum,
#   ²​swissmedicRef, ³​hersteller, ⁴​statusDatum, ⁵​begruendung, ⁶​handelsname, ⁷​swVersion, ⁸​beschreibungKlasse,
#   ⁹​freigeschaltet
# ℹ Use `colnames()` to see all variable names
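
To repeat this search for many references, for example the identifiers from the exported CSV, the request could be wrapped in a function. The sketch below assumes httr2 is installed and reuses the endpoint and body fields shown above; the helper names `build_search_request` and `search_publication` are made up for illustration, and the network call is left commented out.

```r
library(httr2)

# Endpoint taken from the request above
search_url <- "https://fsca.swissmedic.ch/mep/api/publications/search?pageNumber=0&sortingProperty=PUBLICATION_DATE&direction=DESC"

# Hypothetical helper: build (but do not send) the search request
# for a single reference.
build_search_request <- function(ref) {
  req <- request(search_url)
  req_body_json(req, list(
    fromDate    = NULL,
    toDate      = NULL,
    queryTerm   = ref,
    onlyUpdates = "false"
  ))
}

# Hypothetical helper: send the request and return the "content" element.
search_publication <- function(ref) {
  resp <- req_perform(build_search_request(ref))
  resp_body_json(resp, simplifyVector = TRUE)$content
}

# Two references that appear in the outputs above
refs <- c("Vk_20220224_16", "Vk_20221202_03")

# One API call per reference (uncomment to run against the live API)
# results <- lapply(refs, search_publication)
```

Keeping request construction separate from `req_perform()` makes it possible to inspect a request before anything is sent.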

Download links for the documents, which you can loop or map over to download them automatically:

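# number_of_documents: how many documents are attached to the publication
# (3 for this reference, matching the three rows returned above)
number_of_documents <- 3
str_c("https://fsca.swissmedic.ch/mep/api/publications/", "Vk_20220224_16",
      "/documents/", 0:(number_of_documents - 1))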

[1] "https://fsca.swissmedic.ch/mep/api/publications/Vk_20220224_16/documents/0"
[2] "https://fsca.swissmedic.ch/mep/api/publications/Vk_20220224_16/documents/1"
[3] "https://fsca.swissmedic.ch/mep/api/publications/Vk_20220224_16/documents/2"
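
The loop mentioned above could look roughly like the following base R sketch. The helper names `document_urls` and `download_documents`, the ".pdf" extension, and the file naming scheme are all assumptions made for illustration; only the URL pattern comes from the links shown above. The download call is left commented out.

```r
# Hypothetical helper: build the document URLs for one publication reference.
document_urls <- function(ref, number_of_documents) {
  if (number_of_documents < 1) {
    return(character(0))
  }
  paste0(
    "https://fsca.swissmedic.ch/mep/api/publications/", ref,
    "/documents/", seq_len(number_of_documents) - 1  # indices start at 0
  )
}

# Hypothetical helper: download every document of one publication.
# The ".pdf" extension and the file names are assumptions.
download_documents <- function(ref, number_of_documents, dest_dir = tempdir()) {
  urls  <- document_urls(ref, number_of_documents)
  files <- file.path(dest_dir, paste0(ref, "_", seq_along(urls) - 1, ".pdf"))
  for (i in seq_along(urls)) {
    download.file(urls[i], files[i], mode = "wb")  # binary mode for documents
  }
  invisible(files)
}

urls <- document_urls("Vk_20220224_16", 3)
urls

# Uncomment to perform the downloads
# files <- download_documents("Vk_20220224_16", 3)
```

Using `seq_len()` rather than `0:(n - 1)` avoids the surprising `0:-1` sequence when a publication has no documents.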
