Perl: Scraping a website and downloading PDF files from it with Perl Selenium::Chrome

k5hmc34c, posted on 2023-10-24 in Perl

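I'm learning to scrape websites with Selenium::Chrome in Perl, and I'd like to know how to download all the PDF files from 2017 to 2021 from this site, https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021, and store them in a folder. This is what I have so far: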

use strict;
use warnings;
use Time::Piece;
use POSIX qw(strftime);
use Selenium::Chrome;
use File::Slurp;
use File::Copy qw(copy);
use File::Path;
use File::Path qw(make_path remove_tree);
use LWP::Simple;

my $collection_name = "mre_zen_test3";
make_path("$collection_name");

#DECLARE SELENIUM DRIVER
my $driver = Selenium::Chrome->new;

#NAVIGATE TO SITE
print "trying to get toc_url\n";
$driver->navigate('https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021');
sleep(8);

#GET PAGE SOURCE
my $toc_content = $driver->get_page_source();
$toc_content =~ s/[^\x00-\x7f]//g;
write_file("toc.html", $toc_content);
print "writing toc.html\n";
sleep(5);
$toc_content = read_file("toc.html");

This script only downloads the full page source of the site. I hope someone here can help and teach me. Thank you very much.

ngynwnxp (answer #1)

Here is some working code that will hopefully help you get started:

use warnings;
use strict;
use feature 'say';
use Path::Tiny;  # only convenience

use Selenium::Chrome;

my $base_url = q(https://www.fda.gov/drugs/)
    . q(warning-letters-and-notice-violation-letters-pharmaceutical-companies/);

my $show = 1;  # to see navigation. set to false for headless operation
    
# A little demo of how to set some browser options
my %chrome_capab = do {
    my @cfg = ($show) 
        ? ('window-position=960,10', 'window-size=950,1180')
        : 'headless';
    'extra_capabilities' => { 'goog:chromeOptions' => { args => [ @cfg ] } }
};

my $drv = Selenium::Chrome->new( %chrome_capab );

my @years = 2017..2021;
foreach my $year (@years) {
    my $url = $base_url . "untitled-letters-$year";

    $drv->get($url);

    say "\nPage title: ", $drv->get_title;
    sleep 1 if $show;

    my $elem = $drv->find_element(
        q{//li[contains(text(), 'PDF')]/a[contains(text(), 'Untitled Letter')]}
    );
    sleep 1 if $show;
    
    # Downloading the file is surprisingly not simple with Selenium itself 
    # (see text). But as we found the link we can get its url and then use 
    # Selenium-provided user-agent (it's LWP::UserAgent)
    my $href = $elem->get_attribute('href');
    say "pdf's url: $href";

    my $response = $drv->ua->get($href);
    die $response->status_line if not $response->is_success;

    say "Downloading 'Content-Type': ", $response->header('Content-Type'); 
    my $filename = "download_$year.pdf";
    say "Save as $filename";
    # spew_raw writes the bytes unchanged, which is what a binary PDF needs
    path($filename)->spew_raw( $response->decoded_content );
}

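This takes shortcuts, switches approaches, and sidesteps some issues (which need to be resolved to get fuller use out of this useful tool). It also downloads only one PDF from each page, since find_element returns just the first match. To get all of them, collect every matching link with find_elements and a broader XPath expression: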

my @hrefs = 
    map { $_->get_attribute('href') } 
    $drv->find_elements(
        # There's no ends-with(...) in XPath 1.0 (nor matches() with regex)
        q{//li[contains(text(), '(PDF)')]}
      . q{/a[starts-with(@href, '/media/') and contains(@href, '/download')]} 
    );

Now loop over the links, form the file names more carefully, and download each one as in the program above. I can fill in the gaps further if needed.
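As a rough sketch of that loop, the helpers below derive a file name from each link and skip files that already exist. The names filename_for and download_all, the "untitled_letter_YEAR_ID.pdf" naming scheme, and the sample media id are assumptions made for illustration; they are not part of the program above.

```perl
use strict;
use warnings;
use feature 'say';

# Build a file name from a link of the form .../media/<id>/download.
# The naming scheme is an assumption; adjust it to taste.
sub filename_for {
    my ($href, $year) = @_;
    my ($id) = $href =~ m{/media/(\d+)/download};
    die "Unexpected link format: $href\n" unless defined $id;
    return "untitled_letter_${year}_$id.pdf";
}

# Download each link with the given LWP::UserAgent object (for example
# $drv->ua), skipping files that already exist so nothing is overwritten.
sub download_all {
    my ($ua, $year, @hrefs) = @_;
    my @saved;
    for my $href (@hrefs) {
        my $file = filename_for($href, $year);
        if (-e $file) {
            say "Skipping existing $file";
            next;
        }
        # ':content_file' asks LWP to write the response body straight to disk
        my $response = $ua->get($href, ':content_file' => $file);
        die "$href: ", $response->status_line, "\n"
            if not $response->is_success;
        push @saved, $file;
    }
    return @saved;
}

# Offline demo with a made-up media id
say filename_for('https://www.fda.gov/media/111111/download', 2021);
```

Inside the per-year loop this could be called as download_all($drv->ua, $year, @hrefs), with @hrefs collected as shown earlier.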
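The code puts the PDF files on disk, in its working directory. Please check before running it, to make sure that nothing gets overwritten!
See Selenium::Remote::Driver for starters.
Note: Selenium isn't needed for this task; it's all plain HTTP requests, with no JavaScript. So LWP::UserAgent or Mojo would do. But I take it that you want to learn how to use Selenium, since it's often needed and it's useful.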
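For comparison, here is a minimal sketch of the link-extraction step without Selenium, using HTML::LinkExtor and URI. The pdf_links helper and the sample HTML are made up for illustration. Unlike the XPath expressions above, it filters only on the href pattern and does not check for the "(PDF)" text in the list item.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::LinkExtor;
use URI;

# Return absolute URLs of all <a href="..."> links that look like
# /media/<id>/download. $base is the URL of the page the HTML came from.
sub pdf_links {
    my ($html, $base) = @_;
    my @links;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            return unless $tag eq 'a' and defined $attr{href};
            push @links, URI->new_abs($attr{href}, $base)->as_string;
        }
    );
    $parser->parse($html);
    $parser->eof;
    return grep { m{/media/\d+/download} } @links;
}

# Offline demo on a hand-written fragment (not real page content)
my $sample = '<ul><li><a href="/media/111111/download">Untitled Letter</a> (PDF)</li>'
           . '<li><a href="/about">About</a></li></ul>';
say for pdf_links($sample, 'https://www.fda.gov/drugs/');

# With a live page one would fetch the HTML first, for example:
#   use LWP::UserAgent;
#   my $res  = LWP::UserAgent->new->get($url);
#   my @pdfs = pdf_links($res->decoded_content, $res->base);
```

Each returned URL can then be fetched with the same user agent's get method and written to disk.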
