Scrapy error when redirected to a PDF file: AttributeError: Response content isn't text

jslywgbw posted 12 months ago in Other

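I have a small spider hosted on Zyte, using Smart Proxy.
The spider is fairly simple: it starts crawling from a list of URLs.
The parse method uses a simple link extractor to extract links on the same domain, then crawls those links.
Simplified parse method: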

def parse(self, response):
    internal_le = LinkExtractor(
        allow_domains=tld_t,  # try to stay on domain (tld_t is a tldextract of response.url)
        unique=True,  # de-duplicate extracted links
        # deny_extensions=self.deny_extensions
    )
    in_links = internal_le.extract_links(response)

    for link in in_links:
        if link.url:
            yield Request(link.url, callback=self.parse)

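Since deny_extensions defaults to scrapy.linkextractors.IGNORED_EXTENSIONS, which includes PDF files, I assumed the spider would not crawl PDF links. However, some of my internal links are redirected to externally hosted PDF files.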
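One way to see why the default deny_extensions does not help here: an extension check can only look at the link URL found in the page, not at wherever that URL later redirects. The sketch below is a simplified stand-in for such a check (it is not Scrapy's implementation), the extension set is a small illustrative subset, and the two URLs are hypothetical, shaped like the ones in the logs.

```python
import posixpath
from urllib.parse import urlparse

# Small illustrative subset; Scrapy's own ignore list is longer.
DENIED = {".pdf", ".docx"}


def has_denied_extension(url: str) -> bool:
    """Return True if the URL path ends in one of the denied extensions."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext in DENIED


# Hypothetical URLs modeled on the ones in the logs.
link_in_page = "https://west.usd262.net/fs/resource-manager/view/some-id"
redirect_target = "https://resources.finalsite.net/images/v1/usd262net/x/Handbook.pdf"

print(has_denied_extension(link_in_page))     # False: nothing to filter on
print(has_denied_extension(redirect_target))  # True: but only known after the redirect
```

The link in the page carries no file extension, so it passes any extension filter; the .pdf suffix only appears in the redirect target.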
Here are some example log excerpts:

2023-11-27 23:41:01 ERROR   [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1691073836/usd262net/renyendq5njmpmol8iko/2023-2024USD262ElementarySchoolStudentHandbookFinaldocx.pdf> (referer: https://west.usd262.net/about)
2023-11-27 23:41:02 ERROR   [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/files/v1676910235/usd262net/kgtnfuk7buzu8zthtixk/102422RevisedSpanish22-23ElementaryHandbookSP4.docx> (referer: https://west.usd262.net/about)
2023-11-27 23:41:05 ERROR   [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1676649887/usd262net/adlo2wuxxpqa7pmnxmkx/MiddleSchoolBellSchedule22_23docx.pdf> (referer: https://vcms.usd262.net/about)
2023-11-27 23:41:10 ERROR   [scrapy.core.scraper] Spider error processing <GET https://resources.finalsite.net/images/v1691073617/usd262net/zjuysts6fymaf5gjumlc/VCMSStudentHandbook23-24Finaldocx.pdf> (referer: https://vcms.usd262.net/about)


Here is a single traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/defer.py", line 279, in iter_errback
    yield next(it)
          ^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
    return next(self.data)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/python.py", line 350, in __next__
    return next(self.data)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/sh_scrapy/middlewares.py", line 30, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/referer.py", line 352, in <genexpr>
    return (self._set_referer(r, response) for r in result or ())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/urllength.py", line 27, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/usr/local/lib/python3.11/site-packages/scrapy/spidermiddlewares/depth.py", line 31, in <genexpr>
    return (r for r in result or () if self._filter(r, response, spider))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 106, in process_sync
    for r in iterable:
  File "/tmp/unpacked-eggs/__main__.egg/edtech/spiders/edcrawler.py", line 117, in parse
    ex_links = external_le.extract_links(response)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/linkextractors/lxmlhtml.py", line 239, in extract_links
    base_url = get_base_url(response)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/utils/response.py", line 26, in get_base_url
    text = response.text[0:4096]
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy/http/response/__init__.py", line 137, in text
    raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text


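I have tried various changes to my link extractor, but presumably the links look fine to the link extractor. It is the redirect that causes the PDF file to be downloaded and the error to be raised.
Example start URL: start url
Link on that page extracted into 'in_links': extracted internal link
Redirect: redirect to a pdf document on web host
The only solution I can think of is a custom middleware that replaces the redirect middleware and looks for r"\.pdf$" in request.url.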
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect
Am I missing something? I am using the latest Scrapy, 2.11.0. I have also logged this as an issue on the Scrapy GitHub: github/6159.
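A lighter-weight alternative to replacing the redirect middleware may be to guard the callback itself. The traceback shows that reading response.text on a non-text response raises AttributeError, which means hasattr(response, "text") is False for such a response. The sketch below demonstrates the guard with two stand-in classes (FakeBinaryResponse and FakeHtmlResponse are hypothetical, not Scrapy classes); in a real spider the same check would sit at the top of parse. This only avoids the error: the file is still downloaded.

```python
class FakeBinaryResponse:
    """Stand-in that mimics the behavior seen in the traceback."""

    url = "https://example.com/file.pdf"

    @property
    def text(self):
        raise AttributeError("Response content isn't text")


class FakeHtmlResponse:
    """Stand-in for an ordinary HTML response."""

    url = "https://example.com/page"
    text = "<html><a href='/next'>next</a></html>"


def should_extract_links(response) -> bool:
    # hasattr() returns False when the property raises AttributeError.
    return hasattr(response, "text")


def parse(response):
    if not should_extract_links(response):
        return  # binary content: nothing to extract links from
    # A real callback would run the link extractor here.
    yield response.url
```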

Answer #1, by m1m5dgzv

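I think the best option in this case is to subclass RedirectMiddleware and add a few lines that check the Location header of the initial response for a .pdf extension, raising an IgnoreRequest exception when one is found.
It can all be done in just a few lines.
Example: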

import scrapy
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
from scrapy.exceptions import IgnoreRequest
from scrapy.linkextractors import LinkExtractor


class PDFRedirect(RedirectMiddleware):
    """Redirect middleware that drops redirects pointing at .pdf or .docx files."""

    def process_response(self, request, response, spider):
        # Inspect the redirect target before the redirect is followed.
        location = response.headers.get("Location", b"").decode()
        if location.lower().endswith((".pdf", ".docx")):
            print(f"IGNORING PDF {location}")
            raise IgnoreRequest(f"Redirect to a document file: {location}")
        return super().process_response(request, response, spider)


class PdfRedirectSpider(scrapy.Spider):
    name = "nopdfs"
    allowed_domains = ["west.usd262.net"]
    start_urls = ["https://west.usd262.net/about"]

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            # Disable the built-in middleware and register the subclass in its slot.
            "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
            PDFRedirect: 600,
        }
    }

    def parse(self, response):
        internal_le = LinkExtractor(unique=True)
        in_links = internal_le.extract_links(response)
        for link in in_links:
            if link.url:
                yield scrapy.Request(link.url, callback=self.parse)

Output

2023-11-30 15:00:35 [scrapy.core.engine] INFO: Spider opened
2023-11-30 15:00:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-30 15:00:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-30 15:00:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about> (referer: None)
2023-11-30 15:00:37 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://west.usd262.net/about> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.usd262.net': <GET https://www.usd262.net/staff-links1>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'abilene.usd262.net': <GET https://abilene.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'wheatland.usd262.net': <GET https://wheatland.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vcis.usd262.net': <GET https://vcis.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vcms.usd262.net': <GET https://vcms.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'vchs.usd262.net': <GET https://vchs.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'tlc.usd262.net': <GET https://tlc.usd262.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.facebook.com': <GET https://www.facebook.com/profile.php?id=100061273524317>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/USD262>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.youtube.com': <GET https://www.youtube.com/channel/UCD8AdyKpM44gpFzqIqBG9tw>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net-22-us-central1-01.preview.finalsitecdn.com': <GET https://usd262net-22-us-central1-01.preview.finalsitecdn.com/about/calendar1>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.finalsite.com': <GET https://www.finalsite.com>
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about#fsPageContent> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/files/v1676910235/usd262net/kgtnfuk7buzu8zthtixk/102422RevisedSpanish22-23ElementaryHandbookSP4.docx
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/privacy-policy> (referer: https://west.usd262.net/about)
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/images/v1686234716/usd262net/hdkhsv6qg1jzbobmkrxs/23-24elementaryschoolsupplylist8511in.pdf
2023-11-30 15:00:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/contact645-clone> from <GET https://west.usd262.net/fs/pages/3813>
2023-11-30 15:00:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/accessibility-statement> (referer: https://west.usd262.net/about)
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.valleycenterhornets.net': <GET https://www.valleycenterhornets.net>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sideline.bsnsports.com': <GET https://sideline.bsnsports.com/schools/kansas/valleycenter/valley-center-high-school/design/picker>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net-34-us-central1-01.preview.finalsitecdn.com': <GET https://usd262net-34-us-central1-01.preview.finalsitecdn.com/about>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'calendar.google.com': <GET https://calendar.google.com/calendar/embed?src=usd262.net_b07qmrijq7dq09a7s93u4qq7u0%40group.calendar.google.com&ctz=America%2FChicago>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'datacentral.ksde.org': <GET https://datacentral.ksde.org/accountability.aspx>
IGNORING PDF https://resources.finalsite.net/images/v1691073836/usd262net/renyendq5njmpmol8iko/2023-2024USD262ElementarySchoolStudentHandbookFinaldocx.pdf
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.w3.org': <GET http://www.w3.org/TR/WCAG/>
2023-11-30 15:00:37 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'accessibilitystatementgenerator.com': <GET http://accessibilitystatementgenerator.com>
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/parent756> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/pto> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/site-map> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/footer-links> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.infinitecampus.org': <GET https://usd262.infinitecampus.org/campus/portal/valleycenter.jsp>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262net.finalsite.com': <GET https://usd262net.finalsite.com/fs/resource-manager/view/383a8f18-5ef9-4f48-815e-030300759293>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'docs.google.com': <GET https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRi840waukqIIVzL9eM4X9EoxwIsGKyuwsu83A852Mv6dMnPmjQSF0HKFRrMmpw1g/pubhtml>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.incidentiq.com': <GET https://usd262.incidentiq.com/>
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'educatekansas.org': <GET https://educatekansas.org/>
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/volunteering> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/ymca-childcare> (referer: https://west.usd262.net/about)
2023-11-30 15:00:38 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'ymcawichita.org': <GET https://ymcawichita.org/programs/child-care-and-camps/before-and-after-school>
2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/emergency-safety-interventions-bullying> (referer: https://west.usd262.net/about)
2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/librarymedia-center> (referer: https://west.usd262.net/about)
2023-11-30 15:00:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/volunteer-information> (referer: https://west.usd262.net/about)
2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'search.follettsoftware.com': <GET https://search.follettsoftware.com/metasearch/ui/43691>
2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'bookfairs.scholastic.com': <GET https://bookfairs.scholastic.com/bf/westelementaryschool11>
2023-11-30 15:00:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.commonsensemedia.org': <GET https://www.commonsensemedia.org/>
2023-11-30 15:00:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/news> from <GET https://west.usd262.net/fs/pages/3814>
IGNORING PDF https://resources.finalsite.net/images/v1680193574/usd262net/skenieqeiwealjrpl210/33023ActivationInstructionforCampusPortal3.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673804004/usd262net/i0mi93dw4rp63jsem0jt/PTOMeetingMinutes1220docx.pdf
2023-11-30 15:00:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/contact645-clone> (referer: https://west.usd262.net/about)
2023-11-30 15:00:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/sraff-directory> (referer: https://west.usd262.net/about)
2023-11-30 15:00:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/schools> from <GET https://west.usd262.net/fs/pages/2799>
IGNORING PDF https://resources.finalsite.net/images/v1673803989/usd262net/bvokssior5jikny5ggwk/PTOMeetingMinutes2120docx.pdf
2023-11-30 15:00:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/about/report-bullying-safety-concerns> from <GET https://west.usd262.net/fs/pages/3560>
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/counseling> (referer: https://west.usd262.net/about)
2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.ksde.org': <GET http://www.ksde.org/Default.aspx?tabid=149>
2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.homeworkkansas.org': <GET http://www.homeworkkansas.org/>
IGNORING PDF https://resources.finalsite.net/images/v1673803943/usd262net/okuntylyovx2hn260gmt/PTOMeetingMinutes1919docx.pdf
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/nurses-page> (referer: https://west.usd262.net/about)
2023-11-30 15:00:41 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.kidshealth.org': <GET http://www.kidshealth.org/parent/firstaid_safe/>
IGNORING PDF https://resources.finalsite.net/images/v1673803972/usd262net/s8sipel9qrbd1kwqrklg/FebPTOMeetingMinutes1120docx.pdf
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/document-library> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/images/v1673803909/usd262net/zcygtqo4nk94alxapei2/PTOMeetingMinutes1719.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803957/usd262net/kpvyrmpdxbbwic1o9mkw/1-21-20PTOMeetingMinutes21201docx.pdf
2023-11-30 15:00:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://west.usd262.net/about/administration> (referer: https://west.usd262.net/about)
IGNORING PDF https://resources.finalsite.net/images/v1673803928/usd262net/aprcr3g9v0x76agcz81m/PTOMeetingMinutes2219docx.pdf
2023-11-30 15:00:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://west.usd262.net/about/sraff-directory> from <GET https://west.usd262.net/staff-directory>
2023-11-30 15:00:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://west.usd262.net> from <GET https://west.usd262.net/fs/resource-manager/view/446cdd83-e743-495f-b0f1-91318deef052>
IGNORING PDF https://resources.finalsite.net/images/v1673803888/usd262net/sun8frlao9rk4gftotnp/PTOMeetingMinutes2719.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803867/usd262net/cev2livmjpacfgyq0qrc/4202021PTOMeetingminutes.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803137/usd262net/km3nodsbggl5taziszk3/MicrosoftWord-TotallyCoolElementarySchool_1.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803819/usd262net/k5xboy8whfnymanvvuyk/MeetingminutesFeb.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803121/usd262net/u5ctbelnubnhgz9gw6wa/WestElementaryCounselingBrochurefinal-2008_1.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673784917/usd262net/lzwphtnhcoqjp9thds6n/FactSheet-TitleI-ParentInvolvement.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803778/usd262net/rwb5tlbdaap8e1wjiizl/NovemberPTOMeetingMinutes.pdf
2023-11-30 15:00:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.usd262.net/enrollment/student-health-information> from <GET https://west.usd262.net/fs/pages/3541>
IGNORING PDF https://resources.finalsite.net/images/v1673784914/usd262net/dkc6smzfpcylihjyl0mx/ESIBoardPolicies-19.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803487/usd262net/eyojl1bd1qdj3lp8bjki/RICE-RestIceCompresionElevation_1.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673784913/usd262net/mlrg9xwsotm3a6ccmazy/ESI-DocumentsforWebsite-19.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803799/usd262net/rczvldr6kah713hisfwx/JanuaryPTOMeetingMinutes.pdf
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/report-bullying-safety-concerns> (referer: https://west.usd262.net/about/emergency-safety-interventions-bullying)
IGNORING PDF https://resources.finalsite.net/images/v1673784915/usd262net/u4efohzm82jnbzzsqxd3/FERPANotificationofRights.pdf
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.p3tips.com': <GET https://www.p3tips.com/tipform.aspx?ID=217>
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.crisistextline.org': <GET https://www.crisistextline.org/texting-in/>
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.kbi.ks.gov': <GET https://www.kbi.ks.gov/sar>
2023-11-30 15:00:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'usd262.onlinesafetyhub.io': <GET https://usd262.onlinesafetyhub.io/>
IGNORING PDF https://resources.finalsite.net/images/v1673803764/usd262net/yswpmxj1ivn5dr4onfue/OctoberPTOmeeting.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803749/usd262net/cfcylzqvhzvvsacorltx/SeptemberPTOmeetingnotes.pdf
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/about/news> (referer: https://west.usd262.net/)
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://west.usd262.net/fs/pages/3508> (referer: https://west.usd262.net/about/document-library)
2023-11-30 15:00:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/schools> (referer: https://west.usd262.net/)
IGNORING PDF https://resources.finalsite.net/images/v1673803706/usd262net/doln0ockhdm39lkfntxm/NovPTOmeetingminutes162021.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803735/usd262net/zltnhhnyt2jz1fi8k8gy/MarchPTOMeetingMinutes222022.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803693/usd262net/fel78cnko0opxf96lefx/OctPTOMeetingminutes192021.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803721/usd262net/idowphs1sgrl2xrnellg/JanPTOMeetingMinutes1820221.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803679/usd262net/e6jc2mep0odspayjzxmo/SeptPTOMeetingMinutes.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803649/usd262net/eultmjehz33n29yf5nqt/PTOMeetingMinutes2020221.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803664/usd262net/x6uh9b9s0lxpm3h8nmdx/AugustthPTOMinutes.pdf
IGNORING PDF https://resources.finalsite.net/images/v1673803634/usd262net/izumknwsghgbzuouu4ui/PTOMeetingMinutes2320221.pdf
2023-11-30 15:00:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.usd262.net/enrollment/student-health-information> (referer: https://west.usd262.net/about/nurses-page)
2023-11-30 15:00:44 [scrapy.core.engine] INFO: Closing spider (finished)
2023-11-30 15:00:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 38365,
 'downloader/request_count': 65,
 'downloader/request_method_count/GET': 65,
 'downloader/response_bytes': 248536,
 'downloader/response_count': 65,
 'downloader/response_status_count/200': 24,
 'downloader/response_status_count/301': 6,
 'downloader/response_status_count/302': 34,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 402,
 'elapsed_time_seconds': 8.931808,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 11, 30, 23, 0, 44, 907376),
 'httpcompression/response_bytes': 795436,
 'httpcompression/response_count': 25,
 'log_count/DEBUG': 69,
 'log_count/INFO': 10,
 'offsite/domains': 35,
 'offsite/filtered': 962,
 'request_depth_max': 3,
 'response_received_count': 25,
 'scheduler/dequeued': 65,
 'scheduler/dequeued/memory': 65,
 'scheduler/enqueued': 65,
 'scheduler/enqueued/memory': 65,
 'start_time': datetime.datetime(2023, 11, 30, 23, 0, 35, 975568)}
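
The suffix test on the Location header in the example above would miss a target that carries a query string or fragment after the extension, and it does not resolve relative Location values. A minimal sketch of a stricter check follows, assuming a hypothetical helper named is_document_redirect and an illustrative extension list; it uses only the standard library and could stand in for the suffix test inside process_response.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# Illustrative list; extend it with whatever file types should be skipped.
SKIPPED_EXTENSIONS = (".pdf", ".docx")


def is_document_redirect(request_url: str, location: str) -> bool:
    """Return True if the redirect target's path ends in a skipped extension."""
    if not location:
        return False
    # Location may be relative, so resolve it against the requesting URL.
    absolute = urljoin(request_url, location)
    # Look only at the path, so query strings and fragments are ignored.
    ext = posixpath.splitext(urlparse(absolute).path)[1].lower()
    return ext in SKIPPED_EXTENSIONS
```

For instance, is_document_redirect("https://west.usd262.net/about", "/files/notes.docx#page=2") is true, while a redirect to https://www.usd262.net/schools is not matched.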
