How to scrape all the data from a table using Scrapy

jmp7cifd  posted on 2022-11-09  in  Other

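I'm new to Scrapy. I just took a course, wrote some code, and understand it to some extent. The problem I'm facing is that my spider only scrapes the first row of the table.
Here is the code I tried.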

import scrapy


class PostsSpider(scrapy.Spider):
    name = "posts"

    start_urls = [
        'https://publicholidays.com.bd/2022-dates/'
    ]

    def parse(self, response):
        for post in response.css('table'):
            yield {
                'date': post.css('td::text').getall()[0],
                'day': post.css('td::text').getall()[1],
                'event': post.css('tr td a::text').getall()[0]
            }

When I run the crawl, I get:
{"date": "21 Feb", "day": "Mon", "event": "Shaheed Day"}
How can I get all the data in the table?

nimxete2  #1

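The problem was in the CSS selectors. The loop ran over 'table', so it executed once per table and picked up only the first row's cells. Selecting the rows instead ('.publicholidays tbody tr') yields one item per row. It works fine now, and you can run the code directly.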

import scrapy
from scrapy.crawler import CrawlerProcess


class PostsSpider(scrapy.Spider):
    name = "posts"

    start_urls = ['https://publicholidays.com.bd/2022-dates']

    def parse(self, response):
        # Iterate over table rows, not whole tables, so each row yields one item.
        for post in response.css('.publicholidays tbody tr'):
            yield {
                'date': post.css('td:nth-child(1)::text').get(),
                'day': post.css('td:nth-child(2)::text').get(),
                # The event name sits in a link on most rows, otherwise in a span.
                'event': (post.css('td:nth-child(3) a::text').get()
                          or post.css('td:nth-child(3) span::text').get()),
            }


if __name__ == "__main__":
    # Run the spider as a plain script, without the "scrapy crawl" command.
    process = CrawlerProcess()
    process.crawl(PostsSpider)
    process.start()

Output:

{'date': '21 Feb', 'day': 'Mon', 'event': 'Shaheed Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '17 Mar', 'day': 'Thu', 'event': "Sheikh Mujibur Rahman's Birthday"}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '18 Mar', 'day': 'Fri', 'event': 'Shab e-Barat'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '26 Mar', 'day': 'Sat', 'event': 'Independence Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '14 Apr', 'day': 'Thu', 'event': 'Bengali New Year'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '28 Apr', 'day': 'Thu', 'event': 'Laylat al-Qadr'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '29 Apr', 'day': 'Fri', 'event': 'Jumatul Bidah'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '1 May', 'day': 'Sun', 'event': 'May Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '2 May', 'day': 'Mon', 'event': 'Eid ul-Fitr Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '3 May', 'day': 'Tue', 'event': 'Eid ul-Fitr'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '4 May', 'day': 'Wed', 'event': 'Eid ul-Fitr Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '16 May', 'day': 'Mon', 'event': 'Buddha Purnima'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '9 Jul', 'day': 'Sat', 'event': 'Eid ul-Adha Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '\n', 'day': None, 'event': None}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '10 Jul', 'day': 'Sun', 'event': 'Eid ul-Adha'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '11 Jul', 'day': 'Mon', 'event': 'Eid ul-Adha Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '9 Aug', 'day': 'Tue', 'event': 'Ashura'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '15 Aug', 'day': 'Mon', 'event': 'National Mourning Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '19 Aug', 'day': 'Fri', 'event': 'Shuba Janmashtami'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '5 Oct', 'day': 'Wed', 'event': 'Vijaya Dashami'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '9 Oct', 'day': 'Sun', 'event': 'Eid-e-Milad un-Nabi'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '16 Dec', 'day': 'Fri', 'event': 'Victory Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
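
Note the {'date': '\n', 'day': None, 'event': None} item in the log above. It seems to come from a row that has no regular data cells (which kind of row is an assumption here, since the page markup is not shown). A minimal sketch of one way to drop such items, using two hypothetical helpers, is_blank_row and drop_blank_rows, that are not part of Scrapy:

```python
def is_blank_row(item):
    """Return True when the item's 'date' is missing or only whitespace."""
    date = item.get("date")
    return date is None or not date.strip()


def drop_blank_rows(items):
    """Yield only the items that carry real date text (hypothetical helper)."""
    for item in items:
        if not is_blank_row(item):
            yield item


# Three items copied from the log above, including the blank one.
scraped = [
    {"date": "9 Jul", "day": "Sat", "event": "Eid ul-Adha Holiday"},
    {"date": "\n", "day": None, "event": None},
    {"date": "10 Jul", "day": "Sun", "event": "Eid ul-Adha"},
]

cleaned = list(drop_blank_rows(scraped))
print(cleaned)
```

Inside the spider, the same check could probably be applied just before the yield, skipping the row with continue when is_blank_row returns True.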
daupos2t  #2

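import scrapy


class QuestionSpider(scrapy.Spider):
    name = 'question'
    allowed_domains = ['publicholidays.com.bd']
    start_urls = ['https://publicholidays.com.bd/2022-dates/']

    # parse must be a method of the spider class, otherwise Scrapy never calls it.
    def parse(self, response):
        # Walk every table row; the last row is skipped, as in the original slice.
        for a in response.xpath("//table//tr")[:-1]:
            date = a.xpath("./td[1]/text()").get()
            # Skip rows without a usable first cell (header or blank rows).
            if date is None or not date.strip():
                continue
            item = {}  # fresh dict per row so yielded items stay independent
            item["date"] = date
            item["day"] = a.xpath("./td[2]/text()").get()
            # The holiday name is in a link when present, otherwise in a span.
            if a.xpath(".//a/text()").get() is not None:
                item["holiday"] = a.xpath(".//a/text()").get()
            else:
                item["holiday"] = a.xpath(".//span/text()").get()

            # Yield instead of print so Scrapy can collect and export the item.
            yield item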
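
Both answers come down to looping over rows rather than over tables. The sketch below illustrates that difference without Scrapy or a network request: it uses the standard library's xml.etree.ElementTree on a small hand-written table (a stand-in, not the real page markup), and the event_text helper is hypothetical. The same shape should apply to response.css('table') versus response.css('tbody tr').

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the holiday table (not the real page markup).
PAGE = """
<html><body>
<table>
  <tbody>
    <tr><td>21 Feb</td><td>Mon</td><td><a href="#">Shaheed Day</a></td></tr>
    <tr><td>26 Mar</td><td>Sat</td><td><a href="#">Independence Day</a></td></tr>
    <tr><td>16 Dec</td><td>Fri</td><td><span>Victory Day</span></td></tr>
  </tbody>
</table>
</body></html>
"""

root = ET.fromstring(PAGE)


def event_text(cell):
    """Return the text of the cell's link, or of its span if there is no link."""
    node = cell.find("a")
    if node is None:
        node = cell.find("span")
    return node.text if node is not None else None


# Looping over tables: one pass per table, so only the first row's cells are read.
per_table = []
for table in root.findall(".//table"):
    cells = list(table.iter("td"))
    per_table.append({
        "date": cells[0].text,
        "day": cells[1].text,
        "event": event_text(cells[2]),
    })

# Looping over rows: one pass, and therefore one item, per row.
per_row = []
for row in root.findall(".//tbody/tr"):
    cells = row.findall("td")
    per_row.append({
        "date": cells[0].text,
        "day": cells[1].text,
        "event": event_text(cells[2]),
    })

print(len(per_table), "item(s) when looping over tables")
print(len(per_row), "item(s) when looping over rows")
```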
