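I'm new to Scrapy and can't find a proper solution. I'm trying to extract a clean paragraph, but what I get is a list that contains empty values such as ''. How can I remove them in Scrapy using ItemLoader? I've tried my best.
Here is my code: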
import scrapy
from scrapy.loader import ItemLoader
from ..items import RcgroupsItem


class RcgroupSpider(scrapy.Spider):
    name = 'rcgroup'
    allowed_domains = ['rcgroups.com']
    start_urls = ['https://www.rcgroups.com/forums/showthread.php?2911378-DJI-Dashboard-Modding-tips-tricks-and-results-OFFICIAL-THREAD/page2']

    def parse(self, response):
        # Each post on the page is one "card"
        cards = response.xpath("//div[@id='posts']/div[@align='center']")
        for card in cards:
            loader = ItemLoader(item=RcgroupsItem(), selector=card)
            loader.add_xpath('number', ".//div[@class='thead_postbit_right']//a//text()")
            loader.add_xpath('date', ".//div[@class='thead_postbit_left']/span/text()[1]")
            loader.add_xpath('name', ".//div[@class='postbit-name']/a/text()")
            loader.add_xpath('post', ".//div[@class='postbit-content']/text()")
            loader.add_xpath('reply', ".//div[@class='postbit-content']/div//text()")
            yield loader.load_item()
Here is my items.py:
import scrapy
from scrapy.loader.processors import TakeFirst, MapCompose, Join
from w3lib.html import remove_tags


def normalize_space(value):
    # Collapse runs of whitespace into single spaces
    lst = " ".join(value.split())
    return lst


class RcgroupsItem(scrapy.Item):
    number = scrapy.Field(
        output_processor=TakeFirst()
    )
    date = scrapy.Field(
        input_processor=MapCompose(normalize_space),
        output_processor=TakeFirst()
    )
    name = scrapy.Field(
        output_processor=TakeFirst()
    )
    post = scrapy.Field(
        input_processor=MapCompose(normalize_space)
    )
    reply = scrapy.Field(
        input_processor=MapCompose(normalize_space)
    )
Here is settings.py:
BOT_NAME = 'rcgroups'
SPIDER_MODULES = ['rcgroups.spiders']
NEWSPIDER_MODULE = 'rcgroups.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
FEED_EXPORT_ENCODING = 'utf-8'
FEEDS = {
    'output': {
        'format': 'csv',
    }
}
The output I get for post is:
'post': ['',
         'Quad808,',
         '',
         "I think Mad genuinely be pilots decide on the "
         "wisdom of the CopterSafehe's on.",
         '',
         "He's a in all the DJI threads... expect him to be "
         'one here also.',
         '',
         'P.S. Drop me a PM....'],
How can I remove the empty values and turn the list into a proper string?
1 Answer
41zrol4v1#
Try:
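One possible approach, sketched here under a few assumptions: `MapCompose` discards any value for which a function in its chain returns `None`, so a small helper that maps `''` to `None` removes the empty entries, and a `Join(' ')` output processor then merges the remaining fragments into one string. The helper names `drop_empty` and `clean_fragments` are hypothetical and not part of Scrapy. `clean_fragments` only mimics the effect of the two processors in plain Python so that the behaviour can be checked without running a spider.

```python
from typing import Iterable, Optional


def normalize_space(value: str) -> str:
    """Collapse runs of whitespace into single spaces (same helper as in the question)."""
    return " ".join(value.split())


def drop_empty(value: str) -> Optional[str]:
    """Return None for empty strings; MapCompose skips values that become None."""
    return value or None


def clean_fragments(values: Iterable[str]) -> str:
    """Plain-Python stand-in for MapCompose(normalize_space, drop_empty) plus Join(' ')."""
    cleaned = (normalize_space(v) for v in values)
    return " ".join(v for v in cleaned if v)


# Shortened sample of the list shown in the question
raw_post = ["", "Quad808,", "", "P.S. Drop me a PM...."]
print(clean_fragments(raw_post))  # → Quad808, P.S. Drop me a PM....
```

In items.py this would presumably translate to declaring the field as `post = scrapy.Field(input_processor=MapCompose(normalize_space, drop_empty), output_processor=Join(' '))`, and likewise for `reply`. On recent Scrapy releases the processors are imported from `itemloaders.processors` rather than `scrapy.loader.processors`.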