How to save Python Scrapy CrawlSpider output to separate txt files?

Posted by bf1o4zei on 2022-11-09 in Python

I have a list of websites in a CSV file, as follows:

id, url
100, example1.com
200, example2.com
300, example3.com
...

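I am trying to write a Python Scrapy CrawlSpider that downloads all of the text from each website. I need to save each website's text to a separate txt file named after its id, such as 100.txt and 200.txt, for further text analysis. Here is my Scrapy code: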

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import pandas as pd

df = pd.read_csv('Test2.csv')
df['id'] = df['id'].apply(str)

class Hosp2Spider(CrawlSpider):
    name = 'hosp2'

    def __init__(self, url=None, *args,**kwargs):
        for index, row in df.iterrows():
            url = row['url']
            super(Hosp2Spider, self).__init__(*args,**kwargs)
            self.allowed_domains = [url]
            self.start_urls = ["http://" + url]

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        raw = response.xpath('//body//text()').extract()
        out = ','.join(raw)

        for index, row in df.iterrows():
            with open(row['id'] + '.txt', 'a+', encoding='utf-8') as f_out:
                out = f_out.write(out)

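Note: I am using "follow=False" for testing purposes. After debugging, I will change it to follow=True to cover the whole site.
I get the error message "out = f_out.write(out) TypeError: write() argument must be str, not int". Only the first two txt files were generated, but the first file (100.txt) contains the text of the second site (example2.com), and the second file (200.txt) is empty. How can I fix this? Any suggestions are much appreciated. Thanks.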
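
The TypeError follows from what write() returns. Below is a minimal sketch that uses in-memory io.StringIO buffers in place of real files (an assumption made only to keep the example self-contained). In text mode, write() returns the number of characters written, so assigning its result back to out turns out into an int, and the next loop iteration then passes that int to write().

```python
import io

out = "some page text"
buffers = [io.StringIO(), io.StringIO()]  # stand-ins for 100.txt and 200.txt

error_message = None
for buf in buffers:
    try:
        # write() returns the number of characters written, not the text,
        # so this assignment replaces the string with an int
        out = buf.write(out)
    except TypeError as exc:
        # raised on the second pass, when out is already an int
        error_message = str(exc)

first_contents = buffers[0].getvalue()   # the full text
second_contents = buffers[1].getvalue()  # nothing was written here
```

Calling f_out.write(out) without assigning the result keeps out as a string for every iteration.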
Update (April 1, 2022)
I updated the code and resolved the "TypeError: write() argument must be str, not int" by changing it to the following:

for index, row in df.iterrows():
    if isinstance(out, str):
        with open(row['id'] + '.txt', 'a+', encoding='utf-8') as f_out:
            out = f_out.write(out)
    else:
        pass

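Now 100.txt, 200.txt, and 300.txt are all created. However, all of the meaningful output is in 100.txt.
200.txt and 300.txt contain only digits such as "0161480161480161480161480161480161480189162891628916289162891628916289162891628916289 ......". How can I save the extracted text to the corresponding .txt file? Thanks.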
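
One possible way to route each page to its own file is to key on the host of the response URL instead of looping over every CSV row inside the callback. The sketch below is not from the original post: the names DOMAIN_TO_ID, id_for_url, and save_page_text are hypothetical, and the mapping is hard-coded from the sample rows rather than read from Test2.csv.

```python
import os
import tempfile
from urllib.parse import urlparse

# Hypothetical mapping built from the CSV rows: domain -> id
DOMAIN_TO_ID = {
    "example1.com": "100",
    "example2.com": "200",
    "example3.com": "300",
}


def id_for_url(url, domain_to_id=DOMAIN_TO_ID):
    """Return the id whose domain matches the URL's host, or None."""
    host = (urlparse(url).hostname or "").lower()
    for domain, site_id in domain_to_id.items():
        # match the domain itself and any subdomain such as www.example1.com
        if host == domain or host.endswith("." + domain):
            return site_id
    return None


def save_page_text(url, text_nodes, out_dir="."):
    """Append the page text to <id>.txt for the site the URL belongs to."""
    site_id = id_for_url(url)
    if site_id is None:
        return None  # page is not from a listed site, so skip it
    path = os.path.join(out_dir, site_id + ".txt")
    with open(path, "a", encoding="utf-8") as f_out:
        # the return value of write() is deliberately not assigned
        f_out.write(" ".join(text_nodes) + "\n")
    return path


# Usage sketch: a temporary directory stands in for the output folder
with tempfile.TemporaryDirectory() as tmp:
    saved = save_page_text("http://www.example2.com/about", ["Hello", "world"], out_dir=tmp)
    saved_name = os.path.basename(saved)
    with open(saved, encoding="utf-8") as f_in:
        saved_text = f_in.read()
```

Inside parse_item, this could be called as save_page_text(response.url, response.xpath('//body//text()').extract()), assuming the spider's start_urls and allowed_domains list every row of the CSV rather than being overwritten on each loop iteration in __init__.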

scrapy

Source: https://stackoverflow.com/questions/71544142/how-to-save-python-scrapy-crawlspider-outputs-to-separate-txt-files

1 Answer

ctzwtxfj1#

I used this to create a file of usernames. It's simple, but it works.


def main():
    # get the file names
    infileName = input("What file are the names in? ")
    outfileName = input("What file should the usernames go in? ")

    # open both files; the with block closes them automatically
    with open(infileName, "r") as infile, open(outfileName, "w") as outfile:
        # process each line of the input file
        for line in infile:
            # get the first and last names from line
            first, last = line.split()
            # create the username
            uname = (first[0] + last[:7]).lower()
            # write it to the output file
            print(uname, file=outfile)

    print("Usernames have been written to", outfileName)

main()
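
The same read-a-file pattern can be pointed at the question's CSV. This is a rough sketch, assuming the file looks like the sample in the question (a header of id, url with a space after each comma). It uses the standard-library csv module instead of pandas, and load_domain_to_id is a hypothetical helper name.

```python
import csv
import io


def load_domain_to_id(csv_file):
    """Read rows of `id, url` and return a dict mapping each url to its id."""
    # skipinitialspace drops the blank that follows each comma in the sample
    reader = csv.DictReader(csv_file, skipinitialspace=True)
    return {row["url"].strip(): row["id"].strip() for row in reader}


# In-memory stand-in for Test2.csv, using the rows from the question
sample = io.StringIO(
    "id, url\n"
    "100, example1.com\n"
    "200, example2.com\n"
    "300, example3.com\n"
)
domain_to_id = load_domain_to_id(sample)
```

With a real file, the function could be called as load_domain_to_id(f) inside with open('Test2.csv', newline='', encoding='utf-8') as f. The resulting dict gives the id for each domain, which is what the output file name needs.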
