How can I save the output of a Python Scrapy CrawlSpider to separate txt files?

Posted by bf1o4zei on 2022-11-09 in Python

I have a list of websites in a csv file, as follows:

id, url
100, example1.com
200, example2.com
300, example3.com
...

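I am trying to write a Python Scrapy CrawlSpider that downloads all the text from each website. I need to save the text of each website to a separate txt file named after its id, such as 100.txt and 200.txt, for further text analysis. Here is my Scrapy code: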

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import pandas as pd

df = pd.read_csv('Test2.csv')
df['id'] = df['id'].apply(str)

class Hosp2Spider(CrawlSpider):
    name = 'hosp2'

    def __init__(self, url=None, *args,**kwargs):
        for index, row in df.iterrows():
            url = row['url']
            super(Hosp2Spider, self).__init__(*args,**kwargs)
            self.allowed_domains = [url]
            self.start_urls = ["http://" + url]

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        raw = response.xpath('//body//text()').extract()
        out = ','.join(raw)

        for index, row in df.iterrows():
            with open(row['id'] + '.txt', 'a+', encoding='utf-8') as f_out:
                out = f_out.write(out)

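Note: I am using "follow=False" for testing purposes. After debugging, I will change it to follow=True to crawl the whole site.
I get the error message "out = f_out.write(out) TypeError: write() argument must be str, not int". Only the first two txt files are generated, but the first file (100.txt) contains the text of the second website (example2.com), and the second file (200.txt) is empty. How can I fix this? Any suggestions are much appreciated. Thanks.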
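The TypeError can be reproduced without Scrapy. In text mode, `write()` returns the number of characters written, so `out = f_out.write(out)` replaces the text with an int, and the next loop iteration passes that int to `write()`. The snippet below is a minimal sketch of that behavior; the temporary directory and the sample string are assumptions made only for the demonstration.

```python
import os
import tempfile

results = []

# Throwaway directory so the demo does not touch real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    out = "some page text"
    for file_id in ["100", "200"]:
        path = os.path.join(tmp_dir, file_id + ".txt")
        with open(path, "a+", encoding="utf-8") as f_out:
            try:
                # Same pattern as in the question: the return value of
                # write() (a character count) overwrites the text.
                out = f_out.write(out)
                results.append(repr(out))
            except TypeError as exc:
                results.append("TypeError: " + str(exc))

# First entry is the character count, second entry is the error.
print(results)
```

Calling `f_out.write(out)` without assigning the result back to `out` keeps the text intact for later iterations.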
Update (April 1, 2022):
I updated the code and got rid of the "TypeError: write() argument must be str, not int" by changing the loop to the following.

for index, row in df.iterrows():
    if isinstance(out, str):
        with open(row['id'] + '.txt', 'a+', encoding='utf-8') as f_out:
            out = f_out.write(out)
    else:
        pass

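Now 100.txt, 200.txt, and 300.txt are all created. However, all of the meaningful output ends up in 100.txt.
200.txt and 300.txt only contain some numbers, such as "0161480161480161480161480161480161480189162891628916289162891628916289162891628916289 ......". How can I save the extracted text to the corresponding .txt file? Thanks.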
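One possible direction, sketched here with the standard library only, is to choose the output file from the URL of the page being parsed instead of looping over every row of the csv. The helper names `load_domain_ids` and `save_page_text` are hypothetical and not part of Scrapy; the sketch also assumes that the `url` column holds bare domains, as in the sample above, and that a leading `www.` can be ignored.

```python
import csv
import io
import os
import tempfile
from urllib.parse import urlparse


def load_domain_ids(csv_file):
    """Return {domain: id} from a CSV with 'id' and 'url' columns (hypothetical helper)."""
    # skipinitialspace handles the space after each comma in "id, url".
    reader = csv.DictReader(csv_file, skipinitialspace=True)
    return {row["url"].strip(): row["id"].strip() for row in reader}


def save_page_text(page_url, text, domain_ids, out_dir="."):
    """Append text to <id>.txt for the site page_url belongs to (hypothetical helper)."""
    host = urlparse(page_url).netloc
    if host.startswith("www."):
        host = host[4:]
    file_id = domain_ids.get(host)
    if file_id is None:
        return None  # page is not from a listed site
    path = os.path.join(out_dir, file_id + ".txt")
    with open(path, "a", encoding="utf-8") as f_out:
        f_out.write(text + "\n")
    return path


# Small demonstration with in-memory CSV data and a temporary directory.
sample_csv = io.StringIO("id, url\n100, example1.com\n200, example2.com\n")
domain_ids = load_domain_ids(sample_csv)
with tempfile.TemporaryDirectory() as tmp_dir:
    save_page_text("http://example1.com/about", "text from site 1", domain_ids, tmp_dir)
    save_page_text("http://example2.com/", "text from site 2", domain_ids, tmp_dir)
    written = sorted(os.listdir(tmp_dir))
print(written)
```

Inside the spider, `parse_item` could then call something like `save_page_text(response.url, out, domain_ids)`. Separately, the `__init__` loop in the question overwrites `allowed_domains` and `start_urls` on every iteration, so only the last row of the csv survives; building both lists from all rows is probably needed as well.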

ctzwtxfj #1

I use this to create a file of usernames. It is simple, but it works.


def main():
    # get the file names
    infileName = input("What file are the names in? ")
    outfileName = input("What file should the usernames go in? ")

    # open both files; the with block closes them automatically
    with open(infileName, "r") as infile, open(outfileName, "w") as outfile:
        # process each line of the input file
        for line in infile:
            # get the first and last names from line
            first, last = line.split()
            # create the username
            uname = (first[0] + last[:7]).lower()
            # write it to the output file
            print(uname, file=outfile)

    print("Usernames have been written to", outfileName)


main()
