scrapy: How do I stop and exit an async script called by the exec() function?

Posted by 91zkwejq on 2023-08-05 in Other

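I have a file, let's call it bs4_scraper.py. For context:

The scrap function is just an async function that makes asynchronous requests to a website.
get_pids_from_file_generator is just a function that reads a .txt file, adds each line (a pid) to a generator, and returns it.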

async def bs4_scraper():
    limit = Semaphore(8)
    tasks = []
    pids = get_pids_from_file_generator()
    for pid in pids:
        task = create_task(scrap(pid, fake_header(), limit))
        tasks.append(task)
    result = await gather(*tasks)
    return result

if __name__ == "__main__":
    try:
        run(bs4_scraper())
    except Exception as e:
        logger.error(e)

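When I run this with python bs4_scraper.py in the terminal, the function runs and exits gracefully once all the requests have finished. No problem there (I think).
Now I have a separate file, a Scrapy pipeline, that runs at the end of the scraping process: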

class WritePidErrorsPipeline:
    def close_spider(self, spider):
        pid_errors_file = generate_pid_errors_file()
        pg = PostgresDB()
        non_inserted_ids = pg.select_non_inserted_ids(pid_errors_file)
        if non_inserted_ids:
            self.insertion_errors_file(non_inserted_ids)
            bs4_file = os.path.abspath("bs4/bs4_scraper.py")
            exec(open(bs4_file).read()) # THE PROBLEM IS RIGHT HERE
        else:
            logger.info("[SUCCESS]: There are no items missing")

    def insertion_errors_file(
        self,
        non_inserted_ids: List[Tuple[str]],
        output_file: str = "insertion_errors.log",
    ) -> str:
        with open(output_file, "w", encoding="utf-8") as f:
            for non_inserted_id in non_inserted_ids:
                f.write(f"{non_inserted_id[0]}\n")
        return output_file

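The problem occurs on the line exec(open(bs4_file).read()). The file is called and the function runs normally, but when it finishes it does not exit and keeps running after the last successful request. It looks like a zombie process, and I don't know why this happens.
How can I improve this code so that it runs as expected?
Note: sorry for any mistakes in my English.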

scrapy

Source: https://stackoverflow.com/questions/76662647/how-to-stop-and-exit-an-async-script-called-by-exec-function

1 answer

hwamh0ep #1

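Are you sure the file actually *runs*, and hangs only after it finishes? One obvious problem is the guard at the end of your file, if __name__ == "__main__":. This is a code pattern meant to ensure that the guarded part runs *only* when that file (the one containing the if __name__ == "__main__": line) is the main file invoked by Python.
When running Scrapy, IIRC, the main file is some other Scrapy script, which in turn imports the file containing the pipeline. At that point the variable __name__ no longer contains "__main__"; instead it equals the module name (the file name without the .py). If you do not pass a custom globals dictionary as the second argument, the outer __name__ variable simply propagates into the exec body. So, just from looking at your code, what can be said is that the bs4_scraper function is never called.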
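To make the guard behaviour concrete, here is a minimal sketch. The SCRIPT text and the module name "myproject.pipelines" are made up for illustration, and the inherited namespace is simulated by passing a globals dictionary with the corresponding __name__:

```python
# Stand-in for a script that ends with a __main__ guard.
SCRIPT = '''
calls = []

def main():
    calls.append("main ran")

if __name__ == "__main__":
    main()
'''


def run_script(name):
    """exec SCRIPT in a namespace whose __name__ is `name`."""
    namespace = {"__name__": name}
    exec(SCRIPT, namespace)
    return namespace["calls"]


# What the exec body sees when it inherits __name__ from an imported module
# (hypothetical module name): the guard is False and main() is skipped.
skipped = run_script("myproject.pipelines")

# What `python script.py` sees: the guard is True and main() runs.
executed = run_script("__main__")

print(skipped, executed)
```

Only the second call enters the guarded block, which is why the same source can behave differently from the terminal and from inside a pipeline.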
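You also truncated your file and threw away the import statements, which makes it hard to give you a definitive answer. I suppose that in the pipeline file (or the script) you have something like from asyncio import run. Please: these are not optional. Anyone reviewing your code needs them to know what is going on.
Either you have such an import, or the code cannot work the way you describe. So, if the problem is what I had to guess here, you can fix it by setting the __name__ variable to "__main__" in the globals passed to exec. But that brings us to the other side: why run the code this way at all? You are running a Python program, reading a Python file, and issuing a statement to compile it from text so that the code can run, when you could simply import the file and call the function.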
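As a rough sketch of that workaround (not a recommendation), the snippet below writes a throwaway script to a temporary directory and runs it twice: once with exec and an explicit globals dictionary, and once with the standard library's runpy.run_path, which takes a run_name argument for the same purpose. The script contents and the file name standalone_script.py are assumptions for the demo:

```python
import runpy
import tempfile
from pathlib import Path

# Throwaway script with a __main__ guard around an asyncio entry point.
SCRIPT = '''
import asyncio

async def work():
    await asyncio.sleep(0)
    return "done"

result = None
if __name__ == "__main__":
    result = asyncio.run(work())
'''

with tempfile.TemporaryDirectory() as tmp:
    script_path = Path(tmp) / "standalone_script.py"
    script_path.write_text(SCRIPT, encoding="utf-8")

    # Option 1: exec with a globals dict whose __name__ is "__main__".
    exec_globals = {"__name__": "__main__"}
    exec(script_path.read_text(encoding="utf-8"), exec_globals)

    # Option 2: runpy runs the file under the given name and
    # returns the resulting module globals.
    runpy_globals = runpy.run_path(str(script_path), run_name="__main__")

print(exec_globals["result"], runpy_globals["result"])
```

Both variants make the guard evaluate to True, so the guarded asyncio.run call executes.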
So you can fix it by making your code behave like a program, rather than forcing a file to be read as "text" and exec-ed:

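import asyncio
import sys
from pathlib import Path

class WritePidErrorsPipeline:
    def close_spider(self, spider):
        ...  # same code as in the question up to this point
        if non_inserted_ids:
            self.insertion_errors_file(non_inserted_ids)
            # Make the bs4 directory importable instead of exec-ing its source.
            # sys.path holds strings, so compare and insert the str form.
            bs4_dir = str(Path("bs4").absolute())
            if bs4_dir not in sys.path:
                sys.path.insert(0, bs4_dir)
            import bs4_scraper
            result = asyncio.run(bs4_scraper.bs4_scraper())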
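The same import-and-call pattern can be tried outside Scrapy with a self-contained sketch. The module name demo_scraper, its fetch helper, and the pid values are hypothetical stand-ins for bs4_scraper.py and its contents; the request itself is replaced by asyncio.sleep(0):

```python
import asyncio
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for bs4_scraper.py; note the guard at the bottom.
MODULE_SOURCE = '''
import asyncio

async def fetch(pid, limit):
    async with limit:
        await asyncio.sleep(0)  # placeholder for the real request
        return pid.upper()

async def demo_scraper(pids):
    limit = asyncio.Semaphore(8)
    tasks = [asyncio.create_task(fetch(pid, limit)) for pid in pids]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    print(asyncio.run(demo_scraper(["a", "b"])))
'''


def run_imported_scraper(pids):
    """Import the module from a temporary directory and call its coroutine."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo_scraper.py").write_text(MODULE_SOURCE, encoding="utf-8")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("demo_scraper")
            # The guard is skipped on import; the caller decides when to run.
            return asyncio.run(module.demo_scraper(pids))
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_scraper", None)


results = run_imported_scraper(["p1", "p2", "p3"])
print(results)
```

One caveat to check in a real project: asyncio.run raises RuntimeError when an event loop is already running in the same thread, so this sketch assumes close_spider is not itself executing inside a running asyncio loop.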

Answered on 2023-08-05