3 Answers

zujrkrfu1#

Why do you have to check your URLs manually? You can use urllib.robotparser in Python 3 and do something like this:

import urllib.robotparser as urobot
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
url = "example.com"
rp = urobot.RobotFileParser()
rp.set_url(url + "/robots.txt")
rp.read()
if rp.can_fetch("*", url):
    site = urllib.request.urlopen(url)
    sauce = site.read()
    soup = BeautifulSoup(sauce, "html.parser")
    # base URL of the page that was actually served (after any redirects)
    actual_url = site.geturl()
    my_list = soup.find_all("a", href=True)
    for i in my_list:
        # rather than != "#" you can filter the list before looping over it
        if i["href"] != "#":
            # resolve the link against the page it was found on
            newurl = urllib.parse.urljoin(actual_url, i["href"])
            try:
                if rp.can_fetch("*", newurl):
                    site = urllib.request.urlopen(newurl)
                    # do what you want on each authorized webpage
            except Exception:
                # skip links that cannot be fetched
                pass
else:
    print("cannot scrape")

6tdlim6h2#

You can use the curl command to read the robots.txt file into a single string, then split it on newlines and collect the allowed and disallowed URLs:

import os
result = os.popen("curl https://fortune.com/robots.txt").read()
result_data_set = {"Disallowed":[], "Allowed":[]}
for line in result.split("\n"):
    if ': ' not in line:
        # skip blank lines, comments, and rules without a value
        continue
    if line.startswith('Allow'):        # this is an allowed URL rule
        # keep only the path, ignoring trailing comments or other junk
        result_data_set["Allowed"].append(line.split(': ')[1].split(' ')[0])
    elif line.startswith('Disallow'):   # this is a disallowed URL rule
        result_data_set["Disallowed"].append(line.split(': ')[1].split(' ')[0])

print(result_data_set)
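
If you then want to test a specific path against the parsed rules, a rough check is simple prefix matching. is_allowed below is a hypothetical helper, not part of the answer above, and it ignores wildcards, rule precedence, and per-user-agent groups, which urllib.robotparser handles for you:

def is_allowed(path, rules=result_data_set):
    # naive prefix matching against the parsed Allow/Disallow rules
    if any(rule and path.startswith(rule) for rule in rules["Allowed"]):
        return True
    return not any(rule and path.startswith(rule) for rule in rules["Disallowed"])

print(is_allowed("/"))           # check the site root
print(is_allowed("/some/path"))  # check an arbitrary path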

qyyhg6bp3#

Actually, RobotFileParser can do the job; consider the code from my post on medium.
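
The general RobotFileParser pattern looks roughly like the minimal sketch below; the URL and paths are placeholders, not the code from the linked post:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

# ask whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "https://example.com/"))
print(rp.can_fetch("*", "https://example.com/some/page"))

# crawl-delay and request-rate, if robots.txt declares them (None otherwise)
print(rp.crawl_delay("*"))
print(rp.request_rate("*"))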