Python: how to extract all URLs from a website with BeautifulSoup

xfb7svmp · asked 12 months ago · Python

I'm working on a project where I need to extract all the links from a website. With this code I get all the links from a single URL:

import requests
from bs4 import BeautifulSoup, SoupStrainer

# fetch the page and parse it with lxml
source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
links = []

# collect every <a> tag on the page
for link in soup.find_all('a'):
    links.append(str(link))

The problem is that if I want to extract all the URLs, I have to write another for loop, and then another one, and so on. I want to extract every URL that exists on this site and on its subdomains. Is there a way to do this without writing nested loops? Even if I did write nested for loops, I wouldn't know how many of them I'd need to reach all the URLs.


ruarlubt1#

Wow, it took me about 30 minutes to find a solution, and I found a simple but effective way to do it. As @α-α м яιcαη mentioned, sometimes if your site links out to a big site like Google, the crawl won't stop until your memory fills up with data. So there are a few steps you should consider:
1. Use a while loop that crawls your site and extracts all the URLs.
2. Use exception handling to prevent crashes.
3. Remove duplicates and separate out the URLs.
4. Set a limit on the number of URLs (say, when 1000 URLs have been found) and stop the while loop, so your computer's memory doesn't fill up.
Here is some sample code that should work fine. I actually tested it, and it was fun to do:

import requests
from bs4 import BeautifulSoup
import re
import time

source_code = requests.get('https://stackoverflow.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
data = []
links = []

def remove_duplicates(l): # keep only strings that look like URLs
    for item in l:
        match = re.search(r"(?P<url>https?://[^\s]+)", item)
        if match is not None:
            links.append(match.group("url"))

for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))
flag = True
remove_duplicates(data)
while flag:
    try:
        for link in links:
            for j in soup.find_all('a', href=True):
                temp = []
                source_code = requests.get(link)
                soup = BeautifulSoup(source_code.content, 'lxml')
                temp.append(str(j.get('href')))
                remove_duplicates(temp)

                if len(links) > 162: # set limitation to number of URLs
                    break
            if len(links) > 162:
                break
        if len(links) > 162:
            break
    except Exception as e:
        print(e)
        if len(links) > 162:
            break

for url in links:
    print(url)

The output will be:

https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackexchange.com/sites
https://stackoverflow.blog
https://stackoverflow.com/legal/cookie-policy
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/legal/terms-of-service/public
https://stackoverflow.com/teams
https://stackoverflow.com/teams
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/questions/55884514/what-is-the-incentive-for-curl-to-release-the-library-for-free/55885729#55885729
https://insights.stackoverflow.com/
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/jobs
https://stackoverflow.com/jobs/directory/developer-jobs
https://stackoverflow.com/jobs/salary
https://www.stackoverflowbusiness.com
https://stackoverflow.com/teams
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/enterprise
https://stackoverflow.com/company/about
https://stackoverflow.com/company/about
https://stackoverflow.com/company/press
https://stackoverflow.com/company/work-here
https://stackoverflow.com/legal
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/company/contact
https://stackexchange.com
https://stackoverflow.com
https://serverfault.com
https://superuser.com
https://webapps.stackexchange.com
https://askubuntu.com
https://webmasters.stackexchange.com
https://gamedev.stackexchange.com
https://tex.stackexchange.com
https://softwareengineering.stackexchange.com
https://unix.stackexchange.com
https://apple.stackexchange.com
https://wordpress.stackexchange.com
https://gis.stackexchange.com
https://electronics.stackexchange.com
https://android.stackexchange.com
https://security.stackexchange.com
https://dba.stackexchange.com
https://drupal.stackexchange.com
https://sharepoint.stackexchange.com
https://ux.stackexchange.com
https://mathematica.stackexchange.com
https://salesforce.stackexchange.com
https://expressionengine.stackexchange.com
https://pt.stackoverflow.com
https://blender.stackexchange.com
https://networkengineering.stackexchange.com
https://crypto.stackexchange.com
https://codereview.stackexchange.com
https://magento.stackexchange.com
https://softwarerecs.stackexchange.com
https://dsp.stackexchange.com
https://emacs.stackexchange.com
https://raspberrypi.stackexchange.com
https://ru.stackoverflow.com
https://codegolf.stackexchange.com
https://es.stackoverflow.com
https://ethereum.stackexchange.com
https://datascience.stackexchange.com
https://arduino.stackexchange.com
https://bitcoin.stackexchange.com
https://sqa.stackexchange.com
https://sound.stackexchange.com
https://windowsphone.stackexchange.com
https://stackexchange.com/sites#technology
https://photo.stackexchange.com
https://scifi.stackexchange.com
https://graphicdesign.stackexchange.com
https://movies.stackexchange.com
https://music.stackexchange.com
https://worldbuilding.stackexchange.com
https://video.stackexchange.com
https://cooking.stackexchange.com
https://diy.stackexchange.com
https://money.stackexchange.com
https://academia.stackexchange.com
https://law.stackexchange.com
https://fitness.stackexchange.com
https://gardening.stackexchange.com
https://parenting.stackexchange.com
https://stackexchange.com/sites#lifearts
https://english.stackexchange.com
https://skeptics.stackexchange.com
https://judaism.stackexchange.com
https://travel.stackexchange.com
https://christianity.stackexchange.com
https://ell.stackexchange.com
https://japanese.stackexchange.com
https://chinese.stackexchange.com
https://french.stackexchange.com
https://german.stackexchange.com
https://hermeneutics.stackexchange.com
https://history.stackexchange.com
https://spanish.stackexchange.com
https://islam.stackexchange.com
https://rus.stackexchange.com
https://russian.stackexchange.com
https://gaming.stackexchange.com
https://bicycles.stackexchange.com
https://rpg.stackexchange.com
https://anime.stackexchange.com
https://puzzling.stackexchange.com
https://mechanics.stackexchange.com
https://boardgames.stackexchange.com
https://bricks.stackexchange.com
https://homebrew.stackexchange.com
https://martialarts.stackexchange.com
https://outdoors.stackexchange.com
https://poker.stackexchange.com
https://chess.stackexchange.com
https://sports.stackexchange.com
https://stackexchange.com/sites#culturerecreation
https://mathoverflow.net
https://math.stackexchange.com
https://stats.stackexchange.com
https://cstheory.stackexchange.com
https://physics.stackexchange.com
https://chemistry.stackexchange.com
https://biology.stackexchange.com
https://cs.stackexchange.com
https://philosophy.stackexchange.com
https://linguistics.stackexchange.com
https://psychology.stackexchange.com
https://scicomp.stackexchange.com
https://stackexchange.com/sites#science
https://meta.stackexchange.com
https://stackapps.com
https://api.stackexchange.com
https://data.stackexchange.com
https://stackoverflow.blog?blb=1
https://www.facebook.com/officialstackoverflow/
https://twitter.com/stackoverflow
https://linkedin.com/company/stack-overflow
https://creativecommons.org/licenses/by-sa/4.0/
https://stackoverflow.blog/2009/06/25/attribution-required/
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising

Process finished with exit code 0

I set the limit to 162; you can increase it as much as you want, as far as your RAM allows.


vsmadaxz2#

How about this?

import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

source_code = requests.get('https://stackoverflow.com/')
doc = SimplifiedDoc(source_code.content.decode('utf-8'))  # incoming HTML string
lst = doc.listA(url='https://stackoverflow.com/')  # get all links
for a in lst:
    if a['url'].find('stackoverflow.com') > 0:  # sub domains
        print(a['url'])

You can also use this crawling framework; it can help you with a lot of things:

from simplified_scrapy.spider import Spider, SimplifiedDoc
class DemoSpider(Spider):
  name = 'demo-spider'
  start_urls = ['http://www.example.com/']
  allowed_domains = ['example.com/']
  def extract(self, url, html, models, modelNames):
    doc = SimplifiedDoc(html)
    lstA = doc.listA(url=url["url"])
    return [{"Urls": lstA, "Data": None}]

from simplified_scrapy.simplified_main import SimplifiedMain
SimplifiedMain.startThread(DemoSpider())

0tdrvxhp3#

I implemented this for my task; it can extract all URLs from all pages, including the pages inside them. It is a recursive function:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://stackoverflow.com"
links = []

def extract_links(url):
    print("source url", url)
    global links
    source_url = requests.get(url)
    soup = BeautifulSoup(source_url.content, "html.parser")
    for link in soup.find_all('a', href=True):
        try:
            if len(links) >= 100:  # stop once 100 links have been collected
                return
            if link.get('href').startswith("https://") and link.get("href") not in links:
                links.append(link.get('href'))
                extract_links(link.get('href'))  # recurse into the newly found page

        except Exception as e:
            print("Unhandled exception", e)

extract_links(url)
df = pd.DataFrame({"links": links})
df.to_csv("links.csv")

Let me explain how it works:
1. It goes to the first page, stackoverflow.com, and the links found there are appended to the links list.
2. Every link found on the main page is then passed to the function as its url argument. That way it fetches all the URLs/links on every page and appends them to the links list; the not in links check helps avoid duplicate links.
3. If there are no more links to fetch on a page, it goes back to the parent page, takes the next link, and recursively walks through all the pages inside that link (an iterative sketch of the same idea follows after this list).
4. Finally, when there are no more links to parse, all the links are stored in a DataFrame.
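
One caveat with the recursive version above: with a much larger link limit, long chains of pages can hit Python's default recursion limit. A minimal iterative sketch of the same idea, using an explicit list of pending URLs instead of recursion (my own variation, not the answerer's code; the start URL and the limit of 100 are just example values):

import requests
from bs4 import BeautifulSoup

def extract_links_iterative(start_url, limit=100):
    links = []
    pending = [start_url]                  # URLs waiting to be visited
    while pending and len(links) < limit:
        url = pending.pop()
        try:
            resp = requests.get(url, timeout=10)
            soup = BeautifulSoup(resp.content, "html.parser")
        except Exception as e:
            print("Failed to fetch", url, e)
            continue
        for a in soup.find_all('a', href=True):
            href = a.get('href')
            if href.startswith("https://") and href not in links:
                links.append(href)         # record the link
                pending.append(href)       # and schedule it for crawling
    return links

for url in extract_links_iterative("https://stackoverflow.com"):
    print(url)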


u2nhd7ah4#

Well, what you're asking for is actually possible, but it means an endless loop that will keep running until your memory goes BoOoOoOm.
Anyway, the idea should look something like the following.

  • You would use for item in soup.findAll('a') and then item.get('href').
  • Add the results to a set to get rid of duplicate URLs, together with an is not None condition to get rid of None objects.
  • Then keep looping until your set of pending URLs becomes empty, i.e. something like len(urls) == 0 (a minimal sketch follows below).
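
A minimal sketch of that idea, based on my own reading of the steps above (the start URL and the cap of 500 pages are just example values to keep it from running forever, not something this answer specifies):

import requests
from bs4 import BeautifulSoup

visited = set()                                  # URLs already fetched
urls = {"https://stackoverflow.com"}             # set of URLs still to visit

while len(urls) > 0 and len(visited) < 500:      # loop until the set is empty (or the cap is hit)
    url = urls.pop()
    visited.add(url)
    try:
        soup = BeautifulSoup(requests.get(url, timeout=10).content, 'lxml')
    except Exception:
        continue                                 # skip pages that fail to download or parse
    for item in soup.findAll('a'):
        href = item.get('href')
        if href is not None and href.startswith("http") and href not in visited:
            urls.add(href)                       # the set takes care of duplicates

for url in sorted(visited):
    print(url)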
