Python: for some reason, write() no longer writes strings to my .txt file

vbkedwbf asked on 2022-10-30 in Python

So I started using Python a few days ago, and now I've tried to write a function that collects all the subpages of a website. I know it's probably not the most elegant function, but I was quite proud to see it working. For some reason I can't figure out, it no longer works. I could swear I haven't changed the function since it last worked, but after hours of debugging I'm slowly starting to doubt myself. Can you see why my function no longer writes anything to the .txt file? I just get an empty text file - although if I delete it, at least a new (empty) file gets created.
I tried moving the part that saves the string out of the try block, without success. I also tried all_urls.flush() to force everything to be saved. I restarted my computer in case something in the background had the file open and was preventing me from writing to it. I also renamed the file it is supposed to save to, so that something genuinely fresh would be created. Still the same problem. I also checked that link in the loop is given as a string, so that shouldn't be the issue. I also tried:

print(link, file=all_urls, end='\n')

as an alternative to

all_urls.write(link)
all_urls.write('\n')

Neither made any difference.
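
For completeness, the flush() attempt mentioned above looked roughly like this (just a minimal sketch of the pattern, with a placeholder link):

link = 'https://example.com/'
all_urls = open('all_urls.txt', 'w')
all_urls.write(link)
all_urls.write('\n')
all_urls.flush()   # force the buffered text out to the file right away
all_urls.close()   # in the real function this only happens after the loop
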
My whole function:

def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    tested_links = []
    to_test_links = links
    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')
    while len(to_test_links)>0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                print(type(link))
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the -txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage link to templinks
                for sublink in soup.findAll('a'):
                    templinks.append(sublink.get('href'))
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))

                for templink in templinks:
                    # make sure we have still the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    if templink.find(url) == 0 and templink not in links:
                        links.append(templink)

                #and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = (list(set(links) ^ set(tested_links)))
            except:
                # Save it to the ERROR -txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links

tpxzln5u 1#

I can't reproduce the problem, but I have had file-handling errors in the past that I couldn't explain (at least not to myself), and they went away once I did the writing from inside a with block.
[Just make sure to first remove the lines involving all_urls from your current code, just in case - or try this with a different file name while you check whether it works.]
Since you append all the URLs to tested_links anyway, you could write them all in one go after the while loop:

with open('all_urls.txt', 'w') as f:
    f.write('\n'.join(tested_links) + '\n')

Or, if you have to write link by link, you can append by opening the file with mode='a':


# before the while, if you're not sure the file exists
# [and/or to clear previous data from file]
# with open('all_urls.txt', 'w') as f: f.write('')

                # and inside the try block:
                with open('all_urls.txt', 'a') as f:
                    f.write(f'{link}\n')
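
Another option along the same lines (just a sketch, keeping the rest of your function unchanged) is to open both files in one with statement around the whole loop, so they get flushed and closed even if the function exits early:

with open('all_urls.txt', 'w') as all_urls, \
     open('problematic_pages.txt', 'w') as problematic_pages:
    while len(to_test_links) > 0:
        for link in to_test_links:
            # ... same loop body as before, minus the explicit close() calls
            all_urls.write(link + '\n')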

ldioqlga 2#

Not a direct answer, but this happened to me in my early days as well. Python's requests module (and urllib, which your function uses) sends requests with headers that identify Python; a website can detect that very quickly, your IP may get blocked, and you start getting unusual responses - which would explain why your previously working function no longer works.

Solution:

Use natural (browser-like) request headers - see the code below:

import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
r = requests.get(URL, headers=headers)

If your IP is being blocked, using proxies is strongly recommended.
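
Since the function in the question uses urllib.request rather than requests, here is a rough sketch of how the same ideas carry over there (the User-Agent string and the proxy address below are only placeholders):

from urllib.request import Request, ProxyHandler, build_opener

link = 'https://example.com/'   # placeholder; in the function this is the link being tested
# browser-like User-Agent instead of the default Python one
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
req = Request(link, headers=headers)

# optional: route the request through a proxy (address is a placeholder)
proxy = ProxyHandler({'http': 'http://127.0.0.1:8080',
                      'https': 'http://127.0.0.1:8080'})
opener = build_opener(proxy)
html_page = opener.open(req)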


0ve6wy6x 3#

Here is your script with some minor changes; the changes are marked with (****************):

def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]

    #*******************added sublinks_list variable ******************
    sublinks_list = []

    tested_links = []
    to_test_links = links
    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')
    while len(to_test_links)>0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the -txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage link to templinks
                sublinks = soup.findAll('a')

                for sublink in sublinks:

                    #templinks.append(sublink.get('href'))*****************changed the line with next row*****************
                    templinks.append(sublink['href'])

                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))

                for templink in templinks:
                    # make sure we have still the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates

                    #if templink.find(url) == 0 and templink not in links:*******************changed the line with next row*****************
                    if templink not in sublinks_list:

                        #links.append(templink)    *******************changed the line with next row*****************
                        sublinks_list.append(templink)

                        all_urls.write(templink + '\n')     #*******************added this line*****************

                #and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = (list(set(links) ^ set(tested_links)))

            except:
                # Save it to the ERROR -txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')

    all_urls.close()
    problematic_pages.close()
    return links

lnks = get_subpages('https://www.jhanley.com/blog/pyscript-creating-installable-offline-applications/')  #  #*******************url used for testing*****************

It works fine - there are more than 180 links in the file. Please test it yourself. There is still some awkward and questionable syntax in it, so you should thoroughly test your code again - but the part that writes the links to the file works correctly.
Regards.
