Debugging empty file output

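I'm very new to this and need your help. I want to write a script that I can generalize for web scraping. So far I have the code below, but it keeps giving me a blank output file. I'd like to be able to easily modify this code to work on any website, and eventually make the search strings a bit more complex. For now, I'm using CNN as the general website and "mccarthy" as the search term, b/c I figure there are definitely articles about him right now. Can you help me get this working?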

#Begin Code
import requests
from bs4 import BeautifulSoup
import docx

# Set the search parameters
search_term = 'mccarthy'  # Set the search term
start_date = '2023-01-04'  # Set the start date (format: YYYY-MM-DD)
end_date = '2023-01-05'  # Set the end date (format: YYYY-MM-DD)
website = 'https://www.cnn.com'  # Set the website to search
document = open('testfile.docx','w')  # Open the existing Word document

# Initialize the list of articles and the page number
articles = []
page_number = 1

# Set the base URL for the search API
search_url = f'{website}/search'

# Set the base URL for the article page
article_base_url = f'{website}/article/'

while articles or page_number == 1:
    # Send a request to the search API
    response = requests.get(search_url, params={'q': search_term, 'from': start_date, 'to': end_date,     'page': page_number})

    # Check if the response is in JSON format
    if response.headers['Content-Type'] == 'application/json':
        # Load the JSON data
        data = response.json()

        # Get the list of articles from the JSON data
        articles = data['articles']
    else:
        # Parse the HTML content of the search results page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all articles on the search results page
        articles = soup.find_all('article')

    # Loop through the articles
    for article in articles:
        # Find the link element
        link_element = article.find('a', class_='title')

        # Extract the link from the link element
        link = link_element['href']

        # Check if the link is a relative URL
        if link.startswith('/'):
            # If the link is relative, convert it to an absolute URL
            link = f'{website}{link}'

        # Add the link to the document
        document.add_paragraph(link)

    # Increment the page number
    page_number += 1

# Save the document
document.close()

I've tried countless iterations, but I deleted them all, so I can't really post any of them here. This keeps giving me a blank output file.

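This won't solve the main problem, but there are a few things to fix:
https://edition.cnn.com/search?q=&from=0&size=10&page=1&sort=newest&types=all&section=
Looking at the URL of the CNN search page, we can see that the from parameter is not a date but a number, i.e. with from=5 it will only show results from the 5th article onward. So you can remove 'from' and 'to' from the request parameters.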
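A minimal sketch of request parameters that match the URL above. The parameter names are taken from that URL; the function name build_search_params and the idea that from is a zero-based offset equal to (page - 1) * size are assumptions inferred from from=0 on page 1, not documented CNN behaviour.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://edition.cnn.com/search"


def build_search_params(search_term, page_number, page_size=10):
    """Build query parameters shaped like the CNN search URL (hypothetical helper)."""
    return {
        "q": search_term,
        # Assumption: 'from' is a zero-based result offset, not a date.
        "from": (page_number - 1) * page_size,
        "size": page_size,
        "page": page_number,
        "sort": "newest",
    }


params = build_search_params("mccarthy", page_number=2)
example_url = f"{SEARCH_URL}?{urlencode(params)}"
print(example_url)
```

The same dictionary can be passed as params= to requests.get(), which does the URL encoding itself.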
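articles = soup.find_all('article')
This returns an empty list because there are no <article> tags in the HTML page. Inspecting the CNN HTML, we see that the URLs you are looking for are inside <div> tags, so I changed this line to soup.find_all('div', class_="card container__item container__item--type- __item __item--type- ")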
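As a rough illustration of pulling links out of such div cards, here is a sketch run against a small made-up HTML snippet. SAMPLE_HTML, the container__link class, and the extract_links helper are all hypothetical; the real CNN markup may differ. It matches on the single class card rather than the full class string, skips cards that contain no link (the question's code would raise a TypeError there), and uses urljoin to turn relative links into absolute ones.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a rendered results page.
SAMPLE_HTML = """
<div class="card container__item">
  <a class="container__link" href="/2023/01/04/politics/example-story/index.html">Example story</a>
</div>
<div class="card container__item"><span>No link in this card</span></div>
<div class="card container__item">
  <a class="container__link" href="https://www.cnn.com/videos/example">Example video</a>
</div>
"""


def extract_links(html, base_url):
    """Return absolute URLs of the first link found in each 'card' div."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # class_="card" matches any div whose class list contains "card".
    for card in soup.find_all("div", class_="card"):
        anchor = card.find("a", href=True)
        if anchor is None:
            continue  # a card without a link is skipped instead of crashing
        links.append(urljoin(base_url, anchor["href"]))
    return links


links = extract_links(SAMPLE_HTML, "https://www.cnn.com")
print(links)
```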
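document = open('testfile.docx','w')  # Open the existing Word document
You have imported the docx module but are not using it. A Word document (which needs extra data for formatting) should be opened like this: document = Document(). With the import docx from the question, that is docx.Document(); alternatively, add from docx import Document. Here is the docx documentation for reference: https://python-docx.readthedocs.io/en/latest/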
while articles or page_number == 1:
I don't think this line is necessary.
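If more than one page is wanted, one possible alternative to that condition is a bounded loop that stops at the first empty page. In the question's script the first page yields no articles, so the original condition ends the loop after a single pass with nothing written. The sketch below uses a fake fetch function so that it runs without network access; collect_results, fake_fetch, FAKE_PAGES, and max_pages are hypothetical names.

```python
def collect_results(fetch_page, max_pages=5):
    """Call fetch_page(1), fetch_page(2), ... until a page is empty or max_pages is hit."""
    results = []
    for page_number in range(1, max_pages + 1):
        page_items = fetch_page(page_number)
        if not page_items:
            break  # an empty page means there is nothing further to read
        results.extend(page_items)
    return results


# Hypothetical stand-in for a real request: two pages of links, then nothing.
FAKE_PAGES = {1: ["/a", "/b"], 2: ["/c"]}


def fake_fetch(page_number):
    return FAKE_PAGES.get(page_number, [])


all_links = collect_results(fake_fetch)
print(all_links)
```

The max_pages cap keeps the loop from running indefinitely if a site keeps returning results.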
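The main problem seems to be that this page needs to run JavaScript to render its content. Using requests.get() by itself won't do that. You would need a library such as Requests-HTML. I tried this, but the articles still did not render, so I'm not sure.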
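One way to check whether JavaScript rendering is the cause is to count the result cards in the raw HTML that requests receives. The sketch below does this on two made-up snippets: STATIC_SHELL stands for what a plain HTTP request might return, and RENDERED for what a browser might show after the scripts have run. Both snippets, the search__results class, and the count_result_cards helper are assumptions for illustration only.

```python
from bs4 import BeautifulSoup


def count_result_cards(html):
    """Count div elements whose class list contains 'card'."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_="card"))


# Hypothetical static response: an empty shell that scripts would fill in later.
STATIC_SHELL = '<div class="search__results"></div><script src="/search.js"></script>'

# Hypothetical markup after a browser has run the page's scripts.
RENDERED = (
    '<div class="search__results">'
    '<div class="card container__item"><a href="/x">X</a></div>'
    "</div>"
)

static_count = count_result_cards(STATIC_SHELL)
rendered_count = count_result_cards(RENDERED)
print(static_count, rendered_count)
```

Passing response.text from requests.get() to the same function and getting a count of 0, while the browser's inspector shows cards, would suggest the results are being added by JavaScript.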
