Scraping web data retrieved using user IDs into a CSV?

m0rkklqb · asked 2023-07-31 · Other
Answers (1) · Views (113)

I need to parse Yelp reviews so that I can analyze the data with Pandas; the end goal is to detect whether the reviews are fake. How do I get each reviewer's user ID so that I can include the following columns in the CSV file:

  • The date the reviewer joined Yelp
  • The number of reviews they have posted
  • Their mean rating
  • And the number of reviews they posted within 1 day of the originally collected review (e.g., if a review was posted on 1/4/23, I want to know how many reviews they posted between 1/3/23 and 1/5/23)
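For the last bullet point, once the reviews are in a DataFrame with a parsed Date column, the ±1-day count can be sketched with pandas as follows (the dates and column names here are illustrative, not taken from the actual scrape):

```python
import pandas as pd

# Illustrative data: one reviewer's posting dates
df = pd.DataFrame({
    'Date': pd.to_datetime(['1/3/23', '1/4/23', '1/5/23', '1/9/23'],
                           format='%m/%d/%y')
})

one_day = pd.Timedelta(days=1)

# For each review, count reviews (including itself) within +/- 1 day
df['Reviews Within 1 Day'] = [
    ((df['Date'] >= d - one_day) & (df['Date'] <= d + one_day)).sum()
    for d in df['Date']
]
```

The review posted on 1/4/23 gets a count of 3 here, since 1/3, 1/4, and 1/5 all fall inside its window.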

Here is the code I started with:

from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

# Set the URL of the Yelp page for the restaurant
restaurant_url = 'https://www.yelp.com/biz/gelati-celesti-virginia-beach-2'

# Set the headers for the HTTP request
headers = {
    'host': 'www.yelp.com'
}

# Use BeautifulSoup to parse the restaurant page HTML
restaurant_page = bs(requests.get(restaurant_url, headers=headers).text, 'lxml')

# Extract the unique Yelp business ID for the restaurant
biz_id = restaurant_page.find('meta', {'name': 'yelp-biz-id'}).get('content')

# Extract the total number of reviews for the restaurant
review_count = int(restaurant_page.find('a', {'href': '#reviews'}).text.split(' ')[0])

# Create an empty list to store the review data
data = []

# Iterate through each review page (10 reviews per page)
for review_page in range(0, review_count, 10):
    # Construct the API URL to fetch reviews for the given page
    review_api_url = f'https://www.yelp.com/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={review_page}'

    # Send a request to the review API and get the JSON response
    review_data = requests.get(review_api_url, headers=headers).json()

    # Iterate through each review in the response
    for review in review_data['reviews']:
        # Extract the review text, rating, and date
        review_text = review['comment']['text']
        rating = review['rating']
        date = review['localizedDate']

        # Append the extracted data as a dictionary to the 'data' list
        data.append({
            'Review Text': review_text,
            'Rating': rating,
            'Date': date
        })

        # Print the last added review data
        print(data[-1])

# Create a DataFrame from the collected review data
df = pd.DataFrame(data)

# Save the DataFrame as a CSV file
df.to_csv('Yelp Review.csv', index=None)

This is what I tried:

from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

# Set the URL of the Yelp page for the restaurant
restaurant_url = 'https://www.yelp.com/biz/gelati-celesti-virginia-beach-2'

# Set the headers for the HTTP request
headers = {
    'host': 'www.yelp.com'
}

# Use BeautifulSoup to parse the restaurant page HTML
restaurant_page = bs(requests.get(restaurant_url, headers=headers).text, 'lxml')

# Extract the unique Yelp business ID for the restaurant
biz_id = restaurant_page.find('meta', {'name': 'yelp-biz-id'}).get('content')

# Extract the total number of reviews for the restaurant
review_count = int(restaurant_page.find('a', {'href': '#reviews'}).text.split(' ')[0])

# Create an empty list to store the review data
data = []

# Iterate through each review page (10 reviews per page)
for review_page in range(0, review_count, 10):
    # Construct the API URL to fetch reviews for the given page
    review_api_url = f'https://www.yelp.com/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={review_page}'

    # Send a request to the review API and get the JSON response
    review_data = requests.get(review_api_url, headers=headers).json()

    # Iterate through each review in the response
    for review in review_data['reviews']:
        # Extract the review text, rating, and date
        review_text = review['comment']['text']
        rating = review['rating']
        date = review['localizedDate']

        # Extract the reviewer's information
        reviewer = review['user']
        join_date = reviewer['joinDate']
        user_review_count = reviewer['reviewCount']  # renamed to avoid shadowing the page-loop total
        average_rating = reviewer['averageRating']
        reviewer_id = reviewer['id']

        # Fetch additional information from the reviewer's user ID
        user_info_url = f'https://www.yelp.com/user_details?userid={reviewer_id}'
        user_info_page = bs(requests.get(user_info_url, headers=headers).text, 'lxml')

        # Extract additional information from the user profile if needed
        # For example, you can fetch user name, location, etc. using appropriate selectors

        # Append the extracted data as a dictionary to the 'data' list
        data.append({
            'Review Text': review_text,
            'Rating': rating,
            'Date': date,
            'Join Date': join_date,
            'Review Count': user_review_count,
            'Average Rating': average_rating,
            'Reviewer ID': reviewer_id
        })

        # Print the last added review data
        print(data[-1])

# Create a DataFrame from the collected review data
df = pd.DataFrame(data)

# Calculate the number of reviews within a 1-day span for each review
df['Reviews Within 1 Day'] = 0
df['Date'] = pd.to_datetime(df['Date'])

for i, row in df.iterrows():
    current_date = row['Date']
    one_day_before = current_date - pd.DateOffset(days=1)
    one_day_after = current_date + pd.DateOffset(days=1)
    
    reviews_within_1_day = df[(df['Date'] >= one_day_before) & (df['Date'] <= one_day_after)].shape[0]
    df.loc[i, 'Reviews Within 1 Day'] = reviews_within_1_day

# Save the DataFrame as a CSV file
df.to_csv('Yelp Review.csv', index=None)


I keep getting all kinds of errors. I would really appreciate any help. Also, how can I make this work for the "not recommended reviews" section? (i.e., https://www.yelp.com/not_recommended_reviews/gelati-celesti-virginia-beach-2 rather than https://www.yelp.com/biz/gelati-celesti-virginia-beach-2)


cidc1ykv1#

The user ID and the user profile URL are present in the JSON response of the API request; the other details are not.
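Because fields like `joinDate` or `averageRating` may simply be absent from the review JSON, reading nested keys with `dict.get` avoids a `KeyError` for whatever is missing. This is a generic Python pattern, shown here on a made-up review dict that only mimics the response shape:

```python
# A made-up review dict mimicking the shape of the API response
review = {
    'comment': {'text': 'Great gelato!'},
    'rating': 5,
    'user': {'markupDisplayName': 'Jane D.', 'reviewCount': 12},
}

user = review.get('user', {})
row = {
    'Review Text': review.get('comment', {}).get('text', ''),
    'Rating': review.get('rating'),
    'User Name': user.get('markupDisplayName', ''),
    'User Review Count': user.get('reviewCount', 0),
    # Fields missing from the response fall back to a default
    # instead of raising KeyError
    'Elite Year': user.get('eliteYear', ''),
    'Join Date': user.get('joinDate', ''),
}
```

With this pattern, a review whose user object lacks `eliteYear` or `joinDate` still produces a complete row with empty placeholders.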

from bs4 import BeautifulSoup as bs
import pandas as pd
import requests
from time import sleep

# Set the URL of the Yelp page for the restaurant
restaurant_url = 'https://www.yelp.com/biz/gelati-celesti-virginia-beach-2'

# Set the headers for the HTTP request
headers = {
    'host': 'www.yelp.com'
}

# Use BeautifulSoup to parse the restaurant page HTML
restaurant_page = bs(requests.get(restaurant_url, headers=headers).text, 'lxml')

# Extract the unique Yelp business ID for the restaurant
biz_id = restaurant_page.find('meta', {'name': 'yelp-biz-id'}).get('content')

# Extract the total number of reviews for the restaurant
review_count = int(restaurant_page.find('a', {'href': '#reviews'}).text.split(' ')[0])

# Create an empty list to store the review data
data = []

# Iterate through each review page (10 reviews per page)
for review_page in range(0, review_count, 10):
    # Construct the API URL to fetch reviews for the given page
    review_api_url = f'https://www.yelp.com/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={review_page}'

    # Send a request to the review API and get the JSON response
    review_data = requests.get(review_api_url, headers=headers).json()

    # Iterate through each review in the response
    for review in review_data['reviews']:
        # Append the extracted data as a dictionary to the 'data' list
        data.append({
            'Review Text': review['comment']['text'],
            'Rating': review['rating'],
            'Date': review['localizedDate'],
            'User ID': review['userId'],
            'User Name': review['user']['markupDisplayName'],
            'User Profile URL': 'https://www.yelp.com/' + review['user']['link'],
            'User Review Count': review['user']['reviewCount'],
            'Elite Year': review['user'].get('eliteYear', ''),
            'Feedback Useful': review['feedback']['counts']['useful'],
            'Feedback Funny': review['feedback']['counts']['funny'],
            'Feedback Cool': review['feedback']['counts']['cool'],
        })

        # Print the last added review data
        print(data[-1])
        sleep(6)

# Create a DataFrame from the collected review data
df = pd.DataFrame(data)

# Save the DataFrame as a CSV file
df.to_csv('Yelp Reviews.csv', index=None)

Here is the updated code, with a few extra fields added in case they are needed.
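The `sleep(6)` above throttles the requests; separately, since intermittent request failures are common when scraping, each `requests.get` call could be wrapped in a small retry helper with exponential backoff. This is a generic sketch, not anything Yelp-specific:

```python
import time

def retry(fn, attempts=3, base_delay=1.0):
    """Call fn(); on exception, wait base_delay * 2**i seconds and try again."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** i))

# Hypothetical usage with the code above:
# review_data = retry(lambda: requests.get(review_api_url, headers=headers).json())
```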
Below is a sample of the code's output (screenshot omitted).
