I want to scrape Yelp reviews so I can analyze the data with pandas. The end goal is to detect whether the reviews are fake. How do I get the user ID so that I can include the following columns in the CSV file:
- the date the reviewer joined Yelp
- the number of reviews they have posted
- their mean rating
- and the number of reviews they posted within 1 day of the originally collected review (for example, if a review was posted on 1/4/23, I want to know how many reviews they posted between 1/3/23 and 1/5/23)
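The ±1-day count in the last bullet can be computed in pandas on its own, independent of the scraping. Here is a minimal sketch using made-up dates (the `Date` column name matches the question's code; the values are illustrative):

```python
import pandas as pd

# Hypothetical review dates; in practice these come from the scraped data.
df = pd.DataFrame({
    'Date': pd.to_datetime(['1/3/23', '1/4/23', '1/5/23', '1/9/23']),
})

# For each review, count how many reviews fall within +/- 1 day of it
# (the review itself is included in the count).
dates = df['Date']
df['Reviews Within 1 Day'] = [
    ((dates >= d - pd.Timedelta(days=1)) & (dates <= d + pd.Timedelta(days=1))).sum()
    for d in dates
]

print(df['Reviews Within 1 Day'].tolist())  # [2, 3, 2, 1]
```

To count per reviewer rather than across all reviews, the same window test would be applied within each `groupby('Reviewer ID')` group.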
Here is the code I started with:
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

# Set the URL of the Yelp page for the restaurant
restaurant_url = 'https://www.yelp.com/biz/gelati-celesti-virginia-beach-2'

# Set the headers for the HTTP request
headers = {
    'host': 'www.yelp.com'
}

# Use BeautifulSoup to parse the restaurant page HTML
restaurant_page = bs(requests.get(restaurant_url, headers=headers).text, 'lxml')

# Extract the unique Yelp business ID for the restaurant
biz_id = restaurant_page.find('meta', {'name': 'yelp-biz-id'}).get('content')

# Extract the total number of reviews for the restaurant
review_count = int(restaurant_page.find('a', {'href': '#reviews'}).text.split(' ')[0])

# Create an empty list to store the review data
data = []

# Iterate through each review page (10 reviews per page)
for review_page in range(0, review_count, 10):
    # Construct the API URL to fetch reviews for the given page
    review_api_url = f'https://www.yelp.com/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={review_page}'

    # Send a request to the review API and get the JSON response
    review_data = requests.get(review_api_url, headers=headers).json()

    # Iterate through each review in the response
    for review in review_data['reviews']:
        # Extract the review text, rating, and date
        review_text = review['comment']['text']
        rating = review['rating']
        date = review['localizedDate']

        # Append the extracted data as a dictionary to the 'data' list
        data.append({
            'Review Text': review_text,
            'Rating': rating,
            'Date': date
        })

        # Print the last added review data
        print(data[-1])

# Create a DataFrame from the collected review data
df = pd.DataFrame(data)

# Save the DataFrame as a CSV file
df.to_csv('Yelp Review.csv', index=False)
Here is what I tried:
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

# Set the URL of the Yelp page for the restaurant
restaurant_url = 'https://www.yelp.com/biz/gelati-celesti-virginia-beach-2'

# Set the headers for the HTTP request
headers = {
    'host': 'www.yelp.com'
}

# Use BeautifulSoup to parse the restaurant page HTML
restaurant_page = bs(requests.get(restaurant_url, headers=headers).text, 'lxml')

# Extract the unique Yelp business ID for the restaurant
biz_id = restaurant_page.find('meta', {'name': 'yelp-biz-id'}).get('content')

# Extract the total number of reviews for the restaurant
review_count = int(restaurant_page.find('a', {'href': '#reviews'}).text.split(' ')[0])

# Create an empty list to store the review data
data = []

# Iterate through each review page (10 reviews per page)
for review_page in range(0, review_count, 10):
    # Construct the API URL to fetch reviews for the given page
    review_api_url = f'https://www.yelp.com/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={review_page}'

    # Send a request to the review API and get the JSON response
    review_data = requests.get(review_api_url, headers=headers).json()

    # Iterate through each review in the response
    for review in review_data['reviews']:
        # Extract the review text, rating, and date
        review_text = review['comment']['text']
        rating = review['rating']
        date = review['localizedDate']

        # Extract the reviewer's information
        # (renamed so it doesn't shadow the page-count review_count above)
        reviewer = review['user']
        join_date = reviewer['joinDate']
        reviewer_review_count = reviewer['reviewCount']
        average_rating = reviewer['averageRating']
        reviewer_id = reviewer['id']

        # Fetch additional information from the reviewer's user ID
        user_info_url = f'https://www.yelp.com/user_details?userid={reviewer_id}'
        user_info_page = bs(requests.get(user_info_url, headers=headers).text, 'lxml')
        # Extract additional information from the user profile if needed
        # For example, you can fetch user name, location, etc. using appropriate selectors

        # Append the extracted data as a dictionary to the 'data' list
        data.append({
            'Review Text': review_text,
            'Rating': rating,
            'Date': date,
            'Join Date': join_date,
            'Review Count': reviewer_review_count,
            'Average Rating': average_rating,
            'Reviewer ID': reviewer_id
        })

        # Print the last added review data
        print(data[-1])

# Create a DataFrame from the collected review data
df = pd.DataFrame(data)

# Calculate the number of reviews within a 1-day span for each review
df['Reviews Within 1 Day'] = 0
df['Date'] = pd.to_datetime(df['Date'])
for i, row in df.iterrows():
    current_date = row['Date']
    one_day_before = current_date - pd.DateOffset(days=1)
    one_day_after = current_date + pd.DateOffset(days=1)
    reviews_within_1_day = df[(df['Date'] >= one_day_before) & (df['Date'] <= one_day_after)].shape[0]
    df.loc[i, 'Reviews Within 1 Day'] = reviews_within_1_day

# Save the DataFrame as a CSV file
df.to_csv('Yelp Review.csv', index=False)
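One likely source of the errors in the attempt above is that it reads nested JSON keys directly (`review['user']['joinDate']`), which raises `KeyError` as soon as a field is absent from the response. A small helper that reads every field defensively with `dict.get` can be tested without any network access; the field names here mirror the question's code and are assumptions about the response shape, not a documented Yelp schema:

```python
def extract_review_row(review):
    """Flatten one review dict into the columns needed for the CSV.

    Uses dict.get with defaults so a missing field yields None instead
    of raising KeyError.
    """
    user = review.get('user') or {}
    return {
        'Review Text': (review.get('comment') or {}).get('text'),
        'Rating': review.get('rating'),
        'Date': review.get('localizedDate'),
        'Reviewer ID': user.get('id'),
        'Join Date': user.get('joinDate'),  # may simply not exist in the feed JSON
    }

# Hypothetical review object, shaped like the fields the question's code expects
sample = {
    'comment': {'text': 'Great gelato!'},
    'rating': 5,
    'localizedDate': '1/4/2023',
    'user': {'id': 'abc123'},  # no joinDate -> None instead of a crash
}
row = extract_review_row(sample)
print(row['Reviewer ID'], row['Join Date'])  # abc123 None
```

Rows built this way can still be appended to `data` and turned into the same DataFrame; columns with `None` then show up as empty cells in the CSV rather than aborting the scrape.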
I keep running into various errors, and I would really appreciate any help. Also, how can I make this work for the "not recommended" reviews section (i.e. https://www.yelp.com/not_recommended_reviews/gelati-celesti-virginia-beach-2 rather than https://www.yelp.com/biz/gelati-celesti-virginia-beach-2)?
1 Answer
The user ID and the user profile URL are present in the JSON response of the API request, but the other details are not.
Here is the updated code, with some extra fields added as well.
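The answer's snippet did not survive in this copy of the page. A minimal sketch of the approach it describes, taking only the ID/URL from the JSON and parsing the remaining details out of the reviewer's profile HTML, is below. It runs against a deliberately simplified, made-up piece of markup: the class names `join-date` and `review-count` are placeholders, not Yelp's real markup, so the selectors would need adapting after inspecting the live `user_details` page:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified profile markup. On the live site you would fetch
# https://www.yelp.com/user_details?userid=<id> and inspect the real structure.
profile_html = """
<div class="user-profile">
  <span class="join-date">Joined March 2019</span>
  <span class="review-count">42 reviews</span>
</div>
"""

soup = BeautifulSoup(profile_html, 'html.parser')

# Strip the "Joined " prefix to get just the date text
join_date = soup.find('span', {'class': 'join-date'}).text.replace('Joined ', '')
# Take the leading number from "42 reviews"
review_count = int(soup.find('span', {'class': 'review-count'}).text.split(' ')[0])

print(join_date, review_count)  # March 2019 42
```

The same `find(...)` calls, pointed at the real selectors, would replace the unused `user_info_page` lookup in the question's second attempt.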