Scraping Yelp for reviews
Question
I wrote this program to pull reviews from a Yelp page, specifically the username, date, text and score. However, the username doesn't appear in my JSON file. Could it be because the username is set as a link? Does anyone know how to solve this problem?
import requests
from bs4 import BeautifulSoup
import json

url = "https://www.yelp.ie/biz/diwali-indian-restaurant-dublin"
review_url = "https://www.yelp.ie/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={start}"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
biz_id = soup.select_one('[name=yelp-biz-id]')['content']

reviews = []
for start in range(0, 20, 10):  # <-- increase this range for loading more reviews
    r_url = review_url.format(biz_id=biz_id, start=start)
    data = requests.get(r_url, headers=headers).json()
    for r in data['reviews']:
        review_text = BeautifulSoup(r['comment']['text'], 'html.parser').text
        author = r.get('username') or 'Anonymous'
        date = r['localizedDate']
        score = r['rating']
        reviews.append({"author": author, "date": date, "review_text": review_text, "score": score})

with open('reviews.json', 'w') as f:
    json.dump(reviews, f, indent=4)
Answer 1
Score: 1
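If you want the username, you can find it under the user key of each review in the returned data. The review dicts have no top-level username key, so r.get('username') never finds it: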
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.ie/biz/diwali-indian-restaurant-dublin"
review_url = "https://www.yelp.ie/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={start}"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
biz_id = soup.select_one('[name=yelp-biz-id]')['content']

reviews = []
for start in range(0, 20, 10):  # <-- increase this range for loading more reviews
    r_url = review_url.format(biz_id=biz_id, start=start)
    data = requests.get(r_url, headers=headers).json()
    for r in data['reviews']:
        review_text = BeautifulSoup(r['comment']['text'], 'html.parser').text
        author = r['user']['altText']  # <-- the username is here
        date = r['localizedDate']
        score = r['rating']
        reviews.append({"author": author, "date": date, "review_text": review_text, "score": score})

with open('reviews.json', 'w') as f:
    json.dump(reviews, f, indent=4)
This writes the following to reviews.json:
...
    {
        "author": "Kalyan C.",
        "date": "17/12/2022",
        "review_text": "Made a reservation online (yesterday) and arrived right on time (5-10 early in fact for a 8:45 PM appointment) with a party of 5 and to the disappointment it was not honored. Neither the manager or the owner was apologetic and all they say is if you booked it online few hours ago then we don't honor. What logic is that. So stupid. But they offered us to wait another 45 for a table. 8:45 pm dinner was already late and was the only appointment available and now wait till 9:30 to be seated... come on. Atleast offer a voucher for future for their own stupid logic of not honoring the reservation, not that it helps but we will at least I can feel that they really messed up.",
        "score": 1
    },
    {
        "author": "Evelyn K.",
        "date": "18/2/2023",
        "review_text": "Great spot for Indian cuisine in Dublin. It was packed before 6 PM so be sure to grab a reservation on opentable. Got the tikka masala w/ rice and garlic naan, and the samosa chat. Reasonably priced and delicious food.",
        "score": 5
    },
    {
        "author": "Tara M.",
        "date": "13/3/2023",
        "review_text": "so good!! I've eaten some great Indian food and it definitely hit the spot for a quick casual Indian dinner on a rainy day. nothing fancy or life changing, but quality, good basic Indian food in a very touristy area. not cheap, but Indian options here are slim so worth it for not having to travel far.",
        "score": 4
    }
...
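
If you want to keep the 'Anonymous' fallback from the question, one option is to read the nested key defensively rather than with r['user']['altText'], which raises KeyError when the key is absent. This is only a sketch: get_author is a hypothetical helper name, and whether Yelp ever omits the user entry is an assumption, since the responses above always included it.

```python
def get_author(review, default="Anonymous"):
    """Return the reviewer's display name from one review dict.

    Assumes the name sits under review['user']['altText'], as in the
    answer above. Falls back to `default` if either level is missing
    or empty.
    """
    user = review.get("user") or {}  # tolerate a missing or null "user"
    return user.get("altText") or default


# Hand-written samples that mimic the shape used above (not real Yelp data).
with_user = {"user": {"altText": "Jane D."}, "rating": 5}
without_user = {"rating": 3}

print(get_author(with_user))
print(get_author(without_user))
```

Inside the loop you would then write author = get_author(r). When a field you expect is missing, printing list(r.keys()) for one review is a quick way to see where the data actually lives.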