Scraping Yelp for reviews

Question

I wrote this program to pull reviews from a Yelp page, specifically the username, date, text and score. However, the username doesn't appear in my JSON file. Could it be because the username is rendered as a link? Does anyone know how to solve this problem?

import requests
from bs4 import BeautifulSoup
import json

url = "https://www.yelp.ie/biz/diwali-indian-restaurant-dublin"
review_url = "https://www.yelp.ie/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={start}"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
biz_id = soup.select_one('[name=yelp-biz-id]')['content']

reviews = []

for start in range(0, 20, 10):  # <-- increase this range for loading more reviews
    r_url = review_url.format(biz_id=biz_id, start=start)
    data = requests.get(r_url, headers=headers).json()
    for r in data['reviews']:
        review_text = BeautifulSoup(r['comment']['text'], 'html.parser').text
        author = r.get('username') or 'Anonymous'
        date = r['localizedDate']
        score = r['rating']
        reviews.append({"author": author, "date": date, "review_text": review_text, "score": score})

with open('reviews.json', 'w') as f:
    json.dump(reviews, f, indent=4)

Answer 1

Score: 1

The review feed has no top-level username field, which is why r.get('username') always falls back to 'Anonymous'. The name is nested under the user key of each review:

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.ie/biz/diwali-indian-restaurant-dublin"
review_url = "https://www.yelp.ie/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={start}"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
biz_id = soup.select_one('[name=yelp-biz-id]')['content']

reviews = []

for start in range(0, 20, 10):  # <-- increase this range for loading more reviews
    r_url = review_url.format(biz_id=biz_id, start=start)
    data = requests.get(r_url, headers=headers).json()
    for r in data['reviews']:
        review_text = BeautifulSoup(r['comment']['text'], 'html.parser').text
        author = r['user']['altText']   # <-- the username is here
        date = r['localizedDate']
        score = r['rating']
        reviews.append({"author": author, "date": date, "review_text": review_text, "score": score})

with open('reviews.json', 'w') as f:
    json.dump(reviews, f, indent=4)

The resulting reviews.json contains:

...
    {
        "author": "Kalyan C.",
        "date": "17/12/2022",
        "review_text": "Made a reservation online (yesterday) and arrived right on time (5-10 early in fact for a 8:45 PM appointment) with a party of 5 and to the disappointment it was not honored. Neither the manager or the owner was apologetic and all they say is if you booked it online few hours ago then we don't honor. What logic is that. So stupid. But they offered us to wait another 45 for a table. 8:45 pm dinner was already late and was the only appointment available and now wait till 9:30 to be seated... come on. Atleast offer a voucher for future for their own stupid logic of not honoring the reservation, not that it helps but we will at least I can feel that they really messed up.",
        "score": 1
    },
    {
        "author": "Evelyn K.",
        "date": "18/2/2023",
        "review_text": "Great spot for Indian cuisine in Dublin. It was packed before 6 PM so be sure to grab a reservation on opentable. Got the tikka masala w/ rice and garlic naan, and the samosa chat. Reasonably priced and delicious food.",
        "score": 5
    },
    {
        "author": "Tara M.",
        "date": "13/3/2023",
        "review_text": "so good!! I've eaten some great Indian food and it definitely hit the spot for a quick casual Indian dinner on a rainy day. nothing fancy or life changing, but quality, good basic Indian food in a very touristy area. not cheap, but Indian options here are slim so worth it for not having to travel far.",
        "score": 4
    }
...
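
If some reviews come back without a user entry, indexing r['user']['altText'] directly would raise an error. Below is a minimal sketch of a more defensive lookup that keeps the 'Anonymous' fallback from the question. The helper name extract_author and the sample records are hypothetical; only the user and altText keys come from the answer's code.

```python
def extract_author(review, default="Anonymous"):
    """Return the reviewer's display name, or `default` if it is missing.

    Assumes the review dict has the shape used in the answer:
    review['user']['altText'] holds the name.
    """
    user = review.get("user") or {}  # tolerate a missing or null 'user'
    if not isinstance(user, dict):
        return default
    return user.get("altText") or default


# Hypothetical sample records, not real Yelp data.
samples = [
    {"user": {"altText": "Jane D."}, "rating": 5},
    {"user": None, "rating": 4},
    {"rating": 3},
]
authors = [extract_author(r) for r in samples]
print(authors)
```

In the loop from the answer, this would replace the direct lookup with author = extract_author(r).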

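When a field is not where you expect it, one way to locate it is to search the parsed JSON for a value you can already see on the page, such as a reviewer's name. The sketch below walks a nested structure and reports the path to every matching string. The function name find_value_paths and the sample payload are hypothetical; the payload only mimics the keys that the answer's code reads.

```python
def find_value_paths(obj, needle, path=()):
    """Yield the path (tuple of keys and indexes) to every string containing `needle`."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_value_paths(value, needle, path + (key,))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from find_value_paths(item, needle, path + (index,))
    elif isinstance(obj, str) and needle in obj:
        yield path


# Hypothetical payload shaped like the fields used in the answer.
sample = {
    "reviews": [
        {
            "user": {"altText": "Jane D."},
            "comment": {"text": "Nice place."},
            "localizedDate": "1/1/2023",
            "rating": 5,
        }
    ]
}

paths = list(find_value_paths(sample, "Jane D."))
print(paths)
```

Running the same search on the real data returned by requests.get(r_url, headers=headers).json() should show which keys lead to the name.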
huangapple
  • Posted on 13 May 2023, 23:01:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76243390.html