Scraping a URL address for reviews

Question

I need to extract the reviews from the URL of a product on this site, specifically the username, date, text, and score. However, I keep getting this error: Failed to retrieve reviews for page 1. Error: "Connection broken: InvalidChunkLength(got length b'', 0 bytes read)"; "InvalidChunkLength(got length b'', 0 bytes read)". I tried adding a time delay, but it still doesn't work. How can I fix this?

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.emag.ro/covor-antiderapant-negru-poliester-80-x-300-cm-c027-80x300/pd/DBY5YJMBM/?ref=sponsored_products_fill_a_b_5_3&provider=rec&recid=rec_73_c449bb3e50b63cc8f6da4a42a31af359f6cbfb3c547bc5748cb6d45501a29685_1684315709&scenario_ID=73&aid=034a897a-956c-11ed-9004-0ab644dfda7c&oid=89847310"

review_url = "https://www.emag.ro/review/get-review-listing-page?id={product_id}&page={page}"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0'}


product_id = url.split("/pd/")[1].split("/")[0]

reviews = []

page = 1
while True:
    r_url = review_url.format(product_id=product_id, page=page)
    try:
        response = requests.get(r_url, headers=headers)
        response.raise_for_status()  
        data = response.json()
    except (requests.RequestException, json.JSONDecodeError) as e:
        print(f"Failed to retrieve reviews for page {page}. Error: {str(e)}")
        break

    if not data['reviews']:
        break

    for r in data['reviews']:
        review_text = r['content']
        author = r['author']['name']
        date = r['date']
        score = r['rating']
        reviews.append({"author": author, "date": date, "review_text": review_text, "score": score})

    page += 1

with open('reviews.json', 'w') as f:
    json.dump(reviews, f, indent=4)

Answer 1

Score: 1

You got the review URL all wrong.

To get the reviews, you need the following parts, for example:

  1. https://www.emag.ro/product-feedback/
  2. covor-pufos-moale-compatibil-multiple-spatii-si-stiluri-grosime-4cm-120cm-x-160cm-gri-ronyes18/pd/DBSKZPMBM
  3. /reviews/list

This gives you a JSON with everything you need. Check it out yourself:

https://www.emag.ro/product-feedback/covor-kring-meknes-1200-gsm-100-poliester-160x230-cm-maro-e2020-8b/pd/D605NYMBM/reviews/list
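The three parts listed above can also be assembled programmatically. What follows is only a minimal sketch, assuming every product URL has the shape /<slug>/pd/<product id>/ seen in the examples; the names FEEDBACK_BASE, PRODUCT_PATH and build_review_url are made up for illustration and are not part of any eMAG API.

```python
import re
from urllib.parse import urlsplit

# Base URL taken from the answer; the helper names are hypothetical.
FEEDBACK_BASE = "https://www.emag.ro/product-feedback/"

# Assumed path shape: /<slug>/pd/<product id>/...
PRODUCT_PATH = re.compile(r"^/(?P<slug>[^/]*)/pd/(?P<product_id>[^/]+)")


def build_review_url(product_url: str) -> str:
    """Turn a product page URL into the matching reviews/list URL."""
    # urlsplit() separates the path from the ?ref=... tracking query.
    path = urlsplit(product_url).path
    match = PRODUCT_PATH.match(path)
    if match is None:
        raise ValueError(f"not a product URL: {product_url!r}")
    return (
        f"{FEEDBACK_BASE}{match['slug']}/pd/"
        f"{match['product_id']}/reviews/list"
    )
```

Called with the full D605NYMBM product URL (tracking parameters included), this should return the same reviews/list URL shown above. Parsing the path with urlsplit avoids splitting on 'ro/', which could misfire if a slug happened to contain that substring.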

Here's a full working example:

import json
import re

import requests

SAMPLE_URLS = [
    "https://www.emag.ro/covor-pufos-moale-compatibil-multiple-spatii-si-stiluri-grosime-4cm-120cm-x-160cm-gri-ronyes18/pd/DBSKZPMBM/?ref=profiled_categories_home_1_3&provider=rec&recid=rec_50_c4fe1107f88ac6bacec1c30b98de4480bfbe39d2b440960a4182a588f76c40f9_1684357366&scenario_ID=50",
    "https://www.emag.ro/covor-kring-meknes-1200-gsm-100-poliester-160x230-cm-maro-e2020-8b/pd/D605NYMBM/?ref=profiled_categories_home_1_2&provider=rec&recid=rec_50_c4fe1107f88ac6bacec1c30b98de4480bfbe39d2b440960a4182a588f76c40f9_1684357366&scenario_ID=50",
    "https://www.emag.ro/antiderapant-pentru-covor-tip-plasa-cali-poliester-80-x-180-cm-j119c9-99st4951/pd/DZDZXTMBM/?ref=profiled_categories_home_1_1&provider=rec&recid=rec_50_c4fe1107f88ac6bacec1c30b98de4480bfbe39d2b440960a4182a588f76c40f9_1684357366&scenario_ID=50"
]

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.42",
}

# Captures the product ID from a review URL (used for the output filename).
PRODUCT_ID = re.compile(r"pd/(.*?)/reviews")


def splitter(url: str) -> str:
    # Drop the query string, then keep everything after the domain:
    # "<slug>/pd/<product id>/"
    return url.rsplit('?')[0].split('ro/')[-1]


def save_json(data: dict, filename: str) -> None:
    with open(filename, 'w') as file:
        json.dump(data, file, indent=4)


# Build one reviews/list URL per product URL.
review_urls = [
    f"https://www.emag.ro/product-feedback/{splitter(url)}reviews/list"
    for url in SAMPLE_URLS
]

with requests.Session() as s:
    s.headers.update(HEADERS)
    for url in review_urls:
        review_data = s.get(url).json()
        # Save the full response as <product id>.json
        save_json(review_data, f"{re.search(PRODUCT_ID, url).group(1)}.json")
        print(url)
        for review in review_data["reviews"]["items"]:
            print(f"{review['rating']}: {review['title']}")
            print(f"{review['content']}")

This should print (shortened for brevity):

https://www.emag.ro/product-feedback/covor-pufos-moale-compatibil-multiple-spatii-si-stiluri-grosime-4cm-120cm-x-160cm-gri-ronyes18/pd/DBSKZPMBM/reviews/list
2: Dezamagitor
A ajuns foarte repede, bine ambalat. Impaturit, f subtire, l-as numi paturica( voi anexa poze) se vede prin el asa este de subtire.
1: Nu recomand
Când am deschis easy boxul, pentru că acolo a sosit, am rămas surprins sa vad cat e se mic pachetul. Că o pătură. Nici pe departe grosimea de 4 cm. Numai aspect de covor nu are. Nu recomand.
1: Nemultumit
Cel mai prost produs comandat vreodată, nu se poate numi covor pt ca nu este, este ca o pătura! Oribil! O mare reclamație din partea mea merita acest produs! Foarte nemulțumita! Un mare retur! Timp pierdut degeaba!
1: Nu recomand
Nu este deloc nici pe departe cu ce este în poza... Asta nu e covor ci pătură de picnic
1: Nu recomand
The worst material ever
1: Nu recomand
Nu-i ce am văzut în poză îmi pare rău că l-am comandat
1: Nu recomand
Dezamăgitor, nu se poate numii covor ci mai degrabă o pătură, jalnic
1: Nu recomand
Nu se poate numi covor.
1: Nu recomand
Covorul este mai degrabă o paturica... Nu este deloc ca în poza și în descriere. Regret ca l-am comandat
1: Nu recomand
E efectiv o patura (pufoasa ce-i drept)

...

This also dumps each JSON response to a file as it iterates through the URLs.

A sample JSON file is too big to show in full, but running this, for example:

$ jq -r '.reviews.items[0].product.image' < D605NYMBM.json

Should output:

{
  "original": "https://s13emagst.akamaized.net/products/30770/30769634/images/res_835280d02e4d164887cce1917e98a10e.jpg",
  "resized_images": [
    {
      "size": "80x80",
      "url": "https://s13emagst.akamaized.net/products/30770/30769634/images/res_835280d02e4d164887cce1917e98a10e.jpg?width=80&height=80&hash=F675BCF7E208EEA9E59AAC0EFB10F96D"
    },
    {
      "size": "130x130",
      "url": "https://s13emagst.akamaized.net/products/30770/30769634/images/res_835280d02e4d164887cce1917e98a10e.jpg?width=100&height=100&hash=2EED3E66E2550EE01608A7A622C72BBD"
    },
    {
      "size": "300x300",
      "url": "https://s13emagst.akamaized.net/products/30770/30769634/images/res_835280d02e4d164887cce1917e98a10e.jpg?width=300&height=300&hash=AAD99FCED7B7A35CC54810994A4EA9B8"
    },
    {
      "size": "450x450",
      "url": "https://s13emagst.akamaized.net/products/30770/30769634/images/res_835280d02e4d164887cce1917e98a10e.jpg?width=450&height=450&hash=21D2141F4007A320F28E893E5E299114"
    }
  ]
}
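
If jq is not available, the saved file can be read back in Python instead. This is a rough sketch that relies only on the keys visible in this answer (reviews, items, rating, title, content, product, image); the function names are hypothetical, and any other field of the response, such as an author or a date, would have to be looked up in the saved JSON rather than assumed.

```python
import json
from pathlib import Path


def load_review_items(path):
    """Return the list stored under reviews -> items in a saved response."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data.get("reviews", {}).get("items", [])


def summarize(items):
    """Keep only the fields the script above prints."""
    return [
        {key: item.get(key) for key in ("rating", "title", "content")}
        for item in items
    ]


def first_product_image(items):
    """Python counterpart of the jq path .reviews.items[0].product.image."""
    if not items:
        return None
    return items[0].get("product", {}).get("image")


if __name__ == "__main__":
    # File name assumed to be one of those written by the script above.
    saved = Path("D605NYMBM.json")
    if saved.exists():
        items = load_review_items(saved)
        print(json.dumps(first_product_image(items), indent=2))
        print(summarize(items)[:3])
```

Using .get() with defaults means a response that lacks one of these keys yields None or an empty list instead of raising KeyError, which makes it easier to spot products that have no reviews.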

Answer 2

Score: 0

@baduker I didn't know about this URL. Is there a similar URL for getting alternative offers? For example, for product ID DQQ60WBBM (https://www.emag.ro/-/pd/DQQ60WBBM), I want to get the main offer data (seller, price) plus the same data for the 8 (at this time) alternative offers.

huangapple
  • Posted on May 18, 2023, 05:01:27
  • When reposting, please keep this link: https://go.coder-hub.com/76276163.html