Efficiently scrape website by going through multiple different pages/categories

Question

I am having difficulties advancing my current scraping project/idea. I am attempting to web scrape all products on an online shop by category. The website link is: https://eshop.nomin.mn/.
Currently, with the aid of the great developers on this forum, I have been able to scrape the Food/Grocery category successfully using the online shop's data API (my code is at the very bottom of this post). While I could duplicate this success for other categories by changing the data API URL, I believe that would be very repetitive and inefficient.
Ideally I want to scrape all categories of the website using one spider rather than making a spider for each category. I do not know how to go about this, since in my previous projects the website's main page listed all the products, whereas this one does not. Furthermore, adding multiple data API URLs does not seem to be working for me. Each category has a different URL and a different data API, for example:
1. Electric products (https://eshop.nomin.mn/6011.html)
2. Food products (https://eshop.nomin.mn/n-foods.html)
3. Building materials (https://eshop.nomin.mn/n-building-materials-tools.html)
4. Automobile products and parts (https://eshop.nomin.mn/n-autoparts-tools.html)
5. etc.
The screenshot below shows how you can browse the website and the categories (translated to English).

![Category navigation on eshop.nomin.mn](https://i.stack.imgur.com/8jrhU.png)
Ideally my scraped end product would be a long table such as this. I have included Original Price and Listed Price separately, as some categories, such as the electric products, have two price elements in the HTML, as shown below.


<div class="item-specialPricetag-1JM">
    <span class="item-oldPrice-1sY">
        <span>1</span>
        <span>,</span>
        <span>899</span>
        <span>,</span>
        <span>990</span>
        <span>₮</span>
    </span>
</div>
<div class="item-webSpecial-Z6W">
    <span>1</span>
    <span>,</span>
    <span>599</span>
    <span>,</span>
    <span>990</span>
    <span>₮</span>
</div>

![Product listing showing original and discounted price](https://i.stack.imgur.com/rPWy2.png)
Below is my current working code, which successfully scrapes the food product category and retrieves the name, description, and price of 3000+ products. *Also, since I will be scraping multiple pages/categories, I think having a rotating/randomly generated header/user-agent would be smart. What would be the best way to integrate this idea? (A rough sketch of what I have in mind follows the code below.)*

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess  # needed for the __main__ block below
from datetime import datetime

BASE_URL = "https://eshop.nomin.mn/graphql?query=query+category($pageSize:Int!$currentPage:Int!$filters:ProductAttributeFilterInput!$sort:ProductAttributeSortInput){products(pageSize:$pageSize+currentPage:$currentPage+filter:$filters+sort:$sort){items{id+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal{created_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename}new_to_date+short_description{html+__typename}productAttributes{name+value+__typename}price{regularPrice{amount{currency+value+__typename}__typename}__typename}special_price+special_to_date+thumbnail{file_small+url+__typename}url_key+url_suffix+mp_label_data{enabled+name+priority+label_template+label_image+to_date+__typename}...on+ConfigurableProduct{variants{product{sku+special_price+price{regularPrice{amount{currency+value+__typename}__typename}__typename}__typename}__typename}__typename}__typename}page_info{total_pages+__typename}total_count+__typename}}&operationName=category&variables="


dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' Nomin CPI Foods Data'

class NominCPIFoodsSpider(scrapy.Spider):
    name = 'nomin_cpi_foods'
    allowed_domains = ['eshop.nomin.mn']  # allowed_domains expects domains, not URLs
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    # function used for start url
    def start_requests(self):
        for i in range(50):
            url = BASE_URL + '{"currentPage":' + str(i) + ',"id":24175,"filters":{"category_id":{"in":"24175"}},"pageSize":50,"sort":{"position":"DESC"}}'
            yield Request(url, self.parse)

    # function to parse
    def parse(self, response, **kwargs):
        data = response.json()
        print(data.keys())
        for item in data['data']["products"]["items"]:
            yield {
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "description": item["short_description"]["html"]
            }

        # handles pagination
        next_url = response.css("nav.custom-pagination > a.next::attr(href)").get()
        if next_url:
            yield scrapy.Request(next_url, self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(NominCPIFoodsSpider)
    process.start()
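Regarding the rotating user-agent idea above, this is roughly what I have in mind: a small downloader middleware that picks a random `User-Agent` for every request. This is an untested sketch; the UA strings and the `myproject.middlewares` path are just placeholders:

```python
import random

# Untested sketch of the rotating user-agent idea.
# The strings below are placeholders; a real list (or a package such as
# fake-useragent / scrapy-user-agents) could be swapped in instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent header on every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let the request continue through the remaining middlewares

# It would then be enabled in settings.py / custom_settings, e.g.:
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 400,
# }
```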

Sorry for the long post; any and all help is greatly appreciated. Thank you very much. ^^


# Answer 1
**Score**: 2

What you can do is visit each of the categories on the website, grab the API URL for that category, check how many pages of information that specific category has, and then extract the category ID from the URL and create a dictionary in your code that maps category IDs (keys) to page counts (values).

Then, in your `start_requests` method, instead of substituting only the current page with a variable, you do the same for the category. You can pretty much leave the rest unchanged.

One thing that is unnecessary is to continue parsing the actual web pages themselves. All of the information you need is available from the API, so yielding requests for the different HTML pages isn't really doing you any good.

Here is an example using a handful of the categories available on the site.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request
from datetime import datetime

# category ID -> number of pages, taken from the pagination shown on each category's first page
categories = {
    "19653": 4,
    "24175": 67,
    "21297": 48,
    "19518": 16,
    "19487": 40,
    "26011": 46,
    "19767": 3,
    "19469": 5,
    "19451": 4
}

dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' Nomin'


class Nomin(scrapy.Spider):
    name = 'nomin'
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    def start_requests(self):
        for cat, pages in categories.items():
            for i in range(1, pages + 1):  # currentPage is 1-indexed; + 1 so the last page is included
                url = f'https://eshop.nomin.mn/graphql?query=query+category%28%24pageSize%3AInt%21%24currentPage%3AInt%21%24filters%3AProductAttributeFilterInput%21%24sort%3AProductAttributeSortInput%29%7Bproducts%28pageSize%3A%24pageSize+currentPage%3A%24currentPage+filter%3A%24filters+sort%3A%24sort%29%7Bitems%7Bid+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal%7Bcreated_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename%7Dnew_to_date+short_description%7Bhtml+__typename%7DproductAttributes%7Bname+value+__typename%7Dprice%7BregularPrice%7Bamount%7Bcurrency+value+__typename%7D__typename%7D__typename%7Dspecial_price+special_to_date+thumbnail%7Bfile_small+url+__typename%7Durl_key+url_suffix+mp_label_data%7Benabled+name+priority+label_template+label_image+to_date+__typename%7D...on+ConfigurableProduct%7Bvariants%7Bproduct%7Bsku+special_price+price%7BregularPrice%7Bamount%7Bcurrency+value+__typename%7D__typename%7D__typename%7D__typename%7D__typename%7D__typename%7D__typename%7Dpage_info%7Btotal_pages+__typename%7Dtotal_count+__typename%7D%7D&operationName=category&variables=%7B%22currentPage%22%3A{i}%2C%22id%22%3A{cat}%2C%22filters%22%3A%7B%22category_id%22%3A%7B%22in%22%3A%22{cat}%22%7D%7D%2C%22pageSize%22%3A50%2C%22sort%22%3A%7B%22news_from_date%22%3A%22ASC%22%7D%7D'
                yield Request(url, self.parse)

    def parse(self, response, **kwargs):
        data = response.json()
        if data and data['data'] and data['data']['products'] and data['data']['products']['items']:
            for item in data['data']["products"]["items"]:
                yield {
                    "name": item["name"],
                    "price": item["price"]["regularPrice"]["amount"]["value"],
                    "description": item["short_description"]["html"]
                }


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Nomin)
    process.start()

P.S. The values I have for the number of pages might not be accurate; I just used what was visible at the bottom of each category's first page. Some of the categories might have more pages.
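If you would rather not hard-code the page counts at all, you could first request page 1 of each category and read `page_info.total_pages` from the JSON, since the query above already asks for it. A rough, untested sketch of methods that could replace `start_requests` inside the spider above (`build_url(cat, page)` is a hypothetical helper that fills the category ID and `currentPage` into the long GraphQL URL):

```python
    # Untested sketch: discover each category's page count instead of hard-coding it.
    def start_requests(self):
        for cat in categories:
            # fetch page 1 first, just to learn how many pages the category has
            yield Request(self.build_url(cat, 1), self.parse_first_page, cb_kwargs={"cat": cat})

    def parse_first_page(self, response, cat):
        data = response.json()
        total_pages = data["data"]["products"]["page_info"]["total_pages"]
        yield from self.parse(response)  # don't throw away the items on page 1
        for page in range(2, total_pages + 1):
            yield Request(self.build_url(cat, page), self.parse)
```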


Edit:

To send the category along with the request, you simply need to store the category name in the dictionary alongside the ID and number of pages, and then set it in the `cb_kwargs` parameter of each of the start_url requests.

for example:

categories = {
    "19653": {
        "pages": 4,
        "name": "Food"
    },
    "33456": {
        "pages": 12,
        "name": "Outdoor"
    }
}
# This is fake information I made up for the example

and then in your `start_requests` method:

def start_requests(self):
    for cat, val in categories.items():
        for page in range(1, val["pages"] + 1):
            url = .....
            yield scrapy.Request(
                url,
                callback=self.parse,
                cb_kwargs={"category": val["name"]}
            )

Then in your parse method:

def parse(self, response, category=None):
    data = response.json()
    if data and data['data'] and data['data']['products'] and data['data']['products']['items']:
        for item in data['data']["products"]["items"]:
            yield {
                "category": category,
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "special_price": item["special_price"],
                "description": item["short_description"]["html"]
            }
