Efficiently scrape website by going through multiple different pages/categories

Question

I am having difficulties advancing my current scraping project/idea. I am attempting to web scrape all products on an online shop by category. The website link is: https://eshop.nomin.mn/.
Currently, with the aid of the great developers on this forum, I have been able to scrape the Food/Grocery category successfully using the online shop's data API (my code is at the very bottom of this post). While I could duplicate this success for other categories by changing the data API URL, I believe that would be very repetitive and inefficient.
Ideally I want to scrape all categories of the website using one spider rather than making a spider for each category. I do not know how to go about this, since in my previous projects the website's main page listed all the products, whereas this one does not. Furthermore, adding multiple data API URLs does not seem to be working for me. Each category has a different URL and a different data API, for example:
1. Electric products (https://eshop.nomin.mn/6011.html)
2. Food products (https://eshop.nomin.mn/n-foods.html)
3. Building materials (https://eshop.nomin.mn/n-building-materials-tools.html)
4. Automobile products and parts (https://eshop.nomin.mn/n-autoparts-tools.html)
5. etc.
The screenshot below shows how you can browse the website and the categories (translated to English).

![Category navigation on eshop.nomin.mn](https://i.stack.imgur.com/8jrhU.png)
Ideally my scraped end product would be a long table such as this. I have included Original Price and Listed Price separately, as some categories, such as the electric products, have two price elements in the HTML, as shown below.


<div class="item-specialPricetag-1JM">
    <span class="item-oldPrice-1sY">
        <span>1</span>
        <span>,</span>
        <span>899</span>
        <span>,</span>
        <span>990</span>
        <span>₮</span>
    </span>
</div>
<div class="item-webSpecial-Z6W">
    <span>1</span>
    <span>,</span>
    <span>599</span>
    <span>,</span>
    <span>990</span>
    <span>₮</span>
</div>

![Product listing showing original and discounted price](https://i.stack.imgur.com/rPWy2.png)
Below is my current working code, which successfully scrapes the food product category and retrieves the name, description, and price of 3000+ products. *Also, since I will be scraping multiple pages/categories, I think having a rotating/randomly generated header/user-agent would be smart. What would be the best way to integrate this idea? (A rough sketch of what I have in mind follows the code below.)*

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess  # needed for the __main__ block below
from datetime import datetime

BASE_URL = "https://eshop.nomin.mn/graphql?query=query+category($pageSize:Int!$currentPage:Int!$filters:ProductAttributeFilterInput!$sort:ProductAttributeSortInput){products(pageSize:$pageSize+currentPage:$currentPage+filter:$filters+sort:$sort){items{id+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal{created_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename}new_to_date+short_description{html+__typename}productAttributes{name+value+__typename}price{regularPrice{amount{currency+value+__typename}__typename}__typename}special_price+special_to_date+thumbnail{file_small+url+__typename}url_key+url_suffix+mp_label_data{enabled+name+priority+label_template+label_image+to_date+__typename}...on+ConfigurableProduct{variants{product{sku+special_price+price{regularPrice{amount{currency+value+__typename}__typename}__typename}__typename}__typename}__typename}__typename}page_info{total_pages+__typename}total_count+__typename}}&operationName=category&variables="


dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' Nomin CPI Foods Data'

class NominCPIFoodsSpider(scrapy.Spider):
    name = 'nomin_cpi_foods'
    allowed_domains = ['eshop.nomin.mn']  # allowed_domains expects domains, not URLs
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    # function used for start url
    def start_requests(self):
        for i in range(50):
            url = BASE_URL + '{"currentPage":' + str(i) + ',"id":24175,"filters":{"category_id":{"in":"24175"}},"pageSize":50,"sort":{"position":"DESC"}}'
            yield Request(url, self.parse)

    # function to parse
    def parse(self, response, **kwargs):
        data = response.json()
        print(data.keys())
        for item in data['data']["products"]["items"]:
            yield {
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "description": item["short_description"]["html"]
            }

        # handles pagination
        next_url = response.css("nav.custom-pagination > a.next::attr(href)").get()
        if next_url:
            yield scrapy.Request(next_url, self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(NominCPIFoodsSpider)
    process.start()
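Regarding the rotating user-agent idea above, this is roughly what I have in mind: a small downloader middleware that picks a random `User-Agent` for every request. This is an untested sketch; the UA strings and the `myproject.middlewares` path are just placeholders:

```python
import random

# Untested sketch of the rotating user-agent idea.
# The strings below are placeholders; a real list (or a package such as
# fake-useragent / scrapy-user-agents) could be swapped in instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent header on every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let the request continue through the remaining middlewares

# It would then be enabled in settings.py / custom_settings, e.g.:
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 400,
# }
```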

Sorry for the long post; any and all help is greatly appreciated. Thank you very much. ^^


# Answer 1
**Score**: 2

What you can do is visit each of the categories on the website, grab the API URL for that category, check how many pages of information that specific category has, and then extract the category ID from the URL and create a dictionary in your code that maps category IDs (keys) to page counts (values).

Then, in your `start_requests` method, instead of substituting only the current page with a variable, you do the same for the category. You can pretty much leave the rest unchanged.

One thing that is unnecessary is to continue parsing the actual web pages themselves. All of the information you need is available from the API, so yielding requests for the different HTML pages isn't really doing you any good.

Here is an example using a handful of the categories available on the site.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request
from datetime import datetime

# category ID -> number of pages, taken from the pagination shown on each category's first page
categories = {
    "19653": 4,
    "24175": 67,
    "21297": 48,
    "19518": 16,
    "19487": 40,
    "26011": 46,
    "19767": 3,
    "19469": 5,
    "19451": 4
}

dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' Nomin'


class Nomin(scrapy.Spider):
    name = 'nomin'
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    def start_requests(self):
        for cat, pages in categories.items():
            for i in range(1, pages + 1):  # currentPage is 1-indexed; + 1 so the last page is included
                url = f'https://eshop.nomin.mn/graphql?query=query+category%28%24pageSize%3AInt%21%24currentPage%3AInt%21%24filters%3AProductAttributeFilterInput%21%24sort%3AProductAttributeSortInput%29%7Bproducts%28pageSize%3A%24pageSize+currentPage%3A%24currentPage+filter%3A%24filters+sort%3A%24sort%29%7Bitems%7Bid+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal%7Bcreated_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename%7Dnew_to_date+short_description%7Bhtml+__typename%7DproductAttributes%7Bname+value+__typename%7Dprice%7BregularPrice%7Bamount%7Bcurrency+value+__typename%7D__typename%7D__typename%7Dspecial_price+special_to_date+thumbnail%7Bfile_small+url+__typename%7Durl_key+url_suffix+mp_label_data%7Benabled+name+priority+label_template+label_image+to_date+__typename%7D...on+ConfigurableProduct%7Bvariants%7Bproduct%7Bsku+special_price+price%7BregularPrice%7Bamount%7Bcurrency+value+__typename%7D__typename%7D__typename%7D__typename%7D__typename%7D__typename%7D__typename%7Dpage_info%7Btotal_pages+__typename%7Dtotal_count+__typename%7D%7D&operationName=category&variables=%7B%22currentPage%22%3A{i}%2C%22id%22%3A{cat}%2C%22filters%22%3A%7B%22category_id%22%3A%7B%22in%22%3A%22{cat}%22%7D%7D%2C%22pageSize%22%3A50%2C%22sort%22%3A%7B%22news_from_date%22%3A%22ASC%22%7D%7D'
                yield Request(url, self.parse)

    def parse(self, response, **kwargs):
        data = response.json()
        if data and data['data'] and data['data']['products'] and data['data']['products']['items']:
            for item in data['data']["products"]["items"]:
                yield {
                    "name": item["name"],
                    "price": item["price"]["regularPrice"]["amount"]["value"],
                    "description": item["short_description"]["html"]
                }


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Nomin)
    process.start()

P.S. The values I have for the number of pages might not be accurate; I just used what was visible at the bottom of each category's first page. Some of the categories might have more pages.
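If you would rather not hard-code the page counts at all, you could first request page 1 of each category and read `page_info.total_pages` from the JSON, since the query above already asks for it. A rough, untested sketch of methods that could replace `start_requests` inside the spider above (`build_url(cat, page)` is a hypothetical helper that fills the category ID and `currentPage` into the long GraphQL URL):

```python
    # Untested sketch: discover each category's page count instead of hard-coding it.
    def start_requests(self):
        for cat in categories:
            # fetch page 1 first, just to learn how many pages the category has
            yield Request(self.build_url(cat, 1), self.parse_first_page, cb_kwargs={"cat": cat})

    def parse_first_page(self, response, cat):
        data = response.json()
        total_pages = data["data"]["products"]["page_info"]["total_pages"]
        yield from self.parse(response)  # don't throw away the items on page 1
        for page in range(2, total_pages + 1):
            yield Request(self.build_url(cat, page), self.parse)
```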


Edit:

To send the category along with the request, you simply need to store the category name in the dictionary alongside the ID and number of pages, and then set it in the `cb_kwargs` parameter of each of the start_url requests.

for example:

categories = {
    "19653": {
        "pages": 4,
        "name": "Food"
    },
    "33456": {
        "pages": 12,
        "name": "Outdoor"
    }
}
# This is fake information I made up for the example

and then in your `start_requests` method:

def start_requests(self):
    for cat, val in categories.items():
        for page in range(1, val["pages"] + 1):
            url = .....
            yield scrapy.Request(
                url,
                callback=self.parse,
                cb_kwargs={"category": val["name"]}
            )

Then in your parse method:

def parse(self, response, category=None):
    data = response.json()
    if data and data['data'] and data['data']['products'] and data['data']['products']['items']:
        for item in data['data']["products"]["items"]:
            yield {
                "category": category,
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "special_price": item["special_price"],
                "description": item["short_description"]["html"]
            }
