Python WebScraper w/ BeautifulSoup: Not Scraping All Pages


Question


I'm a brand new coder who was tasked (by my company) with making a web scraper for eBay, to assist the CFO in finding inventory items when we need them. I've got it developed to scrape from multiple pages, but when the Pandas DataFrame loads, the number of results does not match how many pages it's supposed to be scraping. Here is the code (I am using iPads just for the sheer volume and degree of variance in the results):

import time
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

data = []

# searchkey = input()
# base_url = 'https://www.ebay.com/sch/i.html?_nkw=' + searchkey + '&_sacat=0&_ipg=240'
base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

for page in range(1, 11):
    # Build the URL for each results page and pause between requests
    page_url = base_url + '&_pgn=' + str(page)
    time.sleep(10)
    soup = BeautifulSoup(requests.get(page_url).text)
    # Grab every listing on the results page, then scrape each item page
    for links in soup.select('.srp-results li.s-item'):
        item_url = links.a['href']
        soup2 = BeautifulSoup(requests.get(item_url).text)
        for content in soup2.select('.lsp-c'):
            data.append({
                'item_name' : content.select_one('h1.x-item-title__mainTitle > span').text,
                'name' : 'Click Here to see Webpage',
                'url' : str(item_url),
                'hot' : "Hot!" if content.select_one('div.d-urgency') else "",
                'condition' : content.select_one('span.clipped').text,
                'price' : content.select_one('div.x-price-primary > span').text,
                'make offer' : 'Make Offer' if content.select_one('div.x-offer-action') else 'Contact Seller'
            })

df = pd.DataFrame(data)

# Combine the display text and URL so they can later be rendered as one clickable link
df['link'] = df['name'] + '#' + df['url']
def make_clickable_both(val):
    name, url = val.split('#')
    return f'<a href="{url}">{name}</a>'

df2 = df.drop(columns=['name', 'url'])
df2.style.format({'link': make_clickable_both})

The results appear like so:

item_name hot condition price make offer link
0 Apple iPad Air 2 2nd WiFi + Ce... Hot! Good - Refurbished US $169.99 Contact Seller Click Here to see Webpage
1 Apple iPad 2nd 3rd 4th Generat... Hot! Used US $64.99 Contact Seller Click Here to see Webpage
2 Apple iPad 6th 9.7" 2018 Wifi ... Very Good - Refurbished US $189.85 Contact Seller Click Here to see Webpage
3 Apple iPad Air 1st 2nd Generat... Hot! Used US $54.89/ea Contact Seller Click Here to see Webpage
4 Apple 10.2" iPad 9th Generatio... Hot! Open box US $269.00 Contact Seller Click Here to see Webpage
...
300 Apple iPad 8th 10.2" Wifi or... Good - Refurbished US $229.85 Contact Seller Click Here to see Webpage

Which is great! That last column is even a clickable link, just as the function defines, and it works properly. However, based on my URL it's just about half the data I should have received.

So, in the URL, the two key things related to this are page_url = base_url + '&_pgn=' + str(page), which is how I set the page number for each URL to pull the list of links from, and &_ipg=60, which determines how many items are loaded on each page (eBay has 3 options for this: 60, 120, 240). So, based on my current settings (pagination giving me 10 pages and the item amount set to 60), I should be seeing roughly 600 results or so, but instead I got 300. I added the timer to see if letting each page load a little longer would help me get all the results, but I've had no such luck. Anyone got ideas about what I did wrong, or what I can do to improve? Any bit of info is appreciated!
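
For reference, a minimal check along these lines (same base URL and selector as above, no per-item requests) prints how many listings each results page actually yields:

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    # Count the listing elements found on each results page
    print('page', page, '->', len(soup.select('.srp-results li.s-item')), 'listings')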

Answer 1

Score: 3


Starting at page 5, pages seem to be rendered differently and soup.select('.srp-results li.s-item') always returns an empty list (of URLs).

That is why data length remains stuck at 300, even though there are more results.

So, there is nothing wrong with your code and there is no need to pause for 10 seconds.

Leaving the code unchanged, your best option is to set &_ipg to 240; you then get more, if not all, of the results (after a certain time):

print(df.info())
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020 entries, 0 to 1019
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   item_name   1020 non-null   object
 1   name        1020 non-null   object
 2   url         1020 non-null   object
 3   hot         1020 non-null   object
 4   condition   1020 non-null   object
 5   price       1020 non-null   object
 6   make offer  1020 non-null   object
dtypes: object(7)
memory usage: 55.9+ KB
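
A minimal sketch of that change, keeping the rest of the loop exactly as in the question (the early break when a page comes back empty is just an extra guard, not something eBay requires):

import requests
from bs4 import BeautifulSoup

data = []
# Only _ipg changes, from 60 to 240; everything else matches the original loop
base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240'

for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    items = soup.select('.srp-results li.s-item')
    if not items:
        # The page rendered without the usual result list, so stop paginating
        break
    for links in items:
        item_url = links.a['href']
        # ... fetch item_url and append to data exactly as in the question ...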

Answer 2

Score: 0


I actually dug more into what popped up when parsing the HTML, and discovered it was because eBay denies bots access past 5 pages of results! So, changing my code to add:

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
soup = BeautifulSoup(requests.get(base_url, headers=headers).text)

actually fixes the issue! Should have known.
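
For completeness, this is roughly how the header slots into the original loop, assuming it needs to be sent on every request (the search pages as well as the individual item pages):

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    # Browser-like User-Agent on the search page request ...
    soup = BeautifulSoup(requests.get(page_url, headers=headers).text, 'html.parser')
    for links in soup.select('.srp-results li.s-item'):
        item_url = links.a['href']
        # ... and on each item page request as well
        soup2 = BeautifulSoup(requests.get(item_url, headers=headers).text, 'html.parser')
        # ... parse soup2 and append to the data list exactly as in the question ...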
