Python WebScraper w/ BeautifulSoup: Not Scraping All Pages


Question


I'm a brand new coder who was tasked (by my company) with making a web scraper for eBay, to assist the CFO in finding inventory items when we need them. I've got it developed to scrape from multiple pages, but when the Pandas DataFrame loads, the number of results does not match how many pages it's supposed to be scraping. Here is the code (I am using iPads just for the sheer volume and degree of variance in the results):

import time
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

data = []

# searchkey = input()
# base_url = 'https://www.ebay.com/sch/i.html?_nkw=' + searchkey + '&_sacat=0&_ipg=240'
base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

for page in range(1, 11):
    # Build the URL for each results page and pause between requests
    page_url = base_url + '&_pgn=' + str(page)
    time.sleep(10)
    soup = BeautifulSoup(requests.get(page_url).text)
    # Grab every listing on the results page, then scrape each item page
    for links in soup.select('.srp-results li.s-item'):
        item_url = links.a['href']
        soup2 = BeautifulSoup(requests.get(item_url).text)
        for content in soup2.select('.lsp-c'):
            data.append({
                'item_name' : content.select_one('h1.x-item-title__mainTitle > span').text,
                'name' : 'Click Here to see Webpage',
                'url' : str(item_url),
                'hot' : "Hot!" if content.select_one('div.d-urgency') else "",
                'condition' : content.select_one('span.clipped').text,
                'price' : content.select_one('div.x-price-primary > span').text,
                'make offer' : 'Make Offer' if content.select_one('div.x-offer-action') else 'Contact Seller'
            })

df = pd.DataFrame(data)

# Combine the display text and URL so they can later be rendered as one clickable link
df['link'] = df['name'] + '#' + df['url']
def make_clickable_both(val):
    name, url = val.split('#')
    return f'<a href="{url}">{name}</a>'

df2 = df.drop(columns=['name', 'url'])
df2.style.format({'link': make_clickable_both})

The results appear like so:

item_name hot condition price make offer link
0 Apple iPad Air 2 2nd WiFi + Ce... Hot! Good - Refurbished US $169.99 Contact Seller Click Here to see Webpage
1 Apple iPad 2nd 3rd 4th Generat... Hot! Used US $64.99 Contact Seller Click Here to see Webpage
2 Apple iPad 6th 9.7" 2018 Wifi ... Very Good - Refurbished US $189.85 Contact Seller Click Here to see Webpage
3 Apple iPad Air 1st 2nd Generat... Hot! Used US $54.89/ea Contact Seller Click Here to see Webpage
4 Apple 10.2" iPad 9th Generatio... Hot! Open box US $269.00 Contact Seller Click Here to see Webpage
...
300 Apple iPad 8th 10.2" Wifi or... Good - Refurbished US $229.85 Contact Seller Click Here to see Webpage

Which is great! That last column is even a clickable link, just as the function defines, and it works properly. However, based on my URL it's just about half the data I should have received.

So, in the URL, the two key things related to this are page_url = base_url + '&_pgn=' + str(page), which is how I set the page number for each URL to pull the list of links from, and &_ipg=60, which determines how many items are loaded on each page (eBay has 3 options for this: 60, 120, 240). So, based on my current settings (pagination giving me 10 pages and the item amount set to 60), I should be seeing roughly 600 results or so, but instead I got 300. I added the timer to see if letting each page load a little longer would help me get all the results, but I've had no such luck. Anyone got ideas about what I did wrong, or what I can do to improve? Any bit of info is appreciated!
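
For reference, a minimal check along these lines (same base URL and selector as above, no per-item requests) prints how many listings each results page actually yields:

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    # Count the listing elements found on each results page
    print('page', page, '->', len(soup.select('.srp-results li.s-item')), 'listings')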

Answer 1

Score: 3


Starting at page 5, pages seem to be rendered differently and soup.select('.srp-results li.s-item') always returns an empty list (of URLs).

That is why data length remains stuck at 300, even though there are more results.

So, there is nothing wrong with your code and there is no need to pause for 10 seconds.

Leaving the code unchanged, your best option is to set &_ipg to 240; you then get more, if not all, of the results (after a certain time):

print(df.info())
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020 entries, 0 to 1019
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   item_name   1020 non-null   object
 1   name        1020 non-null   object
 2   url         1020 non-null   object
 3   hot         1020 non-null   object
 4   condition   1020 non-null   object
 5   price       1020 non-null   object
 6   make offer  1020 non-null   object
dtypes: object(7)
memory usage: 55.9+ KB
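
A minimal sketch of that change, keeping the rest of the loop exactly as in the question (the early break when a page comes back empty is just an extra guard, not something eBay requires):

import requests
from bs4 import BeautifulSoup

data = []
# Only _ipg changes, from 60 to 240; everything else matches the original loop
base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240'

for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    items = soup.select('.srp-results li.s-item')
    if not items:
        # The page rendered without the usual result list, so stop paginating
        break
    for links in items:
        item_url = links.a['href']
        # ... fetch item_url and append to data exactly as in the question ...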

Answer 2

Score: 0


I actually dug more into what popped up when parsing the HTML, and discovered it was because eBay denies bots access past 5 pages of results! So, changing my code to add:

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
soup = BeautifulSoup(requests.get(base_url, headers=headers).text)

actually fixes the issue! Should have known.
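
For completeness, this is roughly how the header slots into the original loop, assuming it needs to be sent on every request (the search pages as well as the individual item pages):

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    # Browser-like User-Agent on the search page request ...
    soup = BeautifulSoup(requests.get(page_url, headers=headers).text, 'html.parser')
    for links in soup.select('.srp-results li.s-item'):
        item_url = links.a['href']
        # ... and on each item page request as well
        soup2 = BeautifulSoup(requests.get(item_url, headers=headers).text, 'html.parser')
        # ... parse soup2 and append to the data list exactly as in the question ...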
