Python: Webscraper Pandas DataFrame Returning Multiple Empty Rows in Between Data

# Question

So I'm building an eBay web scraper for work (I should note that I am incredibly new to programming in general, and am entirely self-taught using the internet), and I have got it functioning. I am building this with Python 3.11, in a Jupyter Notebook within Azure Data Studio. However, the CSV it writes (and consequently the Excel sheet) contains multiple empty rows:

name,condition,price,options,shipping
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
['Apple iPad 5 (5th Gen -2017 Model) -32GB -128GB - Wi-Fi + Cellular - Good'],['Good - Refurbished'],$149.00 to $199.00,['Buy It Now'],
,,,,
,,,,
,,,,
['Apple iPad Air 2 2nd WiFi + Cellular Unlocked 16GB 32GB 64GB 128GB - Good'],['Good - Refurbished'],$139.99 to $199.99,['Buy It Now'],['Free shipping']
,,,,
,,,,
,,,,
['Apple iPad 2nd 3rd 4th Generation 16GB 32GB 64GB 128GB PICK:GB - Color *Grade B*'],['Pre-Owned'],$64.99 to $199.99,['Buy It Now'],['Free shipping']
,,,,
,,,,
,,,,
etc.

Here is my code:

import time

import requests
import pandas
import lxml
import selenium
import html5lib

import pandas as pd
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.headless = True 
options.page_load_strategy = 'none'
chrome_path = ChromeDriverManager().install()

s = Service(chrome_path)
driver = Chrome(options=options, service=s) # headers=headers once I can get it working again

driver.implicitly_wait(5)
browser = webdriver.Chrome(service=s)

# searchkey = input() <-- this commented out portion is for when I have got it more functional so that I can do a more dynamic url
# url = 'https://www.ebay.com/sch/i.html?_nkw=' + searchkey + '&_sacat=0&_ipg=240'
url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240'

data = []

browser.get(url)
time.sleep(10)

content = browser.find_element(By.CSS_SELECTOR, "div[class*='srp-river-results']")
item_contents = content.find_elements(By.TAG_NAME, "li")

def extract_data(content):
    name = content.find_elements(By.CSS_SELECTOR, "div[class*='s-item__title']>span")
    if name:
        name = [attr.text for attr in name]
    else:
        name = None
    
    condition = content.find_elements(By.CSS_SELECTOR, "div[class*='s-item__subtitle']>span")
    if condition:
        condition = [attr.text for attr in condition]
    else:
        condition = None

    price = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__price']")
    if price:
        price = price[0].text
    else:
        price = None

    purchase_options = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__purchaseOptionsWithIcon']")
    if purchase_options:
        purchase_options = [attr.text for attr in purchase_options]
    else:
        purchase_options = None

    shipping = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__logisticsCost']")
    if shipping:
        shipping = [attr.text for attr in shipping]
    else:
        shipping = None
    
    return {
        "name": name,
        "condition": condition,
        "price": price,
        "options": purchase_options,
        "shipping": shipping
    }

for content in item_contents:
    extracted_data = extract_data(content)
    data.append(extracted_data)
df = pd.DataFrame(data)
df.to_csv("frame.csv", index=False)

Now, looking into the HTML with the Inspect tool, I discovered what I think the problem is. As I am using just the "li" tag in the "item_contents" variable, it seems to be attempting to pull the data sets for the river/carousel at the top (which is in the same div class and is stored in a "li" element), and then within each item card there is a potential for a "Top Rated" status, whose element includes 3 additional "li" elements.
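To make that diagnosis concrete, here is a minimal sketch using BeautifulSoup on a made-up HTML fragment. The markup is an assumption that only mimics the structure described above (one carousel `li`, item cards, and nested badge `li` elements); it is not eBay's real page. It shows how a bare `li` selector picks up every nested element, while a class-qualified selector such as `li.s-item` keeps only the item cards.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: a carousel <li>, then item cards,
# one of which contains three nested "badge" <li> elements.
html = """
<div class="srp-river-results">
  <ul class="srp-results">
    <li class="srp-river-answer">carousel</li>
    <li class="s-item">
      <div class="s-item__title"><span>iPad A</span></div>
      <ul><li>badge 1</li><li>badge 2</li><li>badge 3</li></ul>
    </li>
    <li class="s-item">
      <div class="s-item__title"><span>iPad B</span></div>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <li> under the results container, nested ones included.
all_li = soup.select("div.srp-river-results li")
# Only the <li> elements that are item cards.
item_li = soup.select(".srp-results li.s-item")


def title_of(li):
    """Return the item title inside this <li>, or None if it has none."""
    span = li.select_one("div.s-item__title > span")
    return span.get_text(strip=True) if span is not None else None


names_all = [title_of(li) for li in all_li]
names_items = [title_of(li) for li in item_li]

print(names_all)    # contains None for the carousel and badge <li> elements
print(names_items)  # titles of the item cards only
```

Each `None` in `names_all` corresponds to one of the empty CSV rows: a `li` that is not an item card still produces a record, just with nothing in it.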

The problem is, I don't actually know how to fix this? I attempted to adjust the tag selector to include the "data-viewport" bit, but that didn't seem to work in either By.CSS_SELECTOR or By.TAG_NAME, like so:

item_contents = content.find_elements(By.TAG_NAME, "li[data-viewport]")
item_contents = content.find_elements(By.TAG_NAME, "li[data-viewport*='trackableId']")
item_contents = content.find_elements(By.CSS_SELECTOR, "li[data-viewport]")
item_contents = content.find_elements(By.CSS_SELECTOR, "li[data-viewport*='trackableId']")

giving me entirely blank dataframes instead. I've tried searching how to better select my CSS elements, but I am struggling to get what I want, or at least the answers I've found seem to be geared towards different problems than mine. Using dropna works to just clear out those empty rows, but I feel like there must be a better way for me to select my tags or something so that I don't end up with data like this? If there isn't, though, I can just continue like that. Just wanting to learn how to better program, I suppose. Any assistance would be great! Thanks in advance!
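For reference, the `dropna` workaround mentioned above could look roughly like the sketch below. The records are invented sample data, not real scrape results. Passing `how="all"` drops only the rows where every column is missing, so a real item that merely lacks one field (for example shipping) is kept.

```python
import pandas as pd

# Invented sample records: two empty rows from non-item <li> elements,
# and two real items, one of which has no price.
records = [
    {"name": None, "condition": None, "price": None},
    {"name": "iPad A", "condition": "Pre-Owned", "price": "$64.99"},
    {"name": None, "condition": None, "price": None},
    {"name": "iPad B", "condition": "Brand New", "price": None},
]

df = pd.DataFrame(records)

# how="all" removes a row only when all of its values are missing.
cleaned = df.dropna(how="all").reset_index(drop=True)

print(cleaned)
```

This cleans up the output but still visits every irrelevant `li` during scraping, so a tighter selector is usually the better first fix.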

# Answer 1

**Score**: 2

Change your selection strategy and use a `dict` instead of several `list`s:

    for content in browser.find_elements(By.CSS_SELECTOR, ".srp-results li.s-item"):
        data.append({
            'name': content.find_element(By.CSS_SELECTOR, "div.s-item__title > span").text,
            'condition': content.find_element(By.CSS_SELECTOR, "div.s-item__subtitle > span").text,
            'price': content.find_element(By.CSS_SELECTOR, "span.s-item__price").text,
            'purchase_options': content.find_element(By.CSS_SELECTOR, "span.s-item__purchaseOptionsWithIcon").text if content.find_elements(By.CSS_SELECTOR, "span.s-item__purchaseOptionsWithIcon") else None,
            'shipping': content.find_element(By.CSS_SELECTOR, "span.s-item__logisticsCost").text if content.find_elements(By.CSS_SELECTOR, "span.s-item__logisticsCost") else None
        })

But this does not need the `selenium` overhead; simply use `requests`:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get('https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240').text)
    data = []

    for e in soup.select('.srp-results li.s-item'):
        data.append({
            'name': e.select_one('div.s-item__title > span').text,
            'condition': e.select_one('div.s-item__subtitle > span').text,
            'price': e.select_one('span.s-item__price').text,
            'purchase_options': e.select_one('span.s-item__purchaseOptionsWithIcon').text if e.select_one('span.s-item__purchaseOptionsWithIcon') else None,
            'shipping': e.select_one('span.s-item__logisticsCost').text if e.select_one('span.s-item__logisticsCost') else None
        })

    pd.DataFrame(data)

#### Output

|     | name                                                                           | condition               | price              | purchase_options   | shipping                  |
|----:|:-------------------------------------------------------------------------------|:------------------------|:-------------------|:-------------------|:--------------------------|
|   0 | Apple iPad Air 2 2nd WiFi + Cellular Unlocked 16GB 32GB 64GB 128GB - Good      | Good - Refurbished      | $139.99 to $199.99 | Buy It Now         | +$19.40 shipping          |
|   1 | Apple iPad 5 (5th Gen -2017 Model) -32GB -128GB - Wi-Fi + Cellular - Good      | Good - Refurbished      | $149.00 to $199.00 | Buy It Now         | Shipping not specified    |
|   2 | Apple iPad 5 - 5th Gen 2017 Model 9.7" - 32GB 128GB Wi-Fi - Cellular - Good         | Good - Refurbished      | $118.99            | Buy It Now         | +$19.09 shipping          |
|   3 | Apple iPad Air 1st Gen A1474 32GB Wi-Fi 9.7in Tablet Space Gray iOS 12 - Good  | Good - Refurbished      | $89.99             | Buy It Now         | +$18.65 shipping          |
|   4 | 2021 Apple iPad 9th Gen 64/256GB WiFi 10.2"                                         | Brand New               | $335.00 to $485.00 | Buy It Now         | +$34.87 shipping estimate |
| ... | ... | ... | ... | ... | ... |
| 250 | 2022 APPLE iPAD AIR 5TH GEN 10.9" 256GB STARLIGHT WI-FI TABLET MM9P3LL/A A2588      | Brand New               | $650.00            | or Best Offer      | +$21.45 shipping          |
| 251 | Apple iPad 2 16GB, Wi-Fi, 9.7in - Black  7 pack                                | Pre-Owned               | $17.50             |                    | +$48.63 shipping estimate |
| 252 | Apple iPad Air 4 (4th Gen) (10.9 inch) - 64GB - 256GB Wi-Fi + Cellular - Good  | Good - Refurbished      | $439.00 to $549.00 | Buy It Now         | +$40.14 shipping estimate |
| 253 | Apple iPad Air 2 A1567 (WiFi + Cellular Unlocked) 64GB Space Gray (Very Good)  | Very Good - Refurbished | $149.99            | Buy It Now         | +$19.55 shipping          |
| 254 | Apple iPad Pro, Bundle, 10.5-inch, 64GB, Space Gray, Wi-Fi Only, Original Box  | Pre-Owned               | $249.00            | Buy It Now         | +$29.72 shipping estimate |
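The `... if ... else None` pattern in this answer looks up each optional field twice, and the mandatory fields raise an error if an element is ever missing. One possible way to tidy this up is a small helper. The name `text_or_none` is hypothetical (it is not part of BeautifulSoup), and the HTML below is an invented fragment used only so that the sketch runs without a network request.

```python
from bs4 import BeautifulSoup


def text_or_none(element, selector):
    """Return the stripped text of the first match for selector, or None."""
    match = element.select_one(selector)
    return match.get_text(strip=True) if match is not None else None


# Invented fragment: one item card without a shipping element.
html = """
<ul class="srp-results">
  <li class="s-item">
    <div class="s-item__title"><span>iPad A</span></div>
    <span class="s-item__price">$149.99</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
data = []

for e in soup.select(".srp-results li.s-item"):
    data.append({
        "name": text_or_none(e, "div.s-item__title > span"),
        "price": text_or_none(e, "span.s-item__price"),
        "shipping": text_or_none(e, "span.s-item__logisticsCost"),
    })

print(data)
```

With this approach a missing field becomes `None` in the record instead of stopping the whole scrape with an `AttributeError`.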



# Answer 2
**Score**: 1

Based on HedgeHog's answer.

What I can highly recommend is using XPath and the lxml library to parse the HTML instead of BeautifulSoup, as it is much faster.

```py
import requests
import pandas as pd
from lxml import etree

response_text = requests.get('https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240').text

root = etree.HTML(response_text)

items = root.xpath(".//ul[@class='srp-results srp-list clearfix']/li[@class='s-item s-item__pl-on-bottom']")
data = []
for item in items:
    data.append({
        "name": item.xpath(".//div[@class='s-item__title']//text()")[0],
        "condition": item.xpath(".//div[@class='s-item__subtitle']/span/text()")[0],
        "price": "".join(item.xpath(".//span[@class='s-item__price']//text()")),
        "purchase_options": "".join(item.xpath(".//span[@class='s-item__dynamic s-item__purchaseOptionsWithIcon']//text()")),
        "shipping": "".join(item.xpath(".//span[@class='s-item__shipping s-item__logisticsCost']//text()"))
    })

df = pd.DataFrame(data)
```

Comparison between the two: (image not preserved)
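One caveat with the XPath in this answer: `@class='...'` compares the whole attribute string, so it stops matching as soon as the page adds or reorders a class. A common XPath 1.0 idiom matches a single class token instead. The sketch below shows the difference on an invented fragment; the class `s-item--large` is hypothetical and only there to illustrate the point.

```python
from lxml import etree

# Invented fragment: the second item carries one extra (hypothetical) class.
html = """
<ul class="srp-results srp-list clearfix">
  <li class="s-item s-item__pl-on-bottom"><div class="s-item__title">iPad A</div></li>
  <li class="s-item s-item__pl-on-bottom s-item--large"><div class="s-item__title">iPad B</div></li>
  <li class="s-item__sep">separator</li>
</ul>
"""

root = etree.HTML(html)

# Exact string comparison: misses the item with the extra class.
exact = root.xpath(".//li[@class='s-item s-item__pl-on-bottom']")

# Token match: pad the class list with spaces, then look for ' s-item '.
token = root.xpath(
    ".//li[contains(concat(' ', normalize-space(@class), ' '), ' s-item ')]"
)

titles = [li.xpath("string(.//div[@class='s-item__title'])") for li in token]

print(len(exact), len(token), titles)
```

The token form behaves like the CSS selector `li.s-item`, and it does not match `s-item__sep`, because the padded comparison requires a space on both sides of `s-item`.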

huangapple
  • Posted on February 14, 2023, 02:14:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75439743.html