Python: Webscraper Pandas Dataframe Returning Multiple Empty Rows in Between Data

Question
So I'm building an eBay webscraper for work (I should note that I am incredibly new to programming in general, and am entirely self-taught using the internet), and I have made it function. I am building this with Python 3.11, in a Jupyter Notebook within Azure Data Studio. However, the csv it returns (and consequently the Excel sheet) contains multiple empty rows in between the data:
name,condition,price,options,shipping
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
['Apple iPad 5 (5th Gen -2017 Model) -32GB -128GB - Wi-Fi + Cellular - Good'],['Good - Refurbished'],$149.00 to $199.00,['Buy It Now'],
,,,,
,,,,
,,,,
['Apple iPad Air 2 2nd WiFi + Cellular Unlocked 16GB 32GB 64GB 128GB - Good'],['Good - Refurbished'],$139.99 to $199.99,['Buy It Now'],['Free shipping']
,,,,
,,,,
,,,,
['Apple iPad 2nd 3rd 4th Generation 16GB 32GB 64GB 128GB PICK:GB - Color *Grade B*'],['Pre-Owned'],$64.99 to $199.99,['Buy It Now'],['Free shipping']
,,,,
,,,,
,,,,
etc. . .
Here is my code:
```py
import time
import requests
import pandas
import lxml
import selenium
import html5lib
import pandas as pd

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.headless = True
options.page_load_strategy = 'none'

chrome_path = ChromeDriverManager().install()
s = Service(chrome_path)
driver = Chrome(options=options, service=s)  # headers=headers once I can get it working again
driver.implicitly_wait(5)
browser = webdriver.Chrome(service=s)

# searchkey = input() <-- this commented out portion is for when I have got it more functional so that I can do a more dynamic url
# url = 'https://www.ebay.com/sch/i.html?_nkw=' + searchkey + '&_sacat=0&_ipg=240'
url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240'

data = []

browser.get(url)
time.sleep(10)

content = browser.find_element(By.CSS_SELECTOR, "div[class*='srp-river-results']")
item_contents = content.find_elements(By.TAG_NAME, "li")

def extract_data(content):
    name = content.find_elements(By.CSS_SELECTOR, "div[class*='s-item__title']>span")
    if name:
        name = [attr.text for attr in name]
    else:
        name = None
    condition = content.find_elements(By.CSS_SELECTOR, "div[class*='s-item__subtitle']>span")
    if condition:
        condition = [attr.text for attr in condition]
    else:
        condition = None
    price = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__price']")
    if price:
        price = price[0].text
    else:
        price = None
    purchase_options = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__purchaseOptionsWithIcon']")
    if purchase_options:
        purchase_options = [attr.text for attr in purchase_options]
    else:
        purchase_options = None
    shipping = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__logisticsCost']")
    if shipping:
        shipping = [attr.text for attr in shipping]
    else:
        shipping = None
    return {
        "name": name,
        "condition": condition,
        "price": price,
        "options": purchase_options,
        "shipping": shipping
    }

for content in item_contents:
    extracted_data = extract_data(content)
    data.append(extracted_data)

df = pd.DataFrame(data)
df.to_csv("frame.csv", index=False)
```
Now, looking into the HTML with the Inspect tool, I discovered what I think the problem is. As I am using just the "li" tag in the "item_contents" variable, it seems to be attempting to pull the data sets for the river/carousel at the top (which is in the same div class and is stored in a "li" element), and then within each item card there is a potential for a "Top Rated" status, whose element includes 3 additional "li" elements.
The problem is, I don't actually know how to fix this. I attempted to adjust the tag selector to include the "data-viewport" bit, but that didn't seem to work in either By.CSS_SELECTOR or By.TAG_NAME, like so:
```py
item_contents = content.find_elements(By.TAG_NAME, "li[data-viewport]")
item_contents = content.find_elements(By.TAG_NAME, "li[data-viewport*='trackableId']")
item_contents = content.find_elements(By.CSS_SELECTOR, "li[data-viewport]")
item_contents = content.find_elements(By.CSS_SELECTOR, "li[data-viewport*='trackableId']")
```
giving me entirely blank dataframes instead. I've tried searching for how to better select my CSS elements, but I'm struggling to get what I want, or at least the answers I've found seem geared towards different problems than mine. Using dropna works to clear out those empty rows, but I feel like there must be a better way to select my tags so that I don't end up with data like this. If there isn't, though, I can just continue like that. I'm just wanting to learn how to program better, I suppose. Any assistance would be great! Thanks in advance!
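For reference, the `dropna` fallback mentioned above can be kept to a single line. A minimal sketch (the column names match the CSV header; the sample rows are made up to mimic the scraper's output):

```python
import pandas as pd

# Rows as the scraper produces them: real items mixed with all-None
# rows coming from the non-item <li> elements.
data = [
    {"name": None, "condition": None, "price": None, "options": None, "shipping": None},
    {"name": ["iPad 5"], "condition": ["Good - Refurbished"], "price": "$149.00",
     "options": ["Buy It Now"], "shipping": None},
    {"name": None, "condition": None, "price": None, "options": None, "shipping": None},
]

df = pd.DataFrame(data)
# Drop only the rows where *every* column is missing, keeping partial
# rows (e.g. items with no shipping info).
df = df.dropna(how="all").reset_index(drop=True)
print(len(df))  # only the one real item survives
```

`how="all"` is the important bit: the default `how="any"` would also delete legitimate items that merely lack a shipping cost.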
# Answer 1
**Score**: 2
Change your selection strategy and use `dict` instead of several `lists`:
```py
for content in browser.find_elements(By.CSS_SELECTOR, ".srp-results li.s-item"):
    data.append({
        'name': content.find_element(By.CSS_SELECTOR, "div.s-item__title > span").text,
        'condition': content.find_element(By.CSS_SELECTOR, "div.s-item__subtitle > span").text,
        'price': content.find_element(By.CSS_SELECTOR, "span.s-item__price").text,
        'purchase_options': content.find_element(By.CSS_SELECTOR, "span.s-item__purchaseOptionsWithIcon").text if len(content.find_elements(By.CSS_SELECTOR, "span.s-item__purchaseOptionsWithIcon")) > 0 else None,
        'shipping': content.find_element(By.CSS_SELECTOR, "span.s-item__logisticsCost").text if len(content.find_elements(By.CSS_SELECTOR, "span.s-item__logisticsCost")) > 0 else None
    })
```
But this does not need the `selenium` overhead; simply use `requests`:
```py
import requests
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240').text, 'html.parser')
data = []

for e in soup.select('.srp-results li.s-item'):
    data.append({
        'name': e.select_one('div.s-item__title > span').text,
        'condition': e.select_one('div.s-item__subtitle > span').text,
        'price': e.select_one('span.s-item__price').text,
        'purchase_options': e.select_one('span.s-item__purchaseOptionsWithIcon').text if e.select_one('span.s-item__purchaseOptionsWithIcon') else None,
        'shipping': e.select_one('span.s-item__logisticsCost').text if e.select_one('span.s-item__logisticsCost') else None
    })

pd.DataFrame(data)
```
#### Output
| | name | condition | price | purchase_options | shipping |
|----:|:-------------------------------------------------------------------------------|:------------------------|:-------------------|:-------------------|:--------------------------|
| 0 | Apple iPad Air 2 2nd WiFi + Cellular Unlocked 16GB 32GB 64GB 128GB - Good | Good - Refurbished | $139.99 to $199.99 | Buy It Now | +$19.40 shipping |
| 1 | Apple iPad 5 (5th Gen -2017 Model) -32GB -128GB - Wi-Fi + Cellular - Good | Good - Refurbished | $149.00 to $199.00 | Buy It Now | Shipping not specified |
| 2 | Apple iPad 5 - 5th Gen 2017 Model 9.7" - 32GB 128GB Wi-Fi - Cellular - Good | Good - Refurbished | $118.99 | Buy It Now | +$19.09 shipping |
| 3 | Apple iPad Air 1st Gen A1474 32GB Wi-Fi 9.7in Tablet Space Gray iOS 12 - Good | Good - Refurbished | $89.99 | Buy It Now | +$18.65 shipping |
| 4 | 2021 Apple iPad 9th Gen 64/256GB WiFi 10.2" | Brand New | $335.00 to $485.00 | Buy It Now | +$34.87 shipping estimate |
| ... | | | | | |
| 250 | 2022 APPLE iPAD AIR 5TH GEN 10.9" 256GB STARLIGHT WI-FI TABLET MM9P3LL/A A2588 | Brand New | $650.00 | or Best Offer | +$21.45 shipping |
| 251 | Apple iPad 2 16GB, Wi-Fi, 9.7in - Black 7 pack | Pre-Owned | $17.50 | | +$48.63 shipping estimate |
| 252 | Apple iPad Air 4 (4th Gen) (10.9 inch) - 64GB - 256GB Wi-Fi + Cellular - Good | Good - Refurbished | $439.00 to $549.00 | Buy It Now | +$40.14 shipping estimate |
| 253 | Apple iPad Air 2 A1567 (WiFi + Cellular Unlocked) 64GB Space Gray (Very Good) | Very Good - Refurbished | $149.99 | Buy It Now | +$19.55 shipping |
| 254 | Apple iPad Pro, Bundle, 10.5-inch, 64GB, Space Gray, Wi-Fi Only, Original Box | Pre-Owned | $249.00 | Buy It Now | +$29.72 shipping estimate |
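The repeated `select_one(...).text if select_one(...) else None` pattern in the code above can also be factored into a small helper. A sketch (the helper name `text_or_none` is mine, and the HTML snippet is a made-up stand-in for an eBay item card):

```python
from bs4 import BeautifulSoup

def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if absent."""
    el = parent.select_one(selector)
    return el.get_text(strip=True) if el else None

# Toy stand-in for one search-result card.
html = """
<li class="s-item">
  <div class="s-item__title"><span>Apple iPad 5</span></div>
  <span class="s-item__price">$149.00</span>
</li>
"""
item = BeautifulSoup(html, "html.parser").select_one("li.s-item")
row = {
    "name": text_or_none(item, "div.s-item__title > span"),
    "price": text_or_none(item, "span.s-item__price"),
    "shipping": text_or_none(item, "span.s-item__logisticsCost"),  # absent -> None
}
print(row)  # {'name': 'Apple iPad 5', 'price': '$149.00', 'shipping': None}
```

This keeps each dict entry to one call and avoids running every selector twice.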
# Answer 2
**Score**: 1
Based on HedgeHog's answer: what I can highly recommend is using XPath and the lxml library to parse the HTML instead of BeautifulSoup, as it is much faster.
```py
import requests
import pandas as pd
from lxml import etree

response_text = requests.get('https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240').text
root = etree.HTML(response_text)

items = root.xpath(".//ul[@class='srp-results srp-list clearfix']/li[@class='s-item s-item__pl-on-bottom']")
data = []
for item in items:
    data.append({
        "name": item.xpath(".//div[@class='s-item__title']//text()")[0],
        "condition": item.xpath(".//div[@class='s-item__subtitle']/span/text()")[0],
        "price": "".join(item.xpath(".//span[@class='s-item__price']//text()")),
        "purchase_options": "".join(item.xpath(".//span[@class='s-item__dynamic s-item__purchaseOptionsWithIcon']//text()")),
        "shipping": "".join(item.xpath(".//span[@class='s-item__shipping s-item__logisticsCost']//text()"))
    })

df = pd.DataFrame(data)
```
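To check the speed claim yourself, a minimal benchmark sketch is below. It runs both parsers over the same inline HTML snippet rather than a live eBay page, so the absolute numbers are meaningless; only the relative timings matter:

```python
import timeit
from bs4 import BeautifulSoup
from lxml import etree

# Small stand-in document; a real comparison would use a saved eBay page.
html = "<ul class='srp-results'>" + "".join(
    f"<li class='s-item'><span class='s-item__price'>${i}.99</span></li>"
    for i in range(200)
) + "</ul>"

def with_bs4():
    soup = BeautifulSoup(html, "html.parser")
    return [e.text for e in soup.select("li.s-item span.s-item__price")]

def with_lxml():
    root = etree.HTML(html)
    return root.xpath(".//li[@class='s-item']/span[@class='s-item__price']/text()")

# Sanity check: both approaches must extract identical data.
assert with_bs4() == with_lxml()

print("bs4 :", timeit.timeit(with_bs4, number=50))
print("lxml:", timeit.timeit(with_lxml, number=50))
```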