How to put book information from Amazon into a table form?
Question
I have an Amazon link (shown in the code below). I'm able to extract the Product details section
from the HTML with soup.select("#detailBullets_feature_div .a-unordered-list:first-child"), i.e.,
<ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
<li><span class="a-list-item"> <span class="a-text-bold">Publisher : </span> <span>Matrix Editions (January 1, 2007)</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Language : </span> <span>English</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Hardcover : </span> <span>640 pages</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">ISBN-10 : </span> <span>0971576610</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">ISBN-13 : </span> <span>978-0971576612</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Item Weight : </span> <span>2.3 pounds</span> </span></li>
</ul>
Could you explain how to put the above data in tabular form? My expected result is a table with these fields as columns (original screenshot not included). Here is the code I have so far:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
link = 'https://www.amazon.com/Functional-Analysis-Dzung-Ha/dp/0971576610'
driver = webdriver.Chrome(service = Service(ChromeDriverManager().install()))
driver.get(link)
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.select("#detailBullets_feature_div .a-unordered-list:first-child")
Answer 1
Score: 1
You can build a dictionary by using the span.a-text-bold texts as keys and the rest of the text in each span.a-list-item as the values (as in the function defined below); that dictionary can then be used to form a row in a DataFrame.
def get_product_details(prod_soup):
    # bold label spans inside each detail-bullet list item
    rkSel = 'span.a-list-item>span.a-text-bold:first-child'
    rkSel = f'div#detailBullets_feature_div {rkSel}'
    pDets = {}
    for r in prod_soup.select(rkSel):
        k = r.get_text(' ', strip=True)
        # the value is the list item's text with the bold label removed
        v = r.parent.get_text(' ', strip=True).replace(k, '', 1).strip()
        # make multi-part Best Sellers Rank values easier to read
        if 'Rank' in k and v[:1] == '#':
            v = ' • #'.join(v.split(' #'))
        # strip colons, spaces, and the invisible direction marks Amazon adds
        pDets[k.split('\n')[0].strip(': \u200f\u200e')] = v
    return pDets
For your link, get_product_details(soup) should return

{'Publisher': 'Matrix Editions (January 1, 2007)',
 'Language': 'English',
 'Hardcover': '640 pages',
 'ISBN-10': '0971576610',
 'ISBN-13': '978-0971576612',
 'Item Weight': '2.3 pounds',
 'Best Sellers Rank': '#7,261,544 in Books ( See Top 100 in Books )',
 'Customer Reviews': '5.0 out of 5 stars 2 ratings'}
and pd.DataFrame([get_product_details(soup)]) should return a one-row DataFrame with those keys as the columns (screenshot omitted here).
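As a self-contained sketch of the whole flow, the same extraction logic can be run offline against two rows of the snippet from the question (wrapped in the container div the selector expects, which is an assumption about the live page's structure), so no Selenium is needed:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Two rows of the snippet from the question, wrapped in the container div
# the selector expects (assumed to mirror the live page's structure).
html = """
<div id="detailBullets_feature_div">
<ul class="a-unordered-list detail-bullet-list">
<li><span class="a-list-item"> <span class="a-text-bold">Publisher : </span> <span>Matrix Editions (January 1, 2007)</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">ISBN-10 : </span> <span>0971576610</span> </span></li>
</ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
details = {}
for bold in soup.select('div#detailBullets_feature_div span.a-list-item>span.a-text-bold:first-child'):
    label = bold.get_text(' ', strip=True)
    # the value is the list item's text with the bold label removed
    value = bold.parent.get_text(' ', strip=True).replace(label, '', 1).strip()
    details[label.strip(' :')] = value  # drop the trailing ' :' from the label

df = pd.DataFrame([details])
print(details)   # {'Publisher': 'Matrix Editions (January 1, 2007)', 'ISBN-10': '0971576610'}
print(df.shape)  # (1, 2) -- one row, one column per detail
```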
You could loop through multiple links to fill multiple rows:

urls_list = [
    'https://www.amazon.com/Atomic-Habits-Proven-Build-Break/dp/0735211299',
    'https://www.amazon.com/Functional-Analysis-Dzung-Ha/dp/0971576610',
    'https://www.amazon.com/Daisy-Jones-Taylor-Jenkins-Reid/dp/1524798649',
    'https://www.amazon.com/Body-Keeps-Score-Healing-Trauma/dp/0143127748',
    'https://www.amazon.com/History-Uncertain-Future-Handwriting/dp/1620402157'
]
books_list = []
for url in urls_list:
    # helper (defined elsewhere by the answerer) that fetches the page
    # and returns a BeautifulSoup object
    soup = linkToSoup_scrapingAnt(url)
    # books_list.append(get_product_details(soup)) ## no extra columns
    kList, sList = ['Title', 'Author'], ['span#productTitle', 'span.author']
    book_info = {k: t.get_text(' ', strip=True) if t else None
                 for k, t in zip(kList, [soup.select_one(s) for s in sList])}
    books_list.append({**book_info, **get_product_details(soup)})
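Once books_list is filled, the row dictionaries go straight into the DataFrame constructor; any field missing from a given book simply becomes NaN in that row. A minimal sketch with made-up rows (illustrative values only, not real scrape results):

```python
import pandas as pd

# Illustrative rows only -- real ones would come from get_product_details
books_list = [
    {'Title': 'Book A', 'ISBN-10': '0971576610', 'Hardcover': '640 pages'},
    {'Title': 'Book B', 'Language': 'English'},
]
df = pd.DataFrame(books_list)
print(df.columns.tolist())             # ['Title', 'ISBN-10', 'Hardcover', 'Language']
print(df['ISBN-10'].isna().tolist())   # [False, True] -- missing field becomes NaN
```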
Note: If you want only the section covered by the HTML snippet in your question (i.e., without the Best Sellers Rank and Customer Reviews keys/columns), you can define rkSel as

rkSel = 'ul:first-of-type span.a-list-item>span.a-text-bold:first-child'
rkSel = f'h2+div#detailBullets_feature_div {rkSel}'
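To see the effect of the narrower selector, here is a toy structure (an assumption about the page: an h2 heading directly before the detail div, whose first ul holds the bullet details and whose second ul holds the rank) where it matches only the first list:

```python
from bs4 import BeautifulSoup

# Toy structure assumed to mimic the page layout the selector targets.
html = """
<h2>Product details</h2>
<div id="detailBullets_feature_div">
<ul><li><span class="a-list-item"><span class="a-text-bold">Publisher : </span><span>Matrix Editions</span></span></li></ul>
<ul><li><span class="a-list-item"><span class="a-text-bold">Best Sellers Rank : </span><span>#7,261,544 in Books</span></span></li></ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
rkSel = 'ul:first-of-type span.a-list-item>span.a-text-bold:first-child'
rkSel = f'h2+div#detailBullets_feature_div {rkSel}'
labels = [b.get_text(' ', strip=True) for b in soup.select(rkSel)]
print(labels)  # ['Publisher :'] -- the Rank row in the second ul is skipped
```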