How to put book information from Amazon into a table form?
Question
I have an Amazon link (shown in the code below). I'm able to extract the Product details section
from the HTML with soup.select("#detailBullets_feature_div .a-unordered-list:first-child"), i.e.,
<ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
<li><span class="a-list-item"> <span class="a-text-bold">Publisher : </span> <span>Matrix Editions (January 1, 2007)</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Language : </span> <span>English</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Hardcover : </span> <span>640 pages</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">ISBN-10 : </span> <span>0971576610</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">ISBN-13 : </span> <span>978-0971576612</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Item Weight : </span> <span>2.3 pounds</span> </span></li>
</ul>
Could you explain how to put the above data in tabular form? My expected result is a table with these fields as columns (original screenshot not included). Here is the code I have so far:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
link = 'https://www.amazon.com/Functional-Analysis-Dzung-Ha/dp/0971576610'
driver = webdriver.Chrome(service = Service(ChromeDriverManager().install()))
driver.get(link)
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.select("#detailBullets_feature_div .a-unordered-list:first-child")
Answer 1
Score: 1
You can build a dictionary by using the span.a-text-bold texts as keys and the rest of the text in each span.a-list-item as the values (as in the function defined below); that dictionary can then be used to form a row in a DataFrame.
def get_product_details(prod_soup):
    # bold label spans inside each detail-bullet list item
    rkSel = 'span.a-list-item>span.a-text-bold:first-child'
    rkSel = f'div#detailBullets_feature_div {rkSel}'
    pDets = {}
    for r in prod_soup.select(rkSel):
        k = r.get_text(' ', strip=True)
        # the value is the list item's text with the bold label removed
        v = r.parent.get_text(' ', strip=True).replace(k, '', 1).strip()
        # make multi-part Best Sellers Rank values easier to read
        if 'Rank' in k and v[:1] == '#':
            v = ' • #'.join(v.split(' #'))
        # strip colons, spaces, and the invisible direction marks Amazon adds
        pDets[k.split('\n')[0].strip(': \u200f\u200e')] = v
    return pDets
For your link, get_product_details(soup) should return

{'Publisher': 'Matrix Editions (January 1, 2007)',
 'Language': 'English',
 'Hardcover': '640 pages',
 'ISBN-10': '0971576610',
 'ISBN-13': '978-0971576612',
 'Item Weight': '2.3 pounds',
 'Best Sellers Rank': '#7,261,544 in Books ( See Top 100 in Books )',
 'Customer Reviews': '5.0 out of 5 stars 2 ratings'}
and pd.DataFrame([get_product_details(soup)]) should return a one-row DataFrame with those keys as the columns (screenshot omitted here).
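As a self-contained sketch of the whole flow, the same extraction logic can be run offline against two rows of the snippet from the question (wrapped in the container div the selector expects, which is an assumption about the live page's structure), so no Selenium is needed:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Two rows of the snippet from the question, wrapped in the container div
# the selector expects (assumed to mirror the live page's structure).
html = """
<div id="detailBullets_feature_div">
<ul class="a-unordered-list detail-bullet-list">
<li><span class="a-list-item"> <span class="a-text-bold">Publisher : </span> <span>Matrix Editions (January 1, 2007)</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">ISBN-10 : </span> <span>0971576610</span> </span></li>
</ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
details = {}
for bold in soup.select('div#detailBullets_feature_div span.a-list-item>span.a-text-bold:first-child'):
    label = bold.get_text(' ', strip=True)
    # the value is the list item's text with the bold label removed
    value = bold.parent.get_text(' ', strip=True).replace(label, '', 1).strip()
    details[label.strip(' :')] = value  # drop the trailing ' :' from the label

df = pd.DataFrame([details])
print(details)   # {'Publisher': 'Matrix Editions (January 1, 2007)', 'ISBN-10': '0971576610'}
print(df.shape)  # (1, 2) -- one row, one column per detail
```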
You could loop through multiple links to fill multiple rows:

urls_list = [
    'https://www.amazon.com/Atomic-Habits-Proven-Build-Break/dp/0735211299',
    'https://www.amazon.com/Functional-Analysis-Dzung-Ha/dp/0971576610',
    'https://www.amazon.com/Daisy-Jones-Taylor-Jenkins-Reid/dp/1524798649',
    'https://www.amazon.com/Body-Keeps-Score-Healing-Trauma/dp/0143127748',
    'https://www.amazon.com/History-Uncertain-Future-Handwriting/dp/1620402157'
]
books_list = []
for url in urls_list:
    # helper (defined elsewhere by the answerer) that fetches the page
    # and returns a BeautifulSoup object
    soup = linkToSoup_scrapingAnt(url)
    # books_list.append(get_product_details(soup)) ## no extra columns
    kList, sList = ['Title', 'Author'], ['span#productTitle', 'span.author']
    book_info = {k: t.get_text(' ', strip=True) if t else None
                 for k, t in zip(kList, [soup.select_one(s) for s in sList])}
    books_list.append({**book_info, **get_product_details(soup)})
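Once books_list is filled, the row dictionaries go straight into the DataFrame constructor; any field missing from a given book simply becomes NaN in that row. A minimal sketch with made-up rows (illustrative values only, not real scrape results):

```python
import pandas as pd

# Illustrative rows only -- real ones would come from get_product_details
books_list = [
    {'Title': 'Book A', 'ISBN-10': '0971576610', 'Hardcover': '640 pages'},
    {'Title': 'Book B', 'Language': 'English'},
]
df = pd.DataFrame(books_list)
print(df.columns.tolist())             # ['Title', 'ISBN-10', 'Hardcover', 'Language']
print(df['ISBN-10'].isna().tolist())   # [False, True] -- missing field becomes NaN
```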
Note: If you want only the section covered by the HTML snippet in your question (i.e., without the Best Sellers Rank and Customer Reviews keys/columns), you can define rkSel as

rkSel = 'ul:first-of-type span.a-list-item>span.a-text-bold:first-child'
rkSel = f'h2+div#detailBullets_feature_div {rkSel}'
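To see the effect of the narrower selector, here is a toy structure (an assumption about the page: an h2 heading directly before the detail div, whose first ul holds the bullet details and whose second ul holds the rank) where it matches only the first list:

```python
from bs4 import BeautifulSoup

# Toy structure assumed to mimic the page layout the selector targets.
html = """
<h2>Product details</h2>
<div id="detailBullets_feature_div">
<ul><li><span class="a-list-item"><span class="a-text-bold">Publisher : </span><span>Matrix Editions</span></span></li></ul>
<ul><li><span class="a-list-item"><span class="a-text-bold">Best Sellers Rank : </span><span>#7,261,544 in Books</span></span></li></ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
rkSel = 'ul:first-of-type span.a-list-item>span.a-text-bold:first-child'
rkSel = f'h2+div#detailBullets_feature_div {rkSel}'
labels = [b.get_text(' ', strip=True) for b in soup.select(rkSel)]
print(labels)  # ['Publisher :'] -- the Rank row in the second ul is skipped
```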