How to put book information from Amazon into a table form?

Question

I have an Amazon link, i.e., https://www.amazon.com/Functional-Analysis-Dzung-Ha/dp/0971576610

I'm able to extract Product details from the HTML with soup.select("#detailBullets_feature_div .a-unordered-list:first-child"), i.e.,

<ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
<li><span class="a-list-item"> <span class="a-text-bold">Publisher : </span> <span>Matrix Editions (January 1, 2007)</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Language : </span> <span>English</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Hardcover : </span> <span>640 pages</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">ISBN-10 : </span> <span>0971576610</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">ISBN-13 : </span> <span>978-0971576612</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Item Weight : </span> <span>2.3 pounds</span> </span></li>
</ul>

Could you explain how to put the above data into tabular form? My expected result is a table (the original image is not reproduced here). My code so far:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

link = 'https://www.amazon.com/Functional-Analysis-Dzung-Ha/dp/0971576610'
driver = webdriver.Chrome(service = Service(ChromeDriverManager().install()))
driver.get(link)
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.select("#detailBullets_feature_div .a-unordered-list:first-child")

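Answer 1

Score: 1

You can build a dictionary by using the span.a-text-bold elements as keys and the rest of the text in each span.a-list-item as values (as in the function defined below); that dictionary can then be used to form a row in a DataFrame.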

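def get_product_details(prod_soup):
    # Bold label spans inside the "Product details" bullet list
    rkSel = 'span.a-list-item>span.a-text-bold:first-child'
    rkSel = f'div#detailBullets_feature_div {rkSel}'

    pDets = {}
    for r in prod_soup.select(rkSel):
        k = r.get_text(' ', strip=True)  # label text, e.g. 'Publisher :'
        # Value = full text of the list item minus the label
        v = r.parent.get_text(' ', strip=True).replace(k, '', 1).strip()
        # Separate multiple '#'-prefixed rankings with bullets
        if 'Rank' in k and v[:1] == '#':
            v = ' • #'.join(v.split(' #'))

        # Keep the first line of the label; drop the trailing colon and spaces
        pDets[k.split('\n')[0].strip(' :')] = v
    return pDets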

For your link, get_product_details(soup) should return

{ 'Publisher': 'Matrix Editions (January 1, 2007)',
  'Language': 'English',
  'Hardcover': '640 pages',
  'ISBN-10': '0971576610',
  'ISBN-13': '978-0971576612',
  'Item Weight': '2.3 pounds',
  'Best Sellers Rank': '#7,261,544 in Books ( See Top 100 in Books )',
  'Customer Reviews': '5.0 out of 5 stars 2 ratings'}

pd.DataFrame([get_product_details(soup)]) should then return a single-row DataFrame with those keys as columns (the output image is not reproduced here).
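To check the function without loading the page, here is a minimal sketch that runs the same logic on part of the HTML snippet from the question. The wrapping div#detailBullets_feature_div is added by hand so that the selector matches, and stripping ' :' (rather than only ':') is an assumption made to handle the simplified 'Publisher : ' labels in that snippet.

```python
from bs4 import BeautifulSoup

# Part of the snippet from the question, wrapped by hand in the div the selector expects
HTML = """
<div id="detailBullets_feature_div">
<ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
<li><span class="a-list-item"> <span class="a-text-bold">Publisher : </span> <span>Matrix Editions (January 1, 2007)</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Language : </span> <span>English</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">ISBN-10 : </span> <span>0971576610</span> </span></li>
</ul>
</div>
"""

def get_product_details(prod_soup):
    """Map each bold label in the details list to the remaining text of its item."""
    sel = 'div#detailBullets_feature_div span.a-list-item>span.a-text-bold:first-child'
    details = {}
    for label in prod_soup.select(sel):
        key = label.get_text(' ', strip=True)
        # Full item text minus the label leaves the value
        value = label.parent.get_text(' ', strip=True).replace(key, '', 1).strip()
        # Assumption: also strip spaces so 'Publisher :' becomes 'Publisher'
        details[key.split('\n')[0].strip(' :')] = value
    return details

soup = BeautifulSoup(HTML, 'html.parser')
details = get_product_details(soup)
print(details)
```

Passing [details] to pd.DataFrame would then give one row with Publisher, Language and ISBN-10 as columns.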


You could loop through multiple links to fill multiple rows:

urls_list = [
    'https://www.amazon.com/Atomic-Habits-Proven-Build-Break/dp/0735211299',
    'https://www.amazon.com/Functional-Analysis-Dzung-Ha/dp/0971576610',
    'https://www.amazon.com/Daisy-Jones-Taylor-Jenkins-Reid/dp/1524798649',
    'https://www.amazon.com/Body-Keeps-Score-Healing-Trauma/dp/0143127748',
    'https://www.amazon.com/History-Uncertain-Future-Handwriting/dp/1620402157'
]
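books_list = []
for url in urls_list:
    # linkToSoup_scrapingAnt is not defined in this answer; substitute any
    # function that returns a BeautifulSoup object for the given URL
    soup = linkToSoup_scrapingAnt(url)
    # books_list.append(get_product_details(soup)) ## no extra columns

    # Add Title and Author columns (None when the element is missing)
    kList, sList = ['Title', 'Author'], ['span#productTitle', 'span.author']
    book_info = { k: t.get_text(' ', strip=True) if t else None for k,t 
                  in zip(kList, [soup.select_one(s) for s in sList])    }
    books_list.append({**book_info, **get_product_details(soup)})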

pd.DataFrame(books_list) should then give a table with one row per link (the output image is not reproduced here).
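Different books will not always have the same detail keys (one may list Hardcover, another Paperback). As a rough sketch of what happens then, the dictionaries below are made-up placeholders, not scraped data; pandas takes the union of the keys as columns and leaves the missing cells as NaN.

```python
import pandas as pd

# Made-up placeholder rows standing in for the dictionaries collected in books_list
books_list = [
    {'Title': 'Book A', 'Hardcover': '640 pages', 'Language': 'English'},
    {'Title': 'Book B', 'Paperback': '320 pages', 'Language': 'English'},
]

# One row per dictionary; columns are the union of all keys
books_df = pd.DataFrame(books_list)

# Optional: show empty strings instead of NaN for missing details
books_filled = books_df.fillna('')
print(books_filled)
```

Whether to keep the NaN values or fill them depends on what you do with the table afterwards.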


Note: If you want only the section covered by the HTML snippet in your question (i.e., without the Best Sellers Rank and Customer Reviews keys/columns), you can define rkSel as:

    rkSel = 'ul:first-of-type span.a-list-item>span.a-text-bold:first-child'
    rkSel = f'h2+div#detailBullets_feature_div {rkSel}'
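To see the difference between the two selectors, here is a small sketch on hand-written HTML. The layout (an h2 directly before the div, and a second ul holding the rank) is an assumption about how the page is structured, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page layout (assumed, not copied from Amazon)
HTML = """
<h2>Product details</h2>
<div id="detailBullets_feature_div">
<ul>
<li><span class="a-list-item"> <span class="a-text-bold">Language : </span> <span>English</span> </span></li>
</ul>
<ul>
<li><span class="a-list-item"> <span class="a-text-bold">Best Sellers Rank: </span> #7,261,544 in Books </span></li>
</ul>
</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')
label = 'span.a-list-item>span.a-text-bold:first-child'

# Selector from the function: every label inside the details div
broad = f'div#detailBullets_feature_div {label}'
# Selector from the note: only labels inside the first ul of the div that follows an h2
narrow = f'h2+div#detailBullets_feature_div ul:first-of-type {label}'

def keys_for(selector):
    """Return the cleaned label texts matched by a selector."""
    return [t.get_text(strip=True).strip(' :') for t in soup.select(selector)]

broad_keys = keys_for(broad)
narrow_keys = keys_for(narrow)
print(broad_keys)
print(narrow_keys)
```

On this sample, the narrower selector drops the rank entry because it sits in the second ul.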

Posted by huangapple on April 1, 2023, 00:07:29
Original link: https://go.coder-hub.com/75900624.html