How to put book information from Amazon into a table form?

Question

I have an Amazon link, i.e., https://www.amazon.com/Functional-Analysis-Dzung-Ha/dp/0971576610

I'm able to extract Product details from the HTML with soup.select("#detailBullets_feature_div .a-unordered-list:first-child"), i.e.,

    <ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
      <li><span class="a-list-item"> <span class="a-text-bold">Publisher : </span> <span>Matrix Editions (January 1, 2007)</span> </span></li>
      <li><span class="a-list-item"> <span class="a-text-bold">Language : </span> <span>English</span> </span></li>
      <li><span class="a-list-item"> <span class="a-text-bold">Hardcover : </span> <span>640 pages</span> </span></li>
      <li><span class="a-list-item"> <span class="a-text-bold">ISBN-10 : </span> <span>0971576610</span> </span></li>
      <li><span class="a-list-item"> <span class="a-text-bold">ISBN-13 : </span> <span>978-0971576612</span> </span></li>
      <li><span class="a-list-item"> <span class="a-text-bold">Item Weight : </span> <span>2.3 pounds</span> </span></li>
    </ul>

Could you explain how to put the above data into tabular form? My expected result was shown as an image in the original post. My code so far is:

    from bs4 import BeautifulSoup
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    link = 'https://www.amazon.com/Functional-Analysis-Dzung-Ha/dp/0971576610'
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(link)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    soup.select("#detailBullets_feature_div .a-unordered-list:first-child")

Answer 1

Score: 1

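You can build a dictionary using each span.a-text-bold as a key and the rest of the text in its parent span.a-list-item as the value (as in the function defined below); that dictionary can then be used as a row in a DataFrame.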

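    def get_product_details(prod_soup):
        # Select the bold label spans inside the product-details bullets
        rkSel = 'span.a-list-item>span.a-text-bold:first-child'
        rkSel = f'div#detailBullets_feature_div {rkSel}'
        pDets = {}
        for r in prod_soup.select(rkSel):
            k = r.get_text(' ', strip=True)  # label text (the key)
            # Value: the parent's full text with the label removed once
            v = r.parent.get_text(' ', strip=True).replace(k, '', 1).strip()
            # Put a bullet between multiple '#...' rank entries
            if 'Rank' in k and v[:1] == '#':
                v = ' • #'.join(v.split(' #'))
            # Keep only the first line of the label, without colons
            pDets[k.split('\n')[0].strip(':')] = v
        return pDets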

For your link, get_product_details(soup) should return

    {'Publisher': 'Matrix Editions (January 1, 2007)',
     'Language': 'English',
     'Hardcover': '640 pages',
     'ISBN-10': '0971576610',
     'ISBN-13': '978-0971576612',
     'Item Weight': '2.3 pounds',
     'Best Sellers Rank': '#7,261,544 in Books ( See Top 100 in Books )',
     'Customer Reviews': '5.0 out of 5 stars 2 ratings'}
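
The same key/value idea can be tried on the HTML snippet from the question without opening a browser. This is only a minimal sketch, assuming beautifulsoup4 is installed; SNIPPET (a trimmed copy of the question's HTML wrapped in the div the selector expects) and details_from_snippet are illustrative names, and stripping spaces as well as colons from the key is an adjustment for the simplified snippet rather than part of the function above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's HTML, wrapped in the div that the selector targets.
SNIPPET = """
<div id="detailBullets_feature_div">
<ul class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list">
<li><span class="a-list-item"> <span class="a-text-bold">Publisher : </span> <span>Matrix Editions (January 1, 2007)</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">Language : </span> <span>English</span> </span></li>
<li><span class="a-list-item"> <span class="a-text-bold">ISBN-10 : </span> <span>0971576610</span> </span></li>
</ul>
</div>
"""

def details_from_snippet(html):
    """Return {label: value} for each bullet in the product-details list."""
    soup = BeautifulSoup(html, "html.parser")
    selector = "#detailBullets_feature_div span.a-list-item > span.a-text-bold:first-child"
    details = {}
    for label in soup.select(selector):
        key = label.get_text(" ", strip=True)
        whole = label.parent.get_text(" ", strip=True)
        # Remove the label once from the full bullet text to leave the value.
        value = whole.replace(key, "", 1).strip()
        details[key.strip(" :")] = value
    return details

details = details_from_snippet(SNIPPET)
print(details)
```

Each bullet becomes one dictionary entry, so the dictionary as a whole describes one book.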

and pd.DataFrame([get_product_details(soup)]) should return a one-row DataFrame with one column per key.
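
As a small illustration of that step, the sketch below turns one dictionary into a one-row DataFrame. It assumes pandas is installed and uses a hard-coded row (a name chosen here) copied from part of the output above instead of scraping the page.

```python
import pandas as pd

# Values copied from the expected output above; nothing is scraped here.
row = {
    "Publisher": "Matrix Editions (January 1, 2007)",
    "Language": "English",
    "Hardcover": "640 pages",
    "ISBN-10": "0971576610",
    "ISBN-13": "978-0971576612",
    "Item Weight": "2.3 pounds",
}

# A list of dictionaries becomes a table: one row per dict, one column per key.
df = pd.DataFrame([row])
print(df.shape)  # (1, 6)
print(df.T)      # the transposed view is easier to read for a single book
```

Wrapping the dictionary in a list matters: passing the bare dictionary of scalars to pd.DataFrame raises an error because pandas cannot infer an index.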


You could loop through multiple links to fill multiple rows:

    urls_list = [
        'https://www.amazon.com/Atomic-Habits-Proven-Build-Break/dp/0735211299',
        'https://www.amazon.com/Functional-Analysis-Dzung-Ha/dp/0971576610',
        'https://www.amazon.com/Daisy-Jones-Taylor-Jenkins-Reid/dp/1524798649',
        'https://www.amazon.com/Body-Keeps-Score-Healing-Trauma/dp/0143127748',
        'https://www.amazon.com/History-Uncertain-Future-Handwriting/dp/1620402157'
    ]

    books_list = []
    for url in urls_list:
        # linkToSoup_scrapingAnt is the answerer's own helper (not shown here);
        # it presumably returns a BeautifulSoup object for the URL
        soup = linkToSoup_scrapingAnt(url)
        # books_list.append(get_product_details(soup))  ## no extra columns
        kList, sList = ['Title', 'Author'], ['span#productTitle', 'span.author']
        book_info = {k: t.get_text(' ', strip=True) if t else None
                     for k, t in zip(kList, [soup.select_one(s) for s in sList])}
        books_list.append({**book_info, **get_product_details(soup)})
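
linkToSoup_scrapingAnt is not defined in this answer. Judging by its name it fetches the page through the ScrapingAnt service and returns a BeautifulSoup object, but that is an assumption. A minimal stand-in that reuses a Selenium driver like the one in the question might look like this (link_to_soup is a hypothetical name, not the answerer's helper):

```python
from bs4 import BeautifulSoup

def link_to_soup(url, driver):
    """Load `url` in an existing Selenium driver and parse the rendered page."""
    driver.get(url)
    return BeautifulSoup(driver.page_source, "html.parser")

# Possible use, with the driver created as in the question:
#   driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
#   soup = link_to_soup(urls_list[0], driver)
#   driver.quit()
```

Passing the driver in keeps one browser session open for all URLs instead of starting a new one per book.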

pd.DataFrame(books_list) then gives a table with one row per book.
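
Different books can expose different detail keys (for example Hardcover for one and Paperback for another). The sketch below, assuming pandas is installed, shows how a list of such dictionaries still forms a single table. The rows are placeholders shaped like the dictionaries built in the loop, not scraped data.

```python
import pandas as pd

# Placeholder rows with partly different keys; the values are made up for illustration.
books_list = [
    {"Title": "Book A", "Author": "Author A", "Hardcover": "640 pages", "Language": "English"},
    {"Title": "Book B", "Author": "Author B", "Paperback": "320 pages", "Language": "English"},
]

# The columns are the union of all keys; a key missing from a row becomes NaN.
df = pd.DataFrame(books_list)
print(df.columns.tolist())
print(df["Paperback"].isna().tolist())  # [True, False]
```

From there, df.to_csv("books.csv", index=False) would write the table to a file if a saved copy is needed.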


Note: If you want only the section covered by the HTML snippet in your question (i.e., without the Best Sellers Rank and Customer Reviews keys/columns), you can define rkSel as:

    rkSel = 'ul:first-of-type span.a-list-item>span.a-text-bold:first-child'
    rkSel = f'h2+div#detailBullets_feature_div {rkSel}'
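
To see what the narrower selector changes, here is a small sketch on made-up markup, assuming beautifulsoup4 is installed. The HTML below is an assumed layout for illustration only (a heading followed by the details div containing two lists); the real Amazon page may differ.

```python
from bs4 import BeautifulSoup

# Made-up markup: an h2, then the details div with two lists
# (details in the first list, rank in the second).
HTML = """
<body>
<h2>Product details</h2>
<div id="detailBullets_feature_div">
  <ul><li><span class="a-list-item"><span class="a-text-bold">Language : </span><span>English</span></span></li></ul>
  <ul><li><span class="a-list-item"><span class="a-text-bold">Best Sellers Rank: </span>#1 in Books</span></li></ul>
</div>
</body>
"""
soup = BeautifulSoup(HTML, "html.parser")

label = "span.a-list-item>span.a-text-bold:first-child"
broad = f"div#detailBullets_feature_div {label}"
narrow = f"h2+div#detailBullets_feature_div ul:first-of-type {label}"

broad_keys = [s.get_text(strip=True) for s in soup.select(broad)]
narrow_keys = [s.get_text(strip=True) for s in soup.select(narrow)]
print(broad_keys)   # labels from both lists
print(narrow_keys)  # labels from the first list only
```

The ul:first-of-type part restricts matches to the first list inside the div, and h2+ requires the div to directly follow a heading.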

huangapple
  • This article was published on April 1, 2023, 00:07:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75900624.html