March 10, 2023, 01:22:09

How to scrape related category of element using BeautifulSoup?

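# Question

I am trying to scrape https://bulkfollows.com/services. For every service row I want these fields: `'ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill', 'Avg. Time', 'Description', 'category'`. I get everything except the category column. The category is a parent feature of each row, for example:

    " YouTube - Watch Time By Length" or "Instagram - Followers [ From ✓VERIFIED ACCOUNTS]"

This is my code:

```python
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://bulkfollows.com/services"
soup = BeautifulSoup(requests.get(url).content, "lxml")

# Map category IDs to category names, taken from the filter dropdown
categories = dict(
    (e.get('data-filter-category-id'), e.get('data-filter-category-name'))
    for e in soup.select('.dropdown-menu button[data-filter-category-name]')
)

data = []
for e in soup.select("#serviceList tr:has(td)"):
    d = dict(zip(e.find_previous('thead').stripped_strings, e.stripped_strings))
    # This is the line that currently yields None for every row
    d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None
    data.append(d)

pd.DataFrame(data)[['ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill', 'Avg. Time', 'Description', 'category']]
```

I need some help in the for loop to get the parent category.

This is my output: [![screenshot of the current output][1]][1]

I want the category column to be populated instead of `None`, and I want the description that appears when you click on a service. For the first service, for example, it should be:

> Link: https://youtube.com/video Start: 0-12hrs Speed: 100-200 Per day Refill: 30 days
>
> Please Note: Watch time will take 1-3 days to update on analytics.
> After 3 days of delivery, if the watch time does not update, please
> take a screenshot of your video analytic ( Not the Monetization page,
> we don't guarantee Monetization ) and upload it to prntscr.com and
> send it us the uploaded screenshot ).

  [1]: https://i.stack.imgur.com/QLTJp.png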

# Answer 1

**Score**: 2

There is no one-size-fits-all approach to scraping, so you have to select your elements more specifically; the BeautifulSoup documentation describes several finding strategies.

Replace the line:

    d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None

with the following two lines. The first looks back to the preceding `<h4>` to grab the `Category`; the second looks into the modal nested inside the row to get the `Description`:

    d['Category'] = e.find_previous('h4').get_text(strip=True)
    d['Description'] = e.find('div', {'class': 'modal-body'}).get_text(' ', strip=True)

##### Example

    from bs4 import BeautifulSoup
    import pandas as pd
    import requests

    url = "https://bulkfollows.com/services"
    soup = BeautifulSoup(requests.get(url).content, "lxml")

    data = []
    for e in soup.select("#serviceList tr:has(td)"):
        # Pair the header texts of the row's table with the row's cell texts
        d = dict(zip(e.find_previous('thead').stripped_strings, e.stripped_strings))
        # The category title is the closest <h4> that precedes the row
        d['Category'] = e.find_previous('h4').get_text(strip=True)
        # The full description sits in the modal nested inside the row
        d['Description'] = e.find('div', {'class': 'modal-body'}).get_text(' ', strip=True)
        data.append(d)

    pd.DataFrame(data)[['ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill', 'Avg. Time', 'Description', 'Category']]

##### Output

| | ID | Service | Rate per 1000 | Min / Max | Refill | Avg. Time | Description | Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 7365 | YouTube - Subscribers ~ Max 120k ~ 𝗥𝗘𝗙𝗜𝗟𝗟 30D ~ 500-2k/days ~ [ 𝔅𝗲𝙨𝘁 - 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ] | $4.80 | 100 / 120000 | Refill 30 days | 59 hours 53 minutes | Link: https://www.youtube.com/channel/UCYhvmzYNxCAGBaMhnsk69kg Start: Instant - 0 hrs Speed: 500-2k/day Refill: 30 days Drop: 0- 5% drop. | ❖ Bulkfollows High Demand Services |
| 1 | 7363 | Spotify - 𝐅𝐑𝐄𝐄 Plays ~ 𝐋𝐢𝐟𝐞𝐓𝐢𝐦𝐞 ~ 10k-50k/days ~ USA/Russian ~ [ 𝔅𝗲𝙨𝘁 - 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ] | $0.188 | 1000 / 100000000 | Refill Lifetime | 22 hours 26 minutes | Link: https://open.spotify.com/track/40Zb4FZ6nS1Hj8RVfaLkCV Start: Instant ( Avg 0-3 hrs ) Speed: 10k to 20k days Refill: Lifetime Quality: Plays from Bot Created free accounts. Make sure you know the risk of adding of bot plays Drop: Spotify Plays are stable, do not drop. Delivery Time: It will take 2-5 days to update plays. If it's delivery 10k in 1 day, then this 10k will take 2-5 days to update, the next 10k plays will take the next 2-5 days, and so on. | ❖ Bulkfollows High Demand Services |
| … | … | … | … | … | … | … | … | … |
| 3973 | 7613 | Australia Traffic from Instagram | $0.025 | 100 / 1000000 | No Refill | Not enough data | 💡 Use a bit.ly link to track traffic ✅ 100% Real & Unique Visitors ✅ Google Analytics Supported 🕒 Session Length: 40-60 Seconds per visit ⬇️ Bounce Rates: Low ⚡️ Speed: 10,000 unique visitors per day 🏁 Start Time: 0-12h (we check all links for compliance) 🖥️ Desktop Traffic Over 90% 📱 Mobile Traffic Under 10% ⚠️ No Adult, Drug or offensive websites allowed 🔗 Link Format: Enter Full Website URL | ⚊ 🇦🇺 Website Traffic from Australia [ + Choose Referrer ] |
| 3974 | 7614 | Australia Traffic from Wikipedia | $0.025 | 100 / 1000000 | No Refill | Not enough data | 💡 Use a bit.ly link to track traffic ✅ 100% Real & Unique Visitors ✅ Google Analytics Supported 🕒 Session Length: 40-60 Seconds per visit ⬇️ Bounce Rates: Low ⚡️ Speed: 10,000 unique visitors per day 🏁 Start Time: 0-12h (we check all links for compliance) 🖥️ Desktop Traffic Over 90% 📱 Mobile Traffic Under 10% ⚠️ No Adult, Drug or offensive websites allowed 🔗 Link Format: Enter Full Website URL | ⚊ 🇦🇺 Website Traffic from Australia [ + Choose Referrer ] |
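
To see how the `find_previous('h4')` and `modal-body` lookups behave without hitting the live site, here is a minimal sketch that runs them against a small hand-written HTML fragment. The fragment is an assumption: it only imitates the layout described in this answer (a category `<h4>` before each table and a `modal-body` div nested inside each row) and is not copied from bulkfollows.com. The function name `scrape_rows` is likewise made up for illustration, and `html.parser` is used only to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the layout described in the answer.
SAMPLE_HTML = """
<div id="serviceList">
  <h4>YouTube - Watch Time</h4>
  <table>
    <thead><tr><th>ID</th><th>Service</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>1</td><td>Watch hours</td>
        <td><a>Details</a>
          <div class="modal-body"><p>Start: 0-12hrs<br/>Refill: 30 days</p></div>
        </td>
      </tr>
    </tbody>
  </table>
  <h4>Instagram - Followers</h4>
  <table>
    <thead><tr><th>ID</th><th>Service</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>2</td><td>Followers</td>
        <td><a>Details</a>
          <div class="modal-body"><p>Speed: 1k/day</p></div>
        </td>
      </tr>
    </tbody>
  </table>
</div>
"""


def scrape_rows(html):
    """Return one dict per data row, including Category and Description."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("#serviceList tr:has(td)"):
        # Column names come from the closest <thead> before this row.
        headers = tr.find_previous("thead").stripped_strings
        row = dict(zip(headers, tr.stripped_strings))
        # The closest <h4> before the row is the title of its category.
        row["Category"] = tr.find_previous("h4").get_text(strip=True)
        # Overwrite the button label with the text hidden in the modal.
        modal = tr.find("div", {"class": "modal-body"})
        row["Description"] = modal.get_text(" ", strip=True)
        rows.append(row)
    return rows


rows = scrape_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Because `find_previous` walks backwards through the document, each row picks up the heading of its own table rather than the first heading on the page. On the real page a row without a modal would make `tr.find(...)` return `None`, so a guard may be needed there.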

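# Answer 2

**Score**: 1

Your `Category` column only contains `None` because the elements found by `soup.select("#serviceList tr:has(td)")` do NOT have the HTML attribute `data-filter-table-category-id`. The elements it finds look like this: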

    <tr class="">
     <td class="service-id">
      7365
     </td>
     <td class="service-name">
      YouTube - Subscribers ~ Max 120k ~ &#120293;&#120280;&#120281;&#120284;&#120287;&#120287; 30D ~ 500-2k/days ~ [ &#120069;&#120306;&#120424;&#120321; - &#119826;&#119849;&#119838;&#119838;&#119837;, &#119824;&#119854;…                 &#119845;&#119842;&#119853;&#119858;  ]
     </td>
     <td class="service-rate">
      $4.80
     </td>
     <td class="service-min-max">
      100 / 120000
     </td>
     <td class="">
      <span class="badge gurantee">
       Refill 30 days
      </span>
     </td>
     <td class="average-time ser-id-7365">
      63 hours 40 minutes
     </td>
     <td class="text-right service-description">
      <a class="btn btn-sm btn-info" data-target="#description-7365" data-toggle="modal" href="javascript:void(0);">
       <i class="mdi mdi-information">
       </i>
       Details
      </a>
      <!-- Modal -->
      <div aria-hidden="true" aria-labelledby="description7365Label" class="modal fade text-left" id="description-7365" role="dialog" tabindex="-1">
       <div class="modal-dialog" role="document">
        <div class="modal-content">
         <div class="modal-header">
          <h5 class="modal-title" id="description7365Label">
           YouTube - Subscribers ~ Max 120k ~ &#120293;&#120280;&#120281;&#120284;&#120287;&#120287; 30D ~ 500-2k/days ~ [ &#120069;&#120306;&#120424;&#120321; - &#119826;&#119849;&#119838;&#119838;&#119837;,                &#119824;&#119854;&#119834;&#119845;&#119842;&#119853;&#119858; ]'s Description
          </h5>
          <button aria-label="Close" class="close" data-dismiss="modal" type="button">
           <span aria-hidden="true">
            ×
           </span>
          </button>
         </div>
         <div class="modal-body">
          <p style="line-height: 20px;">
           Link: https://www.youtube.com/channel/UCYhvmzYNxCAGBaMhnsk69kg
           <br/>
           Start: Instant - 0 hrs
           <br/>
           Speed: 500-2k/day
           <br/>
           Refill: 30 days
           <br/>
           <br/>
           Drop: 0- 5% drop.
          </p>
         </div>
         <div class="modal-footer">
          <button class="btn btn-primary" data-dismiss="modal" type="button">
           <i class="mdi mdi-close">
           </i>
           Close
          </button>
         </div>
        </div>
       </div>
      </div>
     </td>
    </tr>
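
To confirm that such a row really lacks the attribute, you can call `.get()` on it yourself. The sketch below uses a stripped-down, hand-written row as a stand-in for the markup above, and the contents of the `categories` dict are made up. `Tag.get()` returns `None` for a missing attribute, so the conditional expression from the question always ends up in its `else None` branch.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened row: the <tr> has a class but no data-* attributes.
row_html = '<table><tr class=""><td class="service-id">7365</td></tr></table>'
tr = BeautifulSoup(row_html, "html.parser").find("tr")

# Tag.get() returns None when the attribute is not present on the tag.
category_id = tr.get("data-filter-table-category-id")
print(tr.attrs)      # only the (empty) class attribute is present
print(category_id)   # None

# Same conditional as in the question, with a made-up lookup table.
categories = {"12": "YouTube - Watch Time By Length"}  # hypothetical ID and name
category = categories[category_id] if category_id else None
print(category)      # None, because category_id is None
```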

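From what I gather from your post, you want to create a table similar to the ones on bulkfollows.com, with three main differences:

1. Your table will be the aggregate of the individual tables on the website.
2. Your table will contain an additional column, `Category` (which, I assume, should hold the service category IDs).
3. Your table's `Description` column will contain the text hidden behind the purple Details buttons.

You or someone else can work out the precise solution to your problem; I will merely point you in the right direction.

General approach:

First, collect the HTML elements that make up the individual tables. These are the `div` elements with the classes `col-lg-12 mb-3 ser-row`.

    tables = soup.select('div.col-lg-12.mb-3.ser-row')

Second, iterate over that list of elements. In each iteration:

1. Use the same logic as in your code: for each row, create a dictionary with the current table's column names as keys and the row's cell values as values.
2. Read the value of the `data-filter-table-category-id` HTML attribute, create a new key, `Category`, and assign the attribute's value to it.

Finally, combine the dictionaries into a DataFrame, as you did in your code.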
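
The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the HTML fragment is hand-written, and it assumes, as this answer suggests, that each `div.ser-row` wrapper carries a `data-filter-table-category-id` attribute. It also goes one step further than the answer by translating the ID into a name through the dropdown buttons from the question's code. Check the live page's markup before relying on any of this.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: attribute placement and IDs are assumptions.
SAMPLE_HTML = """
<div class="dropdown-menu">
  <button data-filter-category-id="10"
          data-filter-category-name="YouTube - Watch Time">YouTube</button>
  <button data-filter-category-id="20"
          data-filter-category-name="Instagram - Followers">Instagram</button>
</div>
<div id="serviceList">
  <div class="col-lg-12 mb-3 ser-row" data-filter-table-category-id="10">
    <table>
      <thead><tr><th>ID</th><th>Service</th></tr></thead>
      <tbody><tr><td>1</td><td>Watch hours</td></tr></tbody>
    </table>
  </div>
  <div class="col-lg-12 mb-3 ser-row" data-filter-table-category-id="20">
    <table>
      <thead><tr><th>ID</th><th>Service</th></tr></thead>
      <tbody><tr><td>2</td><td>Followers</td></tr></tbody>
    </table>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Category ID -> category name, read from the filter dropdown.
categories = {
    button["data-filter-category-id"]: button["data-filter-category-name"]
    for button in soup.select(".dropdown-menu button[data-filter-category-name]")
}

data = []
# Step 1: collect the wrappers of the individual tables.
for table in soup.select("div.col-lg-12.mb-3.ser-row"):
    # Step 2: read the category ID once per table.
    category_id = table.get("data-filter-table-category-id")
    headers = list(table.find("thead").stripped_strings)
    for tr in table.select("tr:has(td)"):
        row = dict(zip(headers, tr.stripped_strings))
        # .get() yields None instead of raising if the ID is unknown.
        row["Category"] = categories.get(category_id)
        data.append(row)

print(data)
```

Passing `data` to `pd.DataFrame(...)` then gives the aggregated table, as in the question's code. Reading the attribute once per table rather than once per row reflects that, under this assumption, the category belongs to the table wrapper and not to the individual `<tr>` elements.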

