How to scrape related category of element using BeautifulSoup?

Question

I am trying to scrape https://bulkfollows.com/services. I want every service row with these fields: `'ID'`, `'Service'`, `'Rate per 1000'`, `'Min / Max'`, `'Refill'`, `'Avg. Time'`, `'Description'`, `'category'`. I have everything except the `category` column. The category is a parent feature of each group of services, for example:

    "YouTube - Watch Time By Length" or "Instagram - Followers [ From ✓VERIFIED ACCOUNTS]"

This is my code:

    from bs4 import BeautifulSoup
    import pandas as pd
    import requests

    url = "https://bulkfollows.com/services"
    soup = BeautifulSoup(requests.get(url).content, "lxml")

    # Map category IDs to category names, taken from the filter dropdown
    categories = dict(
        (e.get('data-filter-category-id'), e.get('data-filter-category-name'))
        for e in soup.select('.dropdown-menu button[data-filter-category-name]')
    )

    data = []
    for e in soup.select("#serviceList tr:has(td)"):
        # Pair the header cells of the enclosing table with this row's cell texts
        d = dict(zip(e.find_previous('thead').stripped_strings, e.stripped_strings))
        d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None
        data.append(d)

    pd.DataFrame(data)[['ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill', 'Avg. Time', 'Description', 'category']]

I need some help in the for loop to get the parent column.
This is my output: [![enter image description here][1]][1]

I want the category column not to be `None`, and the description to be the text shown when you click on a service. For example, for the first service I want it to be:

> Link: https://youtube.com/video Start: 0-12hrs Speed: 100-200 Per day Refill: 30 days
>
> Please Note: Watch time will take 1-3 days to update on analytics.
> After 3 days of delivery, if the watch time does not update, please
> take a screenshot of your video analytic ( Not the Monetization page,
> we don't guarantee Monetization ) and upload it to prntscr.com and
> send it us the uploaded screenshot ).

  [1]: https://i.stack.imgur.com/QLTJp.png


# Answer 1
**Score**: 2

There is no one-size-fits-all approach to scraping, so you have to select your elements more specifically. The docs describe several search strategies.

Replace the line:

    d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None

with the following, which looks back to the previous `<h4>` to grab the `Category` and into the modal inside the row to get the `Description`:

    d['Category'] = e.find_previous('h4').get_text(strip=True)
    d['Description'] = e.find('div', {'class': 'modal-body'}).get_text(' ', strip=True)

##### Example

    from bs4 import BeautifulSoup
    import pandas as pd
    import requests

    url = "https://bulkfollows.com/services"
    soup = BeautifulSoup(requests.get(url).content, "lxml")

    # Kept from the question; not needed once the category is read from the <h4>
    categories = dict((e.get('data-filter-category-id'), e.get('data-filter-category-name')) for e in soup.select('.dropdown-menu button[data-filter-category-name]'))

    data = []
    for e in soup.select("#serviceList tr:has(td)"):
        d = dict(zip(e.find_previous('thead').stripped_strings, e.stripped_strings))
        # Category: text of the closest <h4> that precedes this row
        d['Category'] = e.find_previous('h4').get_text(strip=True)
        # Description: text of the modal body nested inside this row
        d['Description'] = e.find('div', {'class': 'modal-body'}).get_text(' ', strip=True)
        data.append(d)

    pd.DataFrame(data)[['ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill', 'Avg. Time', 'Description', 'Category']]
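
The two lookups can also be tried without hitting the live site. The following is only a minimal sketch: the `HTML` snippet is hypothetical markup modelled on the structure described in this answer (an `<h4>` heading followed by a table whose rows contain a `modal-body`), `parse_rows` is an illustrative helper name, and the `None` guards are an addition for rows that lack a heading or a modal.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (an assumption, not the real page):
# an <h4> category heading followed by a table whose rows hold a hidden modal.
HTML = """
<div id="serviceList">
  <h4>YouTube - Watch Time By Length</h4>
  <table>
    <thead><tr><th>ID</th><th>Service</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>1</td><td>Watch time</td>
        <td><a>Details</a><div class="modal-body"><p>Start: 0-12hrs<br/>Refill: 30 days</p></div></td>
      </tr>
      <tr><td>2</td><td>No modal here</td><td></td></tr>
    </tbody>
  </table>
</div>
"""


def parse_rows(html):
    """Return one dict per data row, with Category and Description filled in."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("#serviceList tr:has(td)"):
        heading = tr.find_previous("h4")             # closest <h4> before the row
        modal = tr.find("div", class_="modal-body")  # modal nested inside the row
        cells = tr.find_all("td")
        rows.append({
            "ID": cells[0].get_text(strip=True),
            "Service": cells[1].get_text(strip=True),
            # Guard against rows with no heading or no modal
            "Category": heading.get_text(strip=True) if heading else None,
            "Description": modal.get_text(" ", strip=True) if modal else None,
        })
    return rows


rows = parse_rows(HTML)
```

Passing `" "` as the separator to `get_text` joins the text fragments around each `<br/>` with a space instead of running them together.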
##### Output

| | ID | Service | Rate per 1000 | Min / Max | Refill | Avg. Time | Description | Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 7365 | YouTube - Subscribers ~ Max 120k ~ 𝗥𝗘𝗙𝗜𝗟𝗟 30D ~ 500-2k/days ~ [ 𝔅𝗲𝙨𝘁 - 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ] | $4.80 | 100 / 120000 | Refill 30 days | 59 hours 53 minutes | Link: https://www.youtube.com/channel/UCYhvmzYNxCAGBaMhnsk69kg Start: Instant - 0 hrs Speed: 500-2k/day Refill: 30 days Drop: 0- 5% drop. | ❖ Bulkfollows High Demand Services |
| 1 | 7363 | Spotify - 𝐅𝐑𝐄𝐄 Plays ~ 𝐋𝐢𝐟𝐞𝐓𝐢𝐦𝐞 ~ 10k-50k/days ~ USA/Russian ~ [ 𝔅𝗲𝙨𝘁 - 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ] | $0.188 | 1000 / 100000000 | Refill Lifetime | 22 hours 26 minutes | Link: https://open.spotify.com/track/40Zb4FZ6nS1Hj8RVfaLkCV Start: Instant ( Avg 0-3 hrs ) Speed: 10k to 20k days Refill: Lifetime Quality: Plays from Bot Created free accounts. Make sure you know the risk of adding of bot plays Drop: Spotify Plays are stable, do not drop. Delivery Time: It will take 2-5 days to update plays. If it's delivery 10k in 1 day, then this 10k will take 2-5 days to update, the next 10k plays will take the next 2-5 days, and so on. | ❖ Bulkfollows High Demand Services |
| 3973 | 7613 | Australia Traffic from Instagram | $0.025 | 100 / 1000000 | No Refill | Not enough data | 💡 Use a bit.ly link to track traffic ✅ 100% Real & Unique Visitors ✅ Google Analytics Supported 🕒 Session Length: 40-60 Seconds per visit ⬇️ Bounce Rates: Low ⚡️ Speed: 10,000 unique visitors per day 🏁 Start Time: 0-12h (we check all links for compliance) 🖥️ Desktop Traffic Over 90% 📱 Mobile Traffic Under 10% ⚠️ No Adult, Drug or offensive websites allowed 🔗 Link Format: Enter Full Website URL | ⚊ 🇦🇺 Website Traffic from Australia [ + Choose Referrer ] |
| 3974 | 7614 | Australia Traffic from Wikipedia | $0.025 | 100 / 1000000 | No Refill | Not enough data | 💡 Use a bit.ly link to track traffic ✅ 100% Real & Unique Visitors ✅ Google Analytics Supported 🕒 Session Length: 40-60 Seconds per visit ⬇️ Bounce Rates: Low ⚡️ Speed: 10,000 unique visitors per day 🏁 Start Time: 0-12h (we check all links for compliance) 🖥️ Desktop Traffic Over 90% 📱 Mobile Traffic Under 10% ⚠️ No Adult, Drug or offensive websites allowed 🔗 Link Format: Enter Full Website URL | ⚊ 🇦🇺 Website Traffic from Australia [ + Choose Referrer ] |

# Answer 2
**Score**: 1

The reason your `Category` column only contains `None` values is that the elements found by `soup.select("#serviceList tr:has(td)")` do NOT have the HTML attribute `data-filter-table-category-id`. The elements it finds look like this:

    <tr class="">
     <td class="service-id">
      7365
     </td>
     <td class="service-name">
      YouTube - Subscribers ~ Max 120k ~ &#120293;&#120280;&#120281;&#120284;&#120287;&#120287; 30D ~ 500-2k/days ~ [ &#120069;&#120306;&#120424;&#120321; - &#119826;&#119849;&#119838;&#119838;&#119837;, &#119824;&#119854;…                 &#119845;&#119842;&#119853;&#119858;  ]
     </td>
     <td class="service-rate">
      $4.80
     </td>
     <td class="service-min-max">
      100 / 120000
     </td>
     <td class="">
      <span class="badge gurantee">
       Refill 30 days
      </span>
     </td>
     <td class="average-time ser-id-7365">
      63 hours 40 minutes
     </td>
     <td class="text-right service-description">
      <a class="btn btn-sm btn-info" data-target="#description-7365" data-toggle="modal" href="javascript:void(0);">
       <i class="mdi mdi-information">
       </i>
       Details
      </a>
      <!-- Modal -->
      <div aria-hidden="true" aria-labelledby="description7365Label" class="modal fade text-left" id="description-7365" role="dialog" tabindex="-1">
       <div class="modal-dialog" role="document">
        <div class="modal-content">
         <div class="modal-header">
          <h5 class="modal-title" id="description7365Label">
           YouTube - Subscribers ~ Max 120k ~ &#120293;&#120280;&#120281;&#120284;&#120287;&#120287; 30D ~ 500-2k/days ~ [ &#120069;&#120306;&#120424;&#120321; - &#119826;&#119849;&#119838;&#119838;&#119837;,                &#119824;&#119854;&#119834;&#119845;&#119842;&#119853;&#119858; ]'s Description
          </h5>
          <button aria-label="Close" class="close" data-dismiss="modal" type="button">
           <span aria-hidden="true">
            ×
           </span>
          </button>
         </div>
         <div class="modal-body">
          <p style="line-height: 20px;">
           Link: https://www.youtube.com/channel/UCYhvmzYNxCAGBaMhnsk69kg
           <br/>
           Start: Instant - 0 hrs
           <br/>
           Speed: 500-2k/day
           <br/>
           Refill: 30 days
           <br/>
           <br/>
           Drop: 0- 5% drop.
          </p>
         </div>
         <div class="modal-footer">
          <button class="btn btn-primary" data-dismiss="modal" type="button">
           <i class="mdi mdi-close">
           </i>
           Close
          </button>
         </div>
        </div>
       </div>
      </div>
     </td>
    </tr>
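
To see the effect in isolation, here is a small sketch. It assumes (this is not confirmed against the live page) that the `data-filter-table-category-id` attribute sits on an element wrapping the table rather than on the `<tr>` itself; the `HTML` snippet and the value `42` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the data attribute is on a wrapping <div>, not on <tr>.
HTML = """
<div class="ser-row" data-filter-table-category-id="42">
  <table><tbody><tr class=""><td class="service-id">7365</td></tr></tbody></table>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
row = soup.select_one("tr:has(td)")

# The row itself carries no such attribute, so .get() returns None.
on_row = row.get("data-filter-table-category-id")

# Searching upwards finds the nearest ancestor that does carry it.
wrapper = row.find_parent(attrs={"data-filter-table-category-id": True})
on_wrapper = wrapper.get("data-filter-table-category-id") if wrapper else None
```

`Tag.get` returns `None` for a missing attribute instead of raising, which is why the question's conditional silently fell through to `None` for every row.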

From what I have deciphered from your post, you want to create a table similar to the ones on bulkfollows.com, with three main differences:

  1. Your table will be the aggregate of the tables on the website.

  2. Your table will contain an additional column, `Category` (which will contain the service category IDs?).

  3. Your table's `Description` column will contain the text hidden behind the purple Details buttons.

You or someone else can work out the precise solution to your problem; I will merely point you in the right direction.

General approach:

First, collect the HTML elements that make up the individual tables. These are the `div` elements with the classes `col-lg-12 mb-3 ser-row`.

    tables = soup.select('div.col-lg-12.mb-3.ser-row')

Second, iterate over the list of elements.

Then, in each iteration:

  1. Use the same logic as in your code. That is, create a dictionary with the current table's column names as the keys and the row's values as the values.

  2. Get the value of the attribute `data-filter-table-category-id`. Create a new key, `Category`, and assign the attribute's value to it.

  3. After the loop, combine the dicts into a DataFrame (as you did in your code).
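
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested solution for the live site: the `HTML` snippet is invented, it assumes the `data-filter-table-category-id` attribute is on each `ser-row` div, the category name `Spotify - Plays` and the helper `collect_services` are hypothetical, and mapping the ID to a name through the dropdown buttons reuses the idea from the question's code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class names and data attributes follow the ones quoted
# in this thread, but the real page layout may differ.
HTML = """
<div class="dropdown-menu">
  <button data-filter-category-id="1" data-filter-category-name="YouTube - Watch Time By Length"></button>
  <button data-filter-category-id="2" data-filter-category-name="Spotify - Plays"></button>
</div>
<div class="col-lg-12 mb-3 ser-row" data-filter-table-category-id="1">
  <table>
    <thead><tr><th>ID</th><th>Service</th></tr></thead>
    <tbody><tr><td>10</td><td>Watch time A</td></tr></tbody>
  </table>
</div>
<div class="col-lg-12 mb-3 ser-row" data-filter-table-category-id="2">
  <table>
    <thead><tr><th>ID</th><th>Service</th></tr></thead>
    <tbody><tr><td>20</td><td>Plays B</td></tr><tr><td>21</td><td>Plays C</td></tr></tbody>
  </table>
</div>
"""


def collect_services(html):
    """Aggregate the rows of every per-category table into one list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    # Category ID -> category name, read from the filter dropdown
    names = {
        b["data-filter-category-id"]: b["data-filter-category-name"]
        for b in soup.select(".dropdown-menu button[data-filter-category-name]")
    }
    records = []
    for table in soup.select("div.col-lg-12.mb-3.ser-row"):
        category_id = table.get("data-filter-table-category-id")
        headers = [th.get_text(strip=True) for th in table.select("thead th")]
        for tr in table.select("tbody tr"):
            values = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
            record = dict(zip(headers, values))
            # Fall back to the raw ID when no name is known for it
            record["Category"] = names.get(category_id, category_id)
            records.append(record)
    return records


records = collect_services(HTML)
```

The resulting list of dicts can be passed to `pd.DataFrame(records)` to get a single aggregated table, as in the question's code.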

huangapple
  • Posted on March 10, 2023, 01:22:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75688015.html