How to scrape related category of element using BeautifulSoup?

# Question

I am trying to scrape https://bulkfollows.com/services. I want every service row with these fields: `'ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill', 'Avg. Time', 'Description', 'category'`. I get everything except the `category` column. The category is a parent feature of each row, for example:

" YouTube - Watch Time By Length" or "Instagram - Followers [ From ✓VERIFIED ACCOUNTS]"

This is my code:

```python
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://bulkfollows.com/services"
soup = BeautifulSoup(requests.get(url).content, "lxml")

# Map each category id to its display name, taken from the filter dropdown
categories = dict(
    (e.get('data-filter-category-id'), e.get('data-filter-category-name'))
    for e in soup.select('.dropdown-menu button[data-filter-category-name]')
)

data = []
for e in soup.select("#serviceList tr:has(td)"):
    # Pair the header cells of the preceding <thead> with this row's cell texts
    d = dict(zip(e.find_previous('thead').stripped_strings, e.stripped_strings))
    # Look up the category name; this currently always yields None
    d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None
    data.append(d)

pd.DataFrame(data)[['ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill', 'Avg. Time', 'Description', 'category']]
```

I need help with the for loop to get the parent category.

This is my output: [![enter image description here][1]][1]

I want the category column to be filled instead of `None`. I also want the Description column to hold the text that appears when you click a service. For the first service, for example, it should be:

> > Link: https://youtube.com/video Start: 0-12hrs Speed: 100-200 Per day Refill: 30 days
>
> Please Note: Watch time will take 1-3 days to update on analytics.
> After 3 days of delivery, if the watch time does not update, please
> take a screenshot of your video analytic ( Not the Monetization page,
> we don't guarantee Monetization ) and upload it to prntscr.com and
> send it us the uploaded screenshot ).

[1]: https://i.stack.imgur.com/QLTJp.png
# Answer 1

**Score**: 2

There is no one-size-fits-all approach to scraping, so you have to select your elements more specifically. The BeautifulSoup docs describe several search strategies.

Replace this line:

```python
d['category'] = categories[e.get('data-filter-table-category-id')] if e.get('data-filter-table-category-id') else None
```

with the following two lines, which look back to the previous `<h4>` to get the Category and into the modal inside the row to get the Description:

```python
d['Category'] = e.find_previous('h4').get_text(strip=True)
d['Description'] = e.find('div', {'class': 'modal-body'}).get_text(' ', strip=True)
```

##### Example

```python
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://bulkfollows.com/services"
soup = BeautifulSoup(requests.get(url).content, "lxml")

data = []
for e in soup.select("#serviceList tr:has(td)"):
    # Pair the header cells of the preceding <thead> with this row's cell texts
    d = dict(zip(e.find_previous('thead').stripped_strings, e.stripped_strings))
    # The category name is the nearest <h4> that precedes the row
    d['Category'] = e.find_previous('h4').get_text(strip=True)
    # The full description is the body of the modal nested inside the row
    d['Description'] = e.find('div', {'class': 'modal-body'}).get_text(' ', strip=True)
    data.append(d)

pd.DataFrame(data)[['ID', 'Service', 'Rate per 1000', 'Min / Max', 'Refill', 'Avg. Time', 'Description', 'Category']]
```
##### Output

| | ID | Service | Rate per 1000 | Min / Max | Refill | Avg. Time | Description | Category |
---|---|---|---|---|---|---|---|---|
0 | 7365 | YouTube - Subscribers ~ Max 120k ~ 𝗥𝗘𝗙𝗜𝗟𝗟 30D ~ 500-2k/days ~ [ 𝔅𝗲𝙨𝘁 - 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ] | $4.80 | 100 / 120000 | Refill 30 days | 59 hours 53 minutes | Link: https://www.youtube.com/channel/UCYhvmzYNxCAGBaMhnsk69kg Start: Instant - 0 hrs Speed: 500-2k/day Refill: 30 days Drop: 0- 5% drop. | ❖ Bulkfollows High Demand Services |
1 | 7363 | Spotify - 𝐅𝐑𝐄𝐄 Plays ~ 𝐋𝐢𝐟𝐞𝐓𝐢𝐦𝐞 ~ 10k-50k/days ~ USA/Russian ~ [ 𝔅𝗲𝙨𝘁 - 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ] | $0.188 | 1000 / 100000000 | Refill Lifetime | 22 hours 26 minutes | Link: https://open.spotify.com/track/40Zb4FZ6nS1Hj8RVfaLkCV Start: Instant ( Avg 0-3 hrs ) Speed: 10k to 20k days Refill: Lifetime Quality: Plays from Bot Created free accounts. Make sure you know the risk of adding of bot plays Drop: Spotify Plays are stable, do not drop. Delivery Time: It will take 2-5 days to update plays. If it's delivery 10k in 1 day, then this 10k will take 2-5 days to update, the next 10k plays will take the next 2-5 days, and so on. | ❖ Bulkfollows High Demand Services |
3973 | 7613 | Australia Traffic from Instagram | $0.025 | 100 / 1000000 | No Refill | Not enough data | 💡 Use a bit.ly link to track traffic ✅ 100% Real & Unique Visitors ✅ Google Analytics Supported 🕒 Session Length: 40-60 Seconds per visit ⬇️ Bounce Rates: Low ⚡️ Speed: 10,000 unique visitors per day 🏁 Start Time: 0-12h (we check all links for compliance) 🖥️ Desktop Traffic Over 90% 📱 Mobile Traffic Under 10% ⚠️ No Adult, Drug or offensive websites allowed 🔗 Link Format: Enter Full Website URL | ⚊ 🇦🇺 Website Traffic from Australia [ + Choose Referrer ] |
3974 | 7614 | Australia Traffic from Wikipedia | $0.025 | 100 / 1000000 | No Refill | Not enough data | 💡 Use a bit.ly link to track traffic ✅ 100% Real & Unique Visitors ✅ Google Analytics Supported 🕒 Session Length: 40-60 Seconds per visit ⬇️ Bounce Rates: Low ⚡️ Speed: 10,000 unique visitors per day 🏁 Start Time: 0-12h (we check all links for compliance) 🖥️ Desktop Traffic Over 90% 📱 Mobile Traffic Under 10% ⚠️ No Adult, Drug or offensive websites allowed 🔗 Link Format: Enter Full Website URL | ⚊ 🇦🇺 Website Traffic from Australia [ + Choose Referrer ] |
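
To see why these two lookups work without requesting the live site, here is a small offline sketch. The HTML fragment is made up for illustration: the headings, ids, and texts are assumptions loosely modeled on the row markup, and `html.parser` replaces `lxml` only to avoid the extra dependency. It also suggests why zipping `stripped_strings` alone is not enough for the Description column: the first string in that cell is the button label.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one <h4> per category, followed by its table.
html = """
<div id="serviceList">
  <h4>YouTube - Watch Time</h4>
  <table>
    <thead><tr><th>ID</th><th>Service</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>1</td><td>Watch hours</td>
        <td><a>Details</a>
          <div class="modal"><div class="modal-body">
            <p>Start: 0-12hrs<br/>Refill: 30 days</p>
          </div></div>
        </td>
      </tr>
    </tbody>
  </table>
  <h4>Instagram - Followers</h4>
  <table>
    <thead><tr><th>ID</th><th>Service</th><th>Description</th></tr></thead>
    <tbody>
      <tr>
        <td>2</td><td>Followers</td>
        <td><a>Details</a>
          <div class="modal"><div class="modal-body"><p>Speed: 1k/day</p></div></div>
        </td>
      </tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
button_labels = []
for tr in soup.select("#serviceList tr:has(td)"):
    d = dict(zip(tr.find_previous("thead").stripped_strings, tr.stripped_strings))
    # After the zip, 'Description' holds only the first string of that cell.
    button_labels.append(d["Description"])
    # find_previous walks backwards in document order to the nearest <h4>.
    d["Category"] = tr.find_previous("h4").get_text(strip=True)
    # find searches the row's descendants, so it reaches the nested modal body.
    d["Description"] = tr.find("div", {"class": "modal-body"}).get_text(" ", strip=True)
    rows.append(d)

for row in rows:
    print(row)
```

Passing `' '` as the separator to `get_text` joins the text fragments around each `<br/>` with a single space, which gives one readable line per description.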
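# Answer 2

**Score**: 1

Your Category column only contains `None` because the elements that `soup.select("#serviceList tr:has(td)")` finds do NOT have the HTML attribute `data-filter-table-category-id`. The elements it finds look like this: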
<tr class="">
<td class="service-id">
7365
</td>
<td class="service-name">
YouTube - Subscribers ~ Max 120k ~ 𝗥𝗘𝗙𝗜𝗟𝗟 30D ~ 500-2k/days ~ [ 𝔅𝗲𝙨𝘁 - 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ]
</td>
<td class="service-rate">
$4.80
</td>
<td class="service-min-max">
100 / 120000
</td>
<td class="">
<span class="badge gurantee">
Refill 30 days
</span>
</td>
<td class="average-time ser-id-7365">
63 hours 40 minutes
</td>
<td class="text-right service-description">
<a class="btn btn-sm btn-info" data-target="#description-7365" data-toggle="modal" href="javascript:void(0);">
<i class="mdi mdi-information">
</i>
Details
</a>
<!-- Modal -->
<div aria-hidden="true" aria-labelledby="description7365Label" class="modal fade text-left" id="description-7365" role="dialog" tabindex="-1">
<div class="modal-dialog" role="document">
<div class="modal-content">
<div class="modal-header">
<h5 class="modal-title" id="description7365Label">
YouTube - Subscribers ~ Max 120k ~ 𝗥𝗘𝗙𝗜𝗟𝗟 30D ~ 500-2k/days ~ [ 𝔅𝗲𝙨𝘁 - 𝐒𝐩𝐞𝐞𝐝, 𝐐𝐮𝐚𝐥𝐢𝐭𝐲 ]'s Description
</h5>
<button aria-label="Close" class="close" data-dismiss="modal" type="button">
<span aria-hidden="true">
×
</span>
</button>
</div>
<div class="modal-body">
<p style="line-height: 20px;">
Link: https://www.youtube.com/channel/UCYhvmzYNxCAGBaMhnsk69kg
<br/>
Start: Instant - 0 hrs
<br/>
Speed: 500-2k/day
<br/>
Refill: 30 days
<br/>
<br/>
Drop: 0- 5% drop.
</p>
</div>
<div class="modal-footer">
<button class="btn btn-primary" data-dismiss="modal" type="button">
<i class="mdi mdi-close">
</i>
Close
</button>
</div>
</div>
</div>
</div>
</td>
</tr>
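
This diagnosis can be checked offline. The sketch below runs on a made-up fragment: the wrapper `div` carrying `data-filter-table-category-id` and the id value `12` are assumptions about where the site puts the attribute, and only the `<tr>` without it is confirmed by the markup above. It shows that `Tag.get` returns `None` for a missing attribute, and that `find_parent` can locate the nearest ancestor that does carry it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the attribute sits on a wrapper, not on the <tr>.
html = """
<div class="ser-row" data-filter-table-category-id="12">
  <table><tbody>
    <tr class=""><td class="service-id">7365</td></tr>
  </tbody></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.select_one("tr:has(td)")

# Tag.get returns None when the attribute is absent, so a conditional
# like the one in the question falls through to its `else None` branch.
row_value = row.get("data-filter-table-category-id")

# find_parent walks up the tree; attrs={name: True} matches any element
# that has the attribute, whatever its value.
holder = row.find_parent(attrs={"data-filter-table-category-id": True})
holder_value = holder.get("data-filter-table-category-id") if holder else None

print(row_value, holder_value)
```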
From what I understand of your post, you want to create a table similar to the ones on bulkfollows.com, with three main differences:

- Your table will be the aggregate of the individual tables on the website.
- Your table will contain an additional column, Category (which, as far as I can tell, will hold the service category IDs).
- Your table's Description column will contain the text hidden behind the purple Details buttons.

You or someone else can work out the precise solution; I will merely point you in the right direction.

General approach:

First, collect the HTML elements that make up the individual tables. These are the `div` elements with the classes `col-lg-12 mb-3 ser-row`:

    tables = soup.select('div.col-lg-12.mb-3.ser-row')

Second, iterate over that list of elements. In each iteration:

- Use the same logic as in your code: create a dictionary with the current table's column names as keys and the row's values as values.
- Get the value of the HTML attribute `data-filter-table-category-id`, create a new key, `Category`, and assign the attribute's value to it.

Finally, combine the dictionaries into a DataFrame (as you did in your code).
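
The steps above can be sketched as follows. This is only one possible reading of the approach: it assumes that each `div.col-lg-12.mb-3.ser-row` wrapper carries the `data-filter-table-category-id` attribute, and that the dropdown buttons map ids to names as in the question's `categories` dict. The sample HTML is invented so that the snippet runs offline, and the ids and names in it are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a filter dropdown plus one wrapper div per category table.
html = """
<div class="dropdown-menu">
  <button data-filter-category-id="12" data-filter-category-name="YouTube - Watch Time">YouTube</button>
  <button data-filter-category-id="34" data-filter-category-name="Instagram - Followers">Instagram</button>
</div>
<div id="serviceList">
  <div class="col-lg-12 mb-3 ser-row" data-filter-table-category-id="12">
    <table>
      <thead><tr><th>ID</th><th>Service</th></tr></thead>
      <tbody><tr><td>1</td><td>Watch hours</td></tr></tbody>
    </table>
  </div>
  <div class="col-lg-12 mb-3 ser-row" data-filter-table-category-id="34">
    <table>
      <thead><tr><th>ID</th><th>Service</th></tr></thead>
      <tbody><tr><td>2</td><td>Followers</td></tr></tbody>
    </table>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Map category id -> category name, as in the question.
categories = {
    b["data-filter-category-id"]: b["data-filter-category-name"]
    for b in soup.select(".dropdown-menu button[data-filter-category-name]")
}

data = []
# Step 1: collect the wrappers of the individual tables.
for table in soup.select("div.col-lg-12.mb-3.ser-row"):
    # Step 2: read the category id once per table (assumed to be on the wrapper).
    category_id = table.get("data-filter-table-category-id")
    header = list(table.select_one("thead").stripped_strings)
    for tr in table.select("tr:has(td)"):
        d = dict(zip(header, tr.stripped_strings))
        # .get avoids a KeyError if an id has no matching dropdown button.
        d["Category"] = categories.get(category_id)
        data.append(d)

print(data)
```

Passing `data` to `pd.DataFrame` would then give the aggregated table, as in the question; pandas is left out here only to keep the sketch dependency-light.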