2023-02-18 12:51:44

Is there a way to specifically web scrape and get the data of heights that is not listed in text?

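Question

I'm web scraping the heights of a list of athletes. I have written code to get the heights, but after inspecting the element I noticed that the visible text gives the height in feet and inches (for example, 6-7), while the "data-sort" attribute gives the same height in inches. Both are on the td tag with class "height". However, when I use get_text() or .text to strip the HTML, it only prints the height in feet and drops the hidden value in inches. Is there a way to get the height in inches? That would make the math easier.

Here is an example of what I'm scraping. I want to discard everything else and keep only the heights in inches, which would be [79, 85, 74, ...] in this case.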

<td class="height" data-sort="79">6-7</td>
<td class="height" data-sort="85">7-1</td>
<td class="height" data-sort="74">6-2</td>

# This is my code
from bs4 import BeautifulSoup
import requests 
urls=['https://goduke.com/sports/mens-basketball/roster']
ListData=[]
for x in range(len(urls)):
    page=requests.get(urls[x]).text
    pagesoup=BeautifulSoup(page,'html.parser')
    h=pagesoup.find_all('td', class_="height")
    ListData.append(h)
NewList=[]
for b in range(len(ListData)):
    new=[]
    for x in ListData[b]:
        print(x.text)


Answer 1

Score: 0

If you use a CSS selector, you can simply pass the first class name.

from scrapy.selector import Selector


Answer 2

Score: 0

from bs4 import BeautifulSoup
import requests

urls = ['https://goduke.com/sports/mens-basketball/roster']
ListData = []
for url in urls:
    page = requests.get(url).text
    pagesoup = BeautifulSoup(page, 'html.parser')
    # Select only td.height cells that actually carry a data-sort attribute
    tds = pagesoup.select('td.height[data-sort]')
    for td in tds:
        # Read the attribute value (inches) instead of the visible text
        ListData.append(td.attrs['data-sort'])
print(ListData)

output

['79', '85', '74', '74', '77', '77', '78', '77', '82', '85', '80', '84', '77', '84', '68']
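
Note that the values in this output are strings, because HTML attribute values are always text. For the math the question mentions, they would need converting, for example by wrapping `td.attrs['data-sort']` in `int()`. The sketch below illustrates that conversion on the three sample cells from the question. It swaps BeautifulSoup for the standard library's `html.parser` so it runs without third-party packages. `HeightParser` and `feet_inches_to_inches` are hypothetical names introduced here, and the second helper is only a cross-check against the visible feet-inches text.

```python
from html.parser import HTMLParser
from statistics import mean


class HeightParser(HTMLParser):
    """Collect the data-sort value of every <td class="height"> cell."""

    def __init__(self):
        super().__init__()
        self.heights_in = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "td" and "height" in classes and attr_map.get("data-sort"):
            # Attribute values arrive as strings; convert before doing math.
            self.heights_in.append(int(attr_map["data-sort"]))


def feet_inches_to_inches(text):
    """Convert a visible label such as "6-7" (feet-inches) to total inches."""
    feet, inches = text.split("-")
    return int(feet) * 12 + int(inches)


# Sample cells copied from the question (the table wrapper is an assumption).
sample_html = """
<table><tr>
<td class="height" data-sort="79">6-7</td>
<td class="height" data-sort="85">7-1</td>
<td class="height" data-sort="74">6-2</td>
</tr></table>
"""

parser = HeightParser()
parser.feed(sample_html)
heights = parser.heights_in

print(heights)        # [79, 85, 74]
print(mean(heights))  # average height in inches
```

The cross-check agrees with the sample data: 6-7 is 6 * 12 + 7 = 79 inches, which matches the `data-sort` value on the same cell.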

