Is there a way to specifically web scrape and get the data of heights that is not listed in text?

Question

I'm web scraping heights for a list of athletes. I have written code to get the heights, but after inspecting the element I noticed that the visible text gives the height in feet and inches, while the "data-sort" attribute gives the same height in inches. Both are on the td tag with class "height". However, when I use get_text() or .text to strip the HTML, it only prints the height in feet and drops the hidden height in inches. Is there a way to get the height in inches? That would make the math easier.

Here is an example of what I'm scraping. I want to remove everything else and keep only the height in inches, which would be [79, 85, 74, ...] in this case.

<td class="height" data-sort="79">6-7</td>
<td class="height" data-sort="85">7-1</td>
<td class="height" data-sort="74">6-2</td>
# This is my code

from bs4 import BeautifulSoup
import requests

urls=['https://goduke.com/sports/mens-basketball/roster']

ListData=[]
for x in range(len(urls)):
    page=requests.get(urls[x]).text
    pagesoup=BeautifulSoup(page,'html.parser')
    h=pagesoup.find_all('td', class_="height")
    ListData.append(h)
NewList=[]
for b in range(len(ListData)):
    new=[]
    for x in ListData[b]:
        print(x.text)

Answer 1

Score: 0

If you use a CSS selector, you can simply pass the first class name.

from scrapy.selector import Selector

Answer 2

Score: 0

from bs4 import BeautifulSoup
import requests

urls = ['https://goduke.com/sports/mens-basketball/roster']

ListData = []

for url in urls:
    page = requests.get(url).text
    pagesoup = BeautifulSoup(page, 'html.parser')
    # Select only the height cells that actually carry a data-sort attribute
    tds = pagesoup.select('td.height[data-sort]')
    for td in tds:
        # data-sort holds the height in inches, as a string
        ListData.append(td.attrs['data-sort'])
print(ListData)

Output:

['79', '85', '74', '74', '77', '77', '78', '77', '82', '85', '80', '84', '77', '84', '68']
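
The values in that output are strings, while the question asks for numbers to do math with. As a small follow-up sketch (not part of the original answer), the attribute can be converted with `int()`. The helper name `height_in_inches`, the inline HTML, and the fourth row without a `data-sort` attribute are all hypothetical, added only to show a fallback that parses the visible "feet-inches" text when the attribute is missing.

```python
from bs4 import BeautifulSoup

# First three cells are from the question; the last one (no data-sort)
# is a made-up case to exercise the fallback.
html = """
<table>
  <tr><td class="height" data-sort="79">6-7</td></tr>
  <tr><td class="height" data-sort="85">7-1</td></tr>
  <tr><td class="height" data-sort="74">6-2</td></tr>
  <tr><td class="height">6-5</td></tr>
</table>
"""


def height_in_inches(td):
    """Return the height of one cell in inches as an int (hypothetical helper)."""
    raw = td.get("data-sort")  # None if the attribute is absent
    if raw is not None and raw.isdigit():
        return int(raw)
    # Fallback: parse visible text such as "6-7" as feet and inches
    feet, inches = td.get_text(strip=True).split("-")
    return int(feet) * 12 + int(inches)


soup = BeautifulSoup(html, "html.parser")
heights = [height_in_inches(td) for td in soup.find_all("td", class_="height")]
print(heights)  # [79, 85, 74, 77]
```

Attributes on a BeautifulSoup tag behave like dictionary entries, which is why `.text` never shows them: `td["data-sort"]` raises `KeyError` when the attribute is missing, while `td.get("data-sort")` returns `None`.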

huangapple
  • Posted on February 18, 2023 at 12:51:44
  • When reposting, please keep the link to this post: https://go.coder-hub.com/75491269.html