How to extract specific data from HTML?

Question

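The problem with the code in the question is that it assumes the "✅" symbol always appears in the list of Jogos times. In practice, the "✅" may appear next to any of the Jogo times, or there may be no "✅" at all, in which case list.index() raises a ValueError and nothing is printed after the date and time.

To solve this, you can extract the data as follows: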

from bs4 import BeautifulSoup

html = """
<div class="body">
   <div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">
      01:08
   </div>
   <div class="from_name">
      &#129302;&#129351; &#119916;&#119938;&#119956;&#119962; &#119913;&#119952;&#119957; - &#119926;&#119959;&#119942;&#119955; 2.5 
   </div>
   <div class="text">
      Easy Bot - Over 2.5<br><br>⚽ Liga: Sul-Americana<br>⚽ Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>24h:100% de acerto nas últimas 24h<br><br>✅✅✅✅✅✅ .
   </div>
</div>
"""

# parse the HTML
soup = BeautifulSoup(html, 'html.parser')

d_str = soup.select_one('div.date.details')['title']
calendar = d_str.split(" ")
print("Date:", calendar[0])
print("Time:", calendar[1])

text_div = soup.select_one('div.text')
for sts in text_div.stripped_strings:
    if "Jogos: " in sts:
        jogos = sts.split("Jogos: ")[1].split()
        # 1-based position of the check-marked time, or 0 if there is none
        checkmarked = jogos.index("✅") + 1 if "✅" in jogos else 0
        if checkmarked:
            jogos.remove("✅")
        print("Jogo 1:", jogos[0])
        print("Checkmarked:", checkmarked)

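This code extracts the Jogos times from the text and finds the position of the "✅" symbol. If there is no "✅", it sets Checkmarked to 0, so the data is extracted correctly wherever the "✅" appears.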
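The position logic can also be checked on its own, without any HTML. The following is a minimal sketch, not code from the thread: parse_jogos is a hypothetical helper name, and it assumes the check mark is a separate whitespace-delimited token placed directly before the time it belongs to, as in the sample line.

```python
def parse_jogos(line, marker="✅"):
    """Return (times, checked) for a line such as '⚽ Jogos: ✅ 04:10 04:13'.

    checked is the 1-based position of the time that follows the marker,
    or 0 when the line has no marker. parse_jogos is a hypothetical helper.
    """
    tokens = line.split("Jogos:", 1)[1].split()
    checked = tokens.index(marker) + 1 if marker in tokens else 0
    times = [t for t in tokens if t != marker]
    return times, checked


# Check mark on the first time, on the third time, and absent.
print(parse_jogos("⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)"))
print(parse_jogos("⚽ Jogos: 04:10 04:13 ✅ 04:16 (04:19)"))
print(parse_jogos("⚽ Jogos: 04:10 04:13 04:16 (04:19)"))
```

The three calls should report positions 1, 3, and 0 respectively, with the same four times in each case.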

Original question:

I need to extract the following data from HTML: the date, the time, the first Jogo time, and the position of the checkmark symbol (whether it is on the first, second, third, or fourth time), or 0 if there is no checkmark.

The code returns all the data correctly when the checkmark is present, but when there is no checkmark it only displays the date and time.

My Python code is:


from bs4 import BeautifulSoup

html = """
<div class="body">
   <div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">
      01:08
   </div>
   <div class="from_name">
      &#129302;&#129351; &#119916;&#119938;&#119956;&#119962; &#119913;&#119952;&#119957; - &#119926;&#119959;&#119942;&#119955; 2.5
   </div>
   <div class="text">
      Easy Bot - Over 2.5<br><br>&#127942; Liga: Sul-Americana<br>&#128678; Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>&#127808; 24h:100% de acerto nas &#250;ltimas 24h<br><br>✅✅✅✅✅✅ .
   </div>
</div>
"""

# parse the HTML
soup = BeautifulSoup(html, 'html.parser')

d_str = soup.select_one('div.date.details')['title']
calendar = d_str.split(" ")
print("Date: ",calendar[0])
print("Time: ",calendar[1])
for sts in soup.select_one('div.text').stripped_strings:
    if "⚽ Jogos: " in sts:
        jugos = (sts.split('⚽ Jogos: ')[1].split(" "))
        ind = jugos.index('✅')+1
        jugos.remove("✅")
        print("Jogo 1: ", jugos[0])
        print("Checkmarked: ", ind)

Note that the checkmark "belongs" to a time, so it can be on the first, second, third, or fourth Jogos time.

But sometimes there will be no checkmark. When this happens, the output should be:

Date: XXX
Time: XXX
Jogo 1: XXX
Checkmarked: 0

So, what's the problem with this code?

Answer 1

Score: 1

I would do it this way:

d_str = soup.select_one('div.date.details')['title']
calendar = d_str.split(" ")
print("Date: ", calendar[0])
print("Time: ", calendar[1])
for sts in soup.select_one('div.text').stripped_strings:
    if "⚽ Jogos: " in sts:
        jugos = sts.split('⚽ Jogos: ')[1].split(" ")
        if "✅" in jugos:
            # 1-based position of the time that follows the check mark
            ind = jugos.index('✅') + 1
            jugos.remove("✅")
            print(jugos)
            print("Checkmarked: ", ind)
        else:
            print(jugos)
            print("Checkmarked: NA")

Output with check mark:

Date:  09.03.2023
Time:  01:08:10
['04:10', '04:13', '04:16', '(04:19)']
Checkmarked:  1

Output without check mark:

Date:  09.03.2023
Time:  01:08:10
['04:10', '04:13', '04:16', '(04:19)']
Checkmarked: NA

Of course, instead of printing you can add the output to a list, etc.
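As a sketch of that last suggestion, the loop body can return one dictionary per message instead of printing. The name build_record and the dictionary keys are illustrative assumptions, not part of the answer. To keep the sketch free of dependencies, it takes the title attribute and the text lines as plain arguments.

```python
def build_record(title, lines):
    """Build one result dict from the date div's title and the text div's lines.

    title is the 'title' attribute, e.g. '09.03.2023 01:08:10 UTC-03:00';
    lines is any iterable of strings, such as div.text's stripped_strings.
    build_record is a hypothetical helper, not part of the original answer.
    """
    date, time = title.split(" ")[:2]
    record = {"date": date, "time": time, "jogos": [], "checkmarked": 0}
    for sts in lines:
        if "⚽ Jogos: " in sts:
            jogos = sts.split("⚽ Jogos: ")[1].split()
            if "✅" in jogos:
                # 1-based position of the time that follows the check mark
                record["checkmarked"] = jogos.index("✅") + 1
                jogos.remove("✅")
            record["jogos"] = jogos
    return record


results = []
results.append(build_record(
    "09.03.2023 01:08:10 UTC-03:00",
    ["Easy Bot - Over 2.5", "⚽ Jogos: 04:10 ✅ 04:13 04:16 (04:19)"],
))
print(results)
```

With BeautifulSoup, the two arguments would be soup.select_one('div.date.details')['title'] and soup.select_one('div.text').stripped_strings. Here 0 stands for "no check mark", as the question requests, rather than NA.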

Answer 2

Score: 0

I'm not sure what the expected result should look like or how many items there are to iterate over, so this code should just point you in the right direction.

The issue here is that string='⚽ Jogos:' needs an exact match.

Example

from bs4 import BeautifulSoup

html = '''
<div class="body">
   <div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">
      01:08
   </div>
   <div class="from_name">
      &#129302;&#129351; &#119916;&#119938;&#119956;&#119962; &#119913;&#119952;&#119957; - &#119926;&#119959;&#119942;&#119955; 2.5
   </div>
   <div class="text">
      Easy Bot - Over 2.5<br><br>&#127942; Liga: Sul-Americana<br>&#128678; Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>&#127808; 24h:100% de acerto nas &#250;ltimas 24h<br><br>✅✅✅✅✅✅ .
   </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

data = []

for e in soup.select('.text'):
    d = None
    for s in e.stripped_strings:
        if 'Jogos' in s:
            # drop the leading '⚽' and 'Jogos:' tokens
            s = s.split()[2:]
            jogo_times = [t for t in s if '✅' not in t]
            # time that follows each check mark
            jogo_check = [s[i+1] for i, t in enumerate(s) if '✅' in t]
            d = {
                'date': e.find_previous('div', {'class': 'date'}).get('title')[:10],
                'time': e.find_previous('div', {'class': 'date'}).get_text(strip=True),
                'jogo_times': jogo_times,
                'jogo_time_checked': jogo_check
            }
            break
    if d:
        data.append(d)
    else:
        print('no jogo')
data

Output

[{'date': '09.03.2023',
  'time': '01:08',
  'jogo_times': ['04:10', '04:13', '04:16', '(04:19)'],
  'jogo_time_checked': ['04:10']}]
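The remark about string= needing an exact match can be illustrated with a short sketch. It assumes an earlier attempt searched with soup.find(string='⚽ Jogos:'), which the question as shown here does not include; the one-line HTML is a shortened stand-in for the sample.

```python
import re

from bs4 import BeautifulSoup

html = '<div class="text">Easy Bot<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br></div>'
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so this finds nothing.
exact = soup.find(string="⚽ Jogos:")

# A compiled pattern is applied as a search, so a partial match is enough.
partial = soup.find(string=re.compile("Jogos:"))

print(exact)
print(partial)
```

The first lookup returns None because the text node also contains the times. The second returns the full line, which can then be split as in the example above.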

huangapple
  • Posted on March 9, 2023 at 19:10:24
  • Please keep the link to this article when reposting: https://go.coder-hub.com/75683797.html