March 9, 2023, 19:10:24

How to extract specific data from HTML?

Question

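The problem with the code in the question is that it assumes a "✅" is always present in the Jogos list. In practice the "✅" can appear next to any of the Jogo times, or there may be no "✅" at all, in which case jugos.index('✅') raises a ValueError.

To handle this, you can extract the data as follows: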

from bs4 import BeautifulSoup

html = """
<div class="body">
   <div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">
      01:08
   </div>
   <div class="from_name">
      &#129302;&#129351; &#119916;&#119938;&#119956;&#119962; &#119913;&#119952;&#119957; - &#119926;&#119959;&#119942;&#119955; 2.5 
   </div>
   <div class="text">
      Easy Bot - Over 2.5<br><br>⚽ Liga: Sul-Americana<br>⚽ Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>24h:100% de acerto nas últimas 24h<br><br>✅✅✅✅✅✅ .
   </div>
</div>
"""

# parse the HTML
soup = BeautifulSoup(html, 'html.parser')

d_str = soup.select_one('div.date.details')['title']
calendar = d_str.split(" ")
print("Date:", calendar[0])
print("Time:", calendar[1])

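text_div = soup.select_one('div.text')
for sts in text_div.stripped_strings:
    if "Jogos:" in sts:
        # Tokens after the label, e.g. ['✅', '04:10', '04:13', '04:16', '(04:19)']
        jogos = sts.split("Jogos:")[1].split()
        # 1-based position of the checkmarked time, or 0 if there is no checkmark
        checkmarked = jogos.index('✅') + 1 if '✅' in jogos else 0
        jogos = [t for t in jogos if t != '✅']
        print("Jogo 1:", jogos[0])
        print("Checkmarked:", checkmarked)

This code extracts the Jogos times from the text and finds the position of the "✅". If there is no "✅", Checkmarked is set to 0, so the data is extracted correctly whichever Jogo time the "✅" belongs to.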

I need to extract from HTML the following data: Date, Time, the first Jogo and, if the checkmark symbol is present on the first, second, third or fourth time or display 0 if there's no checkmark.

The code returns all the data correctly when the checkmark is present, but when there's no checkmark it only displays the Date and Time.

My Python code is:


from bs4 import BeautifulSoup

html = """
<div class="body">
   <div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">
      01:08
   </div>
   <div class="from_name">
      &#129302;&#129351; &#119916;&#119938;&#119956;&#119962; &#119913;&#119952;&#119957; - &#119926;&#119959;&#119942;&#119955; 2.5
   </div>
   <div class="text">
      Easy Bot - Over 2.5<br><br>&#127942; Liga: Sul-Americana<br>&#128678; Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>&#127808; 24h:100% de acerto nas &#250;ltimas 24h<br><br>✅✅✅✅✅✅ .
   </div>
</div>
"""

# parse the HTML
soup = BeautifulSoup(html, 'html.parser')

d_str = soup.select_one('div.date.details')['title']
calendar = d_str.split(" ")
print("Date: ",calendar[0])
print("Time: ",calendar[1])
for sts in soup.select_one('div.text').stripped_strings:
    if "⚽ Jogos: " in sts:
        jugos = (sts.split('⚽ Jogos: ')[1].split(" "))
        ind = jugos.index('✅')+1
        jugos.remove("✅")
        print("Jogo 1: ", jugos[0])
        print("Checkmarked: ", ind)

Note that the checkmark "belongs" to the time. So it can be on the first, second, third or fourth Jogos time.

But sometimes there will be no checkmark. When this happens, the output should be:

Date: XXX
Time: XXX
Jogo 1: XXX
Checkmarked: 0

So, what's the problem with this code?

Answer 1

Score: 1

I would do it this way:

# Continues from the question's code: soup is the parsed BeautifulSoup document.
d_str = soup.select_one('div.date.details')['title']
calendar = d_str.split(" ")
print("Date: ", calendar[0])
print("Time: ", calendar[1])
for sts in soup.select_one('div.text').stripped_strings:
    if "⚽ Jogos: " in sts:
        jugos = sts.split('⚽ Jogos: ')[1].split(" ")
        if "✅" in jugos:
            # 1-based position of the checkmarked time
            ind = jugos.index('✅') + 1
            jugos.remove("✅")
            print(jugos)
            print("Checkmarked: ", ind)
        else:
            # No checkmark on the Jogos line
            print(jugos)
            print("Checkmarked: NA")

Output with check mark:

Date:  09.03.2023
Time:  01:08:10
['04:10', '04:13', '04:16', '(04:19)']
Checkmarked:  1

Output without check mark:

Date:  09.03.2023
Time:  01:08:10
['04:10', '04:13', '04:16', '(04:19)']
Checkmarked: NA

Of course, instead of printing you can add the output to a list, etc.
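
As a rough sketch of that idea, the extraction could be wrapped in a function that returns a dict per message, so the results can be collected in a list. The names extract_message, CHECK, jogo_1 and checkmarked are made up for this illustration, the sample HTML is a shortened stand-in for the markup in the question, and 0 is used for "no checkmark" as the question asks:

```python
from bs4 import BeautifulSoup

CHECK = "✅"  # hypothetical constant name


def extract_message(html):
    """Return date, time, first Jogo time and checkmark position (0 if none)."""
    soup = BeautifulSoup(html, "html.parser")
    # The title attribute looks like "09.03.2023 01:08:10 UTC-03:00"
    date, time = soup.select_one("div.date.details")["title"].split(" ")[:2]
    record = {"date": date, "time": time, "jogo_1": None, "checkmarked": 0}
    for text in soup.select_one("div.text").stripped_strings:
        if "Jogos:" not in text:
            continue
        tokens = text.split("Jogos:")[1].split()
        if CHECK in tokens:
            # The checkmark precedes the time it belongs to, so its
            # 0-based index + 1 is the 1-based position of that time.
            record["checkmarked"] = tokens.index(CHECK) + 1
        times = [t for t in tokens if t != CHECK]
        if times:
            record["jogo_1"] = times[0]
        break
    return record


# Shortened sample markup (an assumption, modelled on the question's HTML)
with_mark = (
    '<div class="date details" title="09.03.2023 01:08:10 UTC-03:00">01:08</div>'
    '<div class="text">Easy Bot<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br>Link</div>'
)
without_mark = with_mark.replace("✅ ", "")

records = [extract_message(with_mark), extract_message(without_mark)]
for r in records:
    print(r)
```

Returning a record instead of printing keeps the "no checkmark" case as ordinary data, so the caller decides how to display it.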

Answer 2

Score: 0

I'm not sure what the expected result should look like or how many items there are to iterate over, so treat this code as a pointer in the right direction.

The issue here is that string='⚽ Jogos:' needs an exact match.
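
To illustrate the exact-match point, here is a small sketch. It assumes the string='⚽ Jogos:' above is BeautifulSoup's string argument to find(), and it uses a shortened stand-in for the question's markup. A plain string has to equal the whole text node, while a compiled regular expression or a function can match part of it:

```python
import re
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption, modelled on the question's HTML)
html = '<div class="text">Easy Bot<br>⚽ Jogos: ✅ 04:10 04:13<br>Link</div>'
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so this finds nothing.
exact = soup.find(string="⚽ Jogos:")

# A compiled pattern is applied with search(), so a partial match is enough.
by_regex = soup.find(string=re.compile("Jogos:"))

# A function receives each text node and can apply any test.
by_func = soup.find(string=lambda s: s is not None and "Jogos:" in s)

print(exact)
print(by_regex)
print(by_func)
```

The substring test on stripped_strings used in the example that follows sidesteps the same problem without a regular expression.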

Example

from bs4 import BeautifulSoup

html = '''
<div class="body">
   <div class="pull_right date details" title="09.03.2023 01:08:10 UTC-03:00">
      01:08
   </div>
   <div class="from_name">
      &#129302;&#129351; &#119916;&#119938;&#119956;&#119962; &#119913;&#119952;&#119957; - &#119926;&#119959;&#119942;&#119955; 2.5
   </div>
   <div class="text">
      Easy Bot - Over 2.5<br><br>&#127942; Liga: Sul-Americana<br>&#128678; Entrada: Over 2.5 FT<br>⚽ Jogos: ✅ 04:10 04:13 04:16 (04:19)<br><br><strong>Link: </strong><a href="https://www.bet365.com/#/AVR/B146/R%5E1/">https://www.bet365.com/#/AVR/B146/R%5E1/</a><br><br>&#127808; 24h:100% de acerto nas &#250;ltimas 24h<br><br>✅✅✅✅✅✅ .
   </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

data = []

for e in soup.select('.text'):
    d = None  # reset for each message so a missing Jogos line is detected
    for s in e.stripped_strings:
        if 'Jogos' in s:
            # drop the leading '⚽' and 'Jogos:' tokens
            s = s.split()[2:]
            jogo_times = [t for t in s if '✅' not in t]
            # the time that follows each checkmark (expression inferred from the output shown)
            jogo_check = [s[i+1] for i, t in enumerate(s) if '✅' in t and i + 1 < len(s)]
            d = {
                'date': e.find_previous('div', {'class': 'date'}).get('title')[:10],
                'time': e.find_previous('div', {'class': 'date'}).get_text(strip=True),
                'jogo_times': jogo_times,
                'jogo_time_checked': jogo_check
            }
            break
    if d:
        data.append(d)
    else:
        print('no jogo')
data

Output

[{'date': '09.03.2023',
  'time': '01:08',
  'jogo_times': ['04:10', '04:13', '04:16', '(04:19)'],
  'jogo_time_checked': ['04:10']}]
