How to extract a date and the next two values using BeautifulSoup?

Question

I have an HTML file that has multiple tables that look like this:

<tr><td class='date_rec'>2/27/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.03</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.43</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>143</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>7.4</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>49.1</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>35.8</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>41.2</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>93</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>65</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>84</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>36.9</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>7.1</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>170.2</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>44.5</td><td class='data_rec_f'>Y</td></tr></table></td>
</tr>

What I'd like to do is find the cell with class='date_rec' that holds the date, extract the date, and then extract the 'data_rec_v' value from only the next 2 lines.

In this case, the extracted values would be '2/27/2023', '0.03', '0.43'. I've made many BS4 attempts without success. Can anyone give me a hand? Thanks in advance for any constructive suggestions!

Answer 1

Score: 0

You can use the following method. To find the date, search for the 'td' tag with the 'date_rec' class using find; this returns the first matching element. Searching for the 'data_rec_v' class with find_all returns all elements with that class, which you can then extract by their position in the list (here, a slice of the first two). Here is an example of the code:

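from bs4 import BeautifulSoup

# Open the file in a context manager so it is closed after parsing
with open('date.html', 'r') as file:
    soup = BeautifulSoup(file, "lxml")

# find() returns the first <td class="date_rec">, which holds the date
data = soup.find('td', class_='date_rec')
# find_all() returns every <td class="data_rec_v">; keep the first two
data_v = soup.find_all('td', class_='data_rec_v')[:2]
print(f'date - {data.text}')
for d in data_v:
    print(f'data_v - {d.text}')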

This prints:

date - 2/27/2023
data_v - 0.03
data_v - 0.43
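
Note that find_all on the whole soup only covers the first row when the file holds several rows like this. A minimal sketch of a per-row variant, assuming every date cell is a direct child of its own outer tr as in the question's sample; the helper name extract_rows, the second sample row and the use of html.parser are illustrative assumptions, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample in the question's format (second row is made up)
html = """
<table>
<tr><td class='date_rec'>2/27/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.03</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.43</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>143</td><td class='data_rec_f'> </td></tr></table></td>
</tr>
<tr><td class='date_rec'>2/28/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.10</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.20</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>150</td><td class='data_rec_f'> </td></tr></table></td>
</tr>
</table>
"""


def extract_rows(markup, n_values=2):
    """Return [date, value1, ..., valueN] for every outer row."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for cell in soup.find_all("td", class_="date_rec"):
        # Wrapper cells share the 'date_rec' class but contain a nested table
        if cell.find("table") is not None:
            continue
        # The nearest enclosing <tr> is the outer row of this date cell
        row = cell.find_parent("tr")
        values = row.find_all("td", class_="data_rec_v")[:n_values]
        results.append(
            [cell.get_text(strip=True)] + [v.get_text(strip=True) for v in values]
        )
    return results


print(extract_rows(html))
# [['2/27/2023', '0.03', '0.43'], ['2/28/2023', '0.10', '0.20']]
```

Scoping the second search to the row (row.find_all) rather than to the whole soup is what keeps each date paired with its own values.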

Answer 2

Score: 0

I created an empty list to receive the values of the 'td' cells, and a dictionary keyed by the date found in the cell with class='date_rec'. Then I put everything into the same dictionary, 'data':

from bs4 import BeautifulSoup
import pandas as pd

value = []
data = {}
with open('html.html', 'r') as fd:
    soup = BeautifulSoup(fd, 'html.parser')

    # Every <td class="data_rec_v"> holds one value
    values = soup.find_all('td', {'class': 'data_rec_v'})
    # The first <td class="date_rec"> holds the date
    date = soup.find('td', {'class': 'date_rec'})

    date = date.text

    # Collect the text of each value cell
    for i in values:
        value.append(i.text)

    # Use the date as the key (and later as the DataFrame column name)
    data[date] = value

    df = pd.DataFrame(data)
    print(df)

dict:
{'2/27/2023': ['0.03', '0.43', '143', '7.4', '49.1', '35.8', '41.2', '93', '65', '84', '36.9', '7.1', '170.2', '44.5']}
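
The dictionary above holds all fourteen values, while the question asks for only the first two. A minimal sketch of one way to narrow it, using the limit argument of find_all; the inline, shortened sample (in place of the 'html.html' file) is an assumption made to keep the example self-contained, and pandas is left out:

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question (only three value cells kept)
html = """
<tr><td class='date_rec'>2/27/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.03</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.43</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>143</td><td class='data_rec_f'> </td></tr></table></td>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")

# The first <td class="date_rec"> is the date cell
date = soup.find("td", class_="date_rec").get_text(strip=True)

# limit=2 stops the search after the first two matching value cells
cells = soup.find_all("td", class_="data_rec_v", limit=2)
data = {date: [c.get_text(strip=True) for c in cells]}

print(data)
# {'2/27/2023': ['0.03', '0.43']}
```

The resulting dictionary can still be passed to pd.DataFrame in the same way as before.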

Answer 3

Score: 0

There are various ways to reach your goal. Assuming this is not the only row to extract, you could use a list or dict to store the values.

The following list comprehension selects every row that has a direct child td with class="date_rec" and extracts its texts via stripped_strings, sliced to the first three:

[
    list(e.stripped_strings)[:3]
    for e in soup.select('tr:has(>td.date_rec)')
]

Or without storing values simply print:

for e in soup.select('tr:has(>td.date_rec)'):
    print(
        ','.join(
            list(e.stripped_strings)[:3]
        )
    )

Example

from bs4 import BeautifulSoup

html = '''
<tr><td class='date_rec'>2/27/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.03</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.43</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>143</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>7.4</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>49.1</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>35.8</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>41.2</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>93</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>65</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>84</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>36.9</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>7.1</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>170.2</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>44.5</td><td class='data_rec_f'>Y</td></tr></table></td>
</tr>
'''
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

[
    list(e.stripped_strings)[:3]
    for e in soup.select('tr:has(>td.date_rec)')
]

Output

[['2/27/2023', '0.03', '0.43']]
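
One caveat with slicing stripped_strings: it also yields the text of the flag cells (data_rec_f), so a 'Y' flag on the first value would land in the slice in place of the second value. A minimal sketch, assuming the same row layout, that selects td.data_rec_v explicitly and stores the result in a dict keyed by date; the sample row below is made up to show the flag case, and the names by_date and naive are illustrative:

```python
from bs4 import BeautifulSoup

# Made-up row in the question's format; the first value carries a 'Y' flag
html = """
<table>
<tr><td class='date_rec'>3/1/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.05</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.51</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>120</td><td class='data_rec_f'> </td></tr></table></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

by_date = {}
for row in soup.select("tr:has(> td.date_rec)"):
    # The first td.date_rec inside the row is the date cell itself
    date = row.select_one("td.date_rec").get_text(strip=True)
    # Only value cells are selected, so flag cells cannot slip in
    values = [td.get_text(strip=True) for td in row.select("td.data_rec_v")[:2]]
    by_date[date] = values

# For comparison: slicing stripped_strings picks up the flag text
naive = [list(row.stripped_strings)[:3] for row in soup.select("tr:has(> td.date_rec)")]

print(by_date)  # {'3/1/2023': ['0.05', '0.51']}
print(naive)    # [['3/1/2023', '0.05', 'Y']]
```

For rows like the one in the question, where the first two flag cells are blank, both approaches give the same result.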

huangapple
  • Posted on March 7, 2023, 09:42:22
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75657336.html