How to extract a date and the next two values using BeautifulSoup?
Question
I have an HTML file that has multiple tables that look like this:
<tr><td class='date_rec'>2/27/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.03</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.43</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>143</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>7.4</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>49.1</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>35.8</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>41.2</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>93</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>65</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>84</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>36.9</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>7.1</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>170.2</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>44.5</td><td class='data_rec_f'>Y</td></tr></table></td>
</tr>
What I'd like to do is find the date (class='date_rec'), extract the date, and then extract the 'data_rec_v' value for only the next 2 lines.
In this case, the extracted values would be '2/27/2023', '0.03', '0.43'. I've made many BS4 attempts without success. Can anyone give me a hand? Thanks in advance for any constructive suggestions!
Answer 1
Score: 0
You can use the following approach. To find the date, pass the 'td' tag and the 'date_rec' class to find, which returns the first matching element. Searching for the 'data_rec_v' class with find_all returns every element with that class, and you can then pick the ones you need by their position in the list. Here is an example:
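from bs4 import BeautifulSoup

# Open the file in a context manager so it is closed after parsing
with open('date.html', 'r') as file:
    soup = BeautifulSoup(file, "lxml")

# find returns the first match, which is the date cell
data = soup.find('td', class_='date_rec')
# find_all returns every match; keep only the first two values
data_v = soup.find_all('td', class_='data_rec_v')[:2]

print(f'date - {data.text}')
for d in data_v:
    print(f'data_v - {d.text}')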
This prints:
date - 2/27/2023
data_v - 0.03
data_v - 0.43
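Note that find and find_all as used above search the whole document, so they only pick up the first row. Since the question mentions multiple tables, here is a minimal sketch of scoping the search to each row instead. The two-row sample and the extract_rows helper are assumptions made up for illustration, not part of the original file:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample in the same shape as the question's HTML.
html = """
<table>
<tr><td class='date_rec'>2/27/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.03</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.43</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>143</td><td class='data_rec_f'> </td></tr></table></td>
</tr>
<tr><td class='date_rec'>2/28/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.05</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.51</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>150</td><td class='data_rec_f'> </td></tr></table></td>
</tr>
</table>
"""


def extract_rows(markup):
    """Return a list of (date, [first two values]) tuples, one per outer row."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        # Only outer rows have a date_rec cell as a direct child;
        # the rows nested inside inside_table do not, so they are skipped.
        date_cell = row.find("td", class_="date_rec", recursive=False)
        if date_cell is None:
            continue
        # limit=2 stops the search after the first two value cells in this row.
        value_cells = row.find_all("td", class_="data_rec_v", limit=2)
        results.append(
            (date_cell.get_text(strip=True),
             [cell.get_text(strip=True) for cell in value_cells])
        )
    return results


rows = extract_rows(html)
for date, values in rows:
    print(date, values)
```

The recursive=False check is what separates the outer rows from the nested inside_table rows, and limit=2 avoids collecting values that would be thrown away anyway.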
Answer 2
Score: 0
I created an empty list to hold the 'td' values and a dictionary keyed by the date found in the cell with class='date_rec', then stored the list of values under that key in the 'data' dictionary.
from bs4 import BeautifulSoup
import pandas as pd

value = []
data = {}

with open('html.html', 'r') as fd:
    soup = BeautifulSoup(fd, 'html.parser')

# Every value cell in the document
values = soup.find_all('td', {'class': 'data_rec_v'})
# The first date_rec cell holds the date
date = soup.find('td', {'class': 'date_rec'})
date = str(date.text)

for i in values:
    finder_values = i.text
    value.append(finder_values)

data[date] = value
df = pd.DataFrame(data)
print(df)
The resulting dict:
{'2/27/2023': ['0.03', '0.43', '143', '7.4', '49.1', '35.8', '41.2', '93', '65', '84', '36.9', '7.1', '170.2', '44.5']}
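This dict holds every value in the row, while the question asks for only the first two after the date. A minimal sketch of trimming it with a slice follows. The names first_two and as_floats are made up for illustration, and converting to float is an assumption about how the values will be used:

```python
# The dict produced above, copied from the answer's output.
data = {'2/27/2023': ['0.03', '0.43', '143', '7.4', '49.1', '35.8', '41.2',
                      '93', '65', '84', '36.9', '7.1', '170.2', '44.5']}

# Keep only the first two values for each date.
first_two = {date: values[:2] for date, values in data.items()}

# Optionally convert the strings to numbers for further processing.
as_floats = {date: [float(v) for v in values] for date, values in first_two.items()}

print(first_two)
print(as_floats)
```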
Answer 3
Score: 0
There are various ways to reach your goal. Assuming this is not the only row to extract, you could use a list or dict to store the values.
The following list comprehension selects all rows that have a td with class="date_rec" as a direct child, extracts all of their texts via stripped_strings, and slices the result to the first three:
[
    list(e.stripped_strings)[:3]
    for e in soup.select('tr:has(>td.date_rec)')
]
Or, without storing the values, simply print them:
for e in soup.select('tr:has(>td.date_rec)'):
    print(
        ','.join(
            list(e.stripped_strings)[:3]
        )
    )
Example
from bs4 import BeautifulSoup
html = '''
<tr><td class='date_rec'>2/27/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.03</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.43</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>143</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>7.4</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>49.1</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>35.8</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>41.2</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>93</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>65</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>84</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>36.9</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>7.1</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>170.2</td><td class='data_rec_f'> </td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>44.5</td><td class='data_rec_f'>Y</td></tr></table></td>
</tr>
'''
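# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
print([
    list(e.stripped_strings)[:3]
    for e in soup.select('tr:has(>td.date_rec)')
])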
Output
[['2/27/2023', '0.03', '0.43']]
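One caveat: slicing stripped_strings relies on the 'data_rec_f' flag cells next to the first two values being blank. A minimal sketch of the difference, using a made-up row (not from the question) whose first flag is 'Y', and selecting the value cells by class instead of by position:

```python
from bs4 import BeautifulSoup

# Hypothetical row whose first flag cell is 'Y' rather than blank.
html = """
<table>
<tr><td class='date_rec'>2/28/2023</td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.10</td><td class='data_rec_f'>Y</td></tr></table></td>
<td class='date_rec'><table class='inside_table'><tr><td class='data_rec_v'>0.20</td><td class='data_rec_f'> </td></tr></table></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

by_position = []
by_class = []
for row in soup.select("tr:has(> td.date_rec)"):
    # Positional slice: takes the first three non-empty strings, flags included.
    by_position.append(list(row.stripped_strings)[:3])

    # Class-based: the first date_rec cell is the date, then the first two value cells.
    date = row.find("td", class_="date_rec").get_text(strip=True)
    values = [td.get_text(strip=True) for td in row.select("td.data_rec_v")[:2]]
    by_class.append([date, *values])

print(by_position)
print(by_class)
```

With this row, the positional slice picks up the 'Y' flag as its third item, while the class-based version still returns the date followed by the two values.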