Python pandas: commas are removed when reading table data
Question
How do I get the original data for the "Winning No" column, so that the commas are not removed?
The result should be 1,45,13,12,13 instead of 145131213.
Here is my code:
import requests
import pandas as pd
pd.options.display.width = 0
pd.options.display.max_rows = 1000
url = 'https://en.lottolyzer.com/history/singapore/toto'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')
Answer 1
Score: 0
Looking at the read_html documentation, I found that the decimal and thousands parameters can help in cases like this, but they do not work in this specific one. So I propose a custom solution that also uses the BeautifulSoup library to make parsing the HTML easier.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://en.lottolyzer.com/history/singapore/toto'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
# Extract the initial DataFrame from the HTML table.
# Note that read_html returns a list of DataFrames (one per table found in the HTML),
# so we select the first element (the webpage contains only one table).
df_initial = pd.read_html(html)[0]
# Find the HTML table containing the winning numbers.
table = soup.find('table', class_='responsive-table')
# Extract the "Winning No." column as plain text, so the commas are kept.
# The last line of this block preserves the layout of the original HTML table,
# which repeats the column label in its last row.
winning_no = table.find_all('td', class_='sum-p1')[1::3]
winning_no = [num.text for num in winning_no]
winning_no.append(winning_no.pop(0))
# Create a DataFrame holding only the "Winning No." column.
df = pd.DataFrame({'Winning No.': winning_no})
# Replace the "Winning No." column in the initial DataFrame with the newly parsed column.
df_initial['Winning No.'] = df['Winning No.']
print(df_initial)
TL;DR:
The line winning_no = table.find_all('td', class_='sum-p1')[1::3] first finds all the table cells (td elements) with class sum-p1. Inspecting that result before the slice is applied gives this output:
[
<td class="sum-p1">Date</td>,
<td class="sum-p1">Winning No.</td>,
<td class="sum-p1">Addl No.</td>,
<td class="sum-p1">2023-05-08</td>,
<td class="sum-p1">1,3,6,11,23,34</td>,
<td class="sum-p1">39</td>,
<td class="sum-p1">2023-05-04</td>,
<td class="sum-p1">16,21,25,28,37,44</td>,
<td class="sum-p1">24</td>,
...
]
As you can see, the Winning No. elements start at index 1 of this list and can be reached by walking the list from that index to the end with a step of 3 ([1::3]).
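To see the slicing in isolation, here is a minimal sketch that uses only the standard library's html.parser on a tiny hand-written table. The sample HTML and the CellCollector class are assumptions made for illustration; they are not the real page and not part of the code above.

```python
from html.parser import HTMLParser

# Hand-written sample mimicking the cells shown above (an assumption, not the real page).
SAMPLE_HTML = """
<table>
<tr><td class="sum-p1">Date</td><td class="sum-p1">Winning No.</td><td class="sum-p1">Addl No.</td></tr>
<tr><td class="sum-p1">2023-05-08</td><td class="sum-p1">1,3,6,11,23,34</td><td class="sum-p1">39</td></tr>
<tr><td class="sum-p1">2023-05-04</td><td class="sum-p1">16,21,25,28,37,44</td><td class="sum-p1">24</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td class="sum-p1">."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # Start a new cell only for td elements carrying the sum-p1 class.
        if tag == "td" and dict(attrs).get("class") == "sum-p1":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text is kept exactly as written, so the commas survive.
        if self._in_cell:
            self.cells[-1] += data


collector = CellCollector()
collector.feed(SAMPLE_HTML)

# Cells come out flattened row by row, three per row, so the middle
# column sits at indices 1, 4, 7, ...
winning_no = collector.cells[1::3]
print(winning_no)
```

Because the cell text is never converted to a number, a value such as 1,3,6,11,23,34 stays a string with its commas intact.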
The line winning_no.append(winning_no.pop(0)), in turn, removes the header (the string Winning No.) from the front of the list and puts it in the last position. That way, when the DataFrame is rebuilt in the next step, the original structure of the table is preserved: the column label appears both at the top of the table and in its last row.
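The effect of that line can be checked on a short list. This is a sketch with made-up sample values standing in for the sliced column; the deque variant is shown only as an equivalent alternative and is not used in the answer's code.

```python
from collections import deque

# Sample values standing in for the sliced column (assumed for illustration).
original = ["Winning No.", "1,3,6,11,23,34", "16,21,25,28,37,44"]

# pop(0) removes and returns the first item; append puts it at the end.
winning_no = list(original)
winning_no.append(winning_no.pop(0))
print(winning_no)

# Equivalent alternative: rotating a deque one position to the left.
rotated = deque(original)
rotated.rotate(-1)
print(list(rotated))
```

Both approaches leave the data rows in their original order and move only the header string to the end.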