Python pandas: reading table data, but the commas are removed
Question

How do I keep the original data in the "Winning No." column? How can I stop the commas from being removed?

The result should be 1,45,13,12,13 instead of 145131213.

Here is my code:

import requests
import pandas as pd

pd.options.display.width = 0
pd.options.display.max_rows = 1000

url = 'https://en.lottolyzer.com/history/singapore/toto'
html = requests.get(url).content
df_list = pd.read_html(html)

df = df_list[-1]
print(df)
df.to_csv('my data.csv')

Answer 1

Score: 0


Looking at the read_html documentation, I found that the decimal and thousands parameters can help in cases like this, but they did not work in this specific one. I therefore propose a custom solution that also uses the BeautifulSoup library to make parsing the HTML easier.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.lottolyzer.com/history/singapore/toto'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# Extract the initial DataFrame from the HTML table.
# Note that read_html returns a list of DataFrames (one for each table found in the HTML),
# so we need to select the first element (the webpage contains only one table).
df_initial = pd.read_html(html)[0]

# Find the HTML table containing the winning numbers
table = soup.find('table', class_='responsive-table')

# Extract the "Winning No." column as raw text, which keeps the commas.
# The last line of this block preserves the format of the original HTML table,
# which repeats the column label in its last row.
winning_no = table.find_all('td', class_='sum-p1')[1::3]
winning_no = [num.text for num in winning_no]
winning_no.append(winning_no.pop(0))

# Create a DataFrame with the "Winning No." column
df = pd.DataFrame({'Winning No.': winning_no})

# Replace the "Winning No." column in the initial DataFrame with the newly parsed column.
df_initial['Winning No.'] = df['Winning No.']
print(df_initial)
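
The core idea above is to read each cell's text verbatim so that no numeric parser ever touches it. As a rough illustration only, here is a minimal sketch of that idea using just the standard library's html.parser. The SAMPLE_HTML string and the CellTextCollector class are hypothetical names made up for this example; the sample mimics the structure described in this answer and is not copied from the real page.

```python
from html.parser import HTMLParser

# Hypothetical sample that mimics the table structure described in this answer.
SAMPLE_HTML = """
<table class="responsive-table">
  <tr><td class="sum-p1">Date</td><td class="sum-p1">Winning No.</td><td class="sum-p1">Addl No.</td></tr>
  <tr><td class="sum-p1">2023-05-08</td><td class="sum-p1">1,3,6,11,23,34</td><td class="sum-p1">39</td></tr>
  <tr><td class="sum-p1">2023-05-04</td><td class="sum-p1">16,21,25,28,37,44</td><td class="sum-p1">24</td></tr>
</table>
"""


class CellTextCollector(HTMLParser):
    """Collect the raw text of every <td> that carries the given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.css_class in classes:
            self._in_cell = True
            self.cells.append("")  # start a new cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text is appended as-is, so commas are never interpreted or stripped.
        if self._in_cell:
            self.cells[-1] += data


collector = CellTextCollector("sum-p1")
collector.feed(SAMPLE_HTML)
cell_texts = collector.cells
print(cell_texts[4])  # 1,3,6,11,23,34
```

Because the cell content is handled as a plain string from start to finish, the commas survive, which is the same reason the BeautifulSoup version above works.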

TL;DR:

The line winning_no = table.find_all('td', class_='sum-p1')[1::3] finds all the table cells (td elements) with class sum-p1. Inspecting the result of find_all before the slice is applied gives this output:

[
    <td class="sum-p1">Date</td>,
    <td class="sum-p1">Winning No.</td>,
    <td class="sum-p1">Addl No.</td>,
    <td class="sum-p1">2023-05-08</td>,
    <td class="sum-p1">1,3,6,11,23,34</td>,
    <td class="sum-p1">39</td>,
    <td class="sum-p1">2023-05-04</td>,
    <td class="sum-p1">16,21,25,28,37,44</td>,
    <td class="sum-p1">24</td>,
    ...
]

As you can see, the Winning No. elements start at index 1 of this list and can be reached by stepping through it from that index to the end with a step of 3 ([1::3]).
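
To make the [1::3] slice concrete, here is a small sketch on a plain Python list. The cell_texts list is a hypothetical stand-in for the text of the cells shown in the output above, not data fetched from the page.

```python
# Hypothetical stand-in for the text of the <td class="sum-p1"> cells, in page order.
cell_texts = [
    "Date", "Winning No.", "Addl No.",
    "2023-05-08", "1,3,6,11,23,34", "39",
    "2023-05-04", "16,21,25,28,37,44", "24",
]

# Start at index 1 (the second column) and take every third element,
# because each table row contributes exactly three cells.
winning_no = cell_texts[1::3]
print(winning_no)  # ['Winning No.', '1,3,6,11,23,34', '16,21,25,28,37,44']
```

Note that this only works while every row has exactly three sum-p1 cells; if the page layout changes, the step would need to change with it.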

The line winning_no.append(winning_no.pop(0)), on the other hand, removes the header, i.e. the string Winning No., from the head of the list and puts it in the last position. That way, when the DataFrame is rebuilt in the next step, the original structure of the table is preserved, with the column label both at the top of the table (as the DataFrame's column name) and in its last row.
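
The effect of append(pop(0)) can be checked on a short list. The values below are a hypothetical sample, and the deque variant is shown only as an equivalent alternative, not as something the original answer uses.

```python
from collections import deque

# Hypothetical sample: the header string followed by two draws.
winning_no = ["Winning No.", "1,3,6,11,23,34", "16,21,25,28,37,44"]

# pop(0) removes and returns the first element; append puts it at the end.
winning_no.append(winning_no.pop(0))
print(winning_no)  # ['1,3,6,11,23,34', '16,21,25,28,37,44', 'Winning No.']

# The same left rotation by one position, expressed with collections.deque.
rotated = deque(["Winning No.", "1,3,6,11,23,34", "16,21,25,28,37,44"])
rotated.rotate(-1)
print(list(rotated))  # ['1,3,6,11,23,34', '16,21,25,28,37,44', 'Winning No.']
```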

huangapple
  • Posted on May 11, 2023 at 15:04:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76224920.html