May 11, 2023 15:04:57


Python pandas reads table data but the commas are removed

Question


How do I get the original data for the "Winning No" column? How can I keep the commas from being removed?

The result should be 1,45,13,12,13 instead of 145131213.

Here is my code:

import requests
import pandas as pd

pd.options.display.width = 0
pd.options.display.max_rows = 1000

url = 'https://en.lottolyzer.com/history/singapore/toto'
html = requests.get(url).content
df_list = pd.read_html(html)

df = df_list[-1]
print(df)
df.to_csv('my data.csv')

Answer 1

Score: 0


By inspecting the read_html documentation, I found that the decimal and thousands parameters can help in cases like this, but they do not work in this specific one. I therefore propose a custom solution that also uses the BeautifulSoup library to parse the HTML more easily.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.lottolyzer.com/history/singapore/toto'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# Extract the initial DataFrame from the HTML table.
# Note that read_html returns a list of DataFrames (one for each table found in the HTML),
# so we need to select the first element (the webpage contains only one table).
df_initial = pd.read_html(html)[0]

# Find the HTML table containing the winning numbers.
table = soup.find('table', class_='responsive-table')

# Extract the raw text of the "Winning No." column, so the commas are kept.
# The last line of this block preserves the layout of the original HTML table,
# which repeats the column label in its last row.
winning_no = table.find_all('td', class_='sum-p1')[1::3]
winning_no = [num.text for num in winning_no]
winning_no.append(winning_no.pop(0))

# Create a DataFrame with the "Winning No." column.
df = pd.DataFrame({'Winning No.': winning_no})

# Replace the "Winning No." column in the initial DataFrame with the newly parsed column.
df_initial['Winning No.'] = df['Winning No.']
print(df_initial)
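The idea behind this workaround is to read each cell's raw text instead of letting a numeric parser interpret it. That can be tried without any network access. The following is only a minimal sketch under stated assumptions: it uses the standard library's html.parser rather than BeautifulSoup, and both the SAMPLE_HTML fragment and the CellText class are made up for illustration, not taken from the real page.

```python
from html.parser import HTMLParser

# Hypothetical fragment imitating the cells described in the answer;
# it is NOT the real markup of the lottery page.
SAMPLE_HTML = """
<table>
  <tr><td class="sum-p1">2023-05-08</td><td class="sum-p1">1,3,6,11,23,34</td><td class="sum-p1">39</td></tr>
  <tr><td class="sum-p1">2023-05-04</td><td class="sum-p1">16,21,25,28,37,44</td><td class="sum-p1">24</td></tr>
</table>
"""


class CellText(HTMLParser):
    """Collect the raw text of every <td> whose class is 'sum-p1'."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # Start a new text buffer when a matching cell opens.
        if tag == "td" and dict(attrs).get("class") == "sum-p1":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text is stored exactly as written, so commas survive.
        if self._in_cell:
            self.cells[-1] += data


parser = CellText()
parser.feed(SAMPLE_HTML)
print(parser.cells)
```

Because the cell text is never converted to a number, a value such as 1,3,6,11,23,34 stays a string with its commas intact.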

TL;DR:

The line winning_no = table.find_all('td', class_='sum-p1')[1::3] finds all the table cells (td elements) with class sum-p1 and keeps every third one, starting at index 1. Inspecting the full find_all result, before the slice is applied, gives this output:

[
    <td class="sum-p1">Date</td>,
    <td class="sum-p1">Winning No.</td>,
    <td class="sum-p1">Addl No.</td>,
    <td class="sum-p1">2023-05-08</td>,
    <td class="sum-p1">1,3,6,11,23,34</td>,
    <td class="sum-p1">39</td>,
    <td class="sum-p1">2023-05-04</td>,
    <td class="sum-p1">16,21,25,28,37,44</td>,
    <td class="sum-p1">24</td>,
    ...
]

As you can see, the Winning No. elements start at index 1 of this list and can be reached by walking the list from that index to the end with a step of 3 ([1::3]).
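The stride can be checked on a plain Python list. This is just an illustrative sketch: the cells list below is a hand-typed stand-in for the text of the td elements shown above, not output fetched from the page.

```python
# Stand-in for the text of each <td class="sum-p1"> cell, in document order.
cells = [
    "Date", "Winning No.", "Addl No.",
    "2023-05-08", "1,3,6,11,23,34", "39",
    "2023-05-04", "16,21,25,28,37,44", "24",
]

# Start at index 1 and take every third element: indices 1, 4, 7.
winning_no = cells[1::3]
print(winning_no)
# ['Winning No.', '1,3,6,11,23,34', '16,21,25,28,37,44']
```

The same slice with a start of 0 or 2 would pick out the Date or Addl No. cells instead.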

The line winning_no.append(winning_no.pop(0)), in turn, takes the header (the string Winning No.) off the head of the list and puts it in the last position. This way, when the DataFrame is rebuilt in the next step, the original structure of the table is preserved, with the column label both at the top of the table and in its last row.
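A small sketch of that rotation on a hand-typed list (the values are illustrative stand-ins, not data fetched from the page):

```python
# Header first, followed by the data rows, as produced by the [1::3] slice.
winning_no = ["Winning No.", "1,3,6,11,23,34", "16,21,25,28,37,44"]

# pop(0) removes and returns the first element; append puts it at the end.
winning_no.append(winning_no.pop(0))
print(winning_no)
# ['1,3,6,11,23,34', '16,21,25,28,37,44', 'Winning No.']
```

The list keeps its length, so it still lines up row by row with a DataFrame whose last row repeats the column labels.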
