February 10, 2023, 11:03:27

Pandas is Reading .xlsx Column as Datetime rather than float

Question

I obtained an Excel file with complicated formatting for some cells. Here is a sample:

The "USDC Amount USDC" column has formatting of "General" for the header cell, and the following for cells C2 through C6:

I need to read this column into pandas as a float value. However, when I use

import pandas
df = pandas.read_excel('Book1.xlsx')
print(['USDC Amount USDC'])
print(df['USDC Amount USDC'])

I get

['USDC Amount USDC']
0                          NaT
1   1927-06-05 05:38:32.726400
2   1872-07-25 18:21:27.273600
3                          NaT
4                          NaT
Name: USDC Amount USDC, dtype: datetime64[ns]

I do not want these as datetimes; I want them as floats! If I remove the complicated formatting in the Excel document (changing column C to "General"), the values are read in as floats, which is what I want:

['USDC Amount USDC']
0             NaN
1    10018.235101
2   -10018.235101
3             NaN
4             NaN
Name: USDC Amount USDC, dtype: float64
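
For reference, the two outputs above look like the same numbers under two interpretations: Excel stores dates as day counts, so a numeric cell whose format is mistaken for a date format gets decoded as a datetime. The sketch below reverses that decoding. It assumes the workbook uses Excel's default 1900 date system (day zero taken as 1899-12-30); the helper names are hypothetical and are not part of pandas or openpyxl.

```python
from datetime import datetime

# Assumption: default 1900 date system, where serial 0 maps to 1899-12-30.
EXCEL_EPOCH = datetime(1899, 12, 30)


def datetime_to_excel_serial(value):
    """Convert a datetime back to the day count Excel stores in the cell."""
    return (value - EXCEL_EPOCH).total_seconds() / 86400


def datetimes_to_floats(series):
    """Apply the same conversion to a datetime64 pandas Series (NaT becomes NaN)."""
    import pandas as pd  # imported lazily so the helper above has no dependencies

    return (series - pd.Timestamp(EXCEL_EPOCH)) / pd.Timedelta(days=1)


# The two datetimes from the first output in the question.
first = datetime_to_excel_serial(datetime(1927, 6, 5, 5, 38, 32, 726400))
second = datetime_to_excel_serial(datetime(1872, 7, 25, 18, 21, 27, 273600))
print(round(first, 6), round(second, 6))
```

With a DataFrame read as in the question, a call such as df['USDC Amount USDC'] = datetimes_to_floats(df['USDC Amount USDC']) would be one way to recover the floats without touching the source file. Only nanosecond-level rounding differences are expected.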

The problem is that I have to download these Excel documents on a regular basis, and cannot modify them from the source. I have to get Pandas to understand (or ignore) this formatting and interpret the value as a float on its own.

I'm on Pandas 1.4.4, Windows 10, and Python 3.8. Any idea how to fix this? I cannot change the source Excel file; all the processing must be done in the Python script.

EDIT:

I added the sample Excel document in my comment below to download for reference. Also, here are some other package versions in case these matter:

openpyxl==3.0.3
xlrd==1.2.0
XlsxWriter==1.2.8

Answer 1

Score: 1

Updating OpenPyXL from 3.0.3 to 3.1.0 appears to resolve this issue. A quick glance at the changelog (https://openpyxl.readthedocs.io/en/stable/changes.html) suggests the fix is related to bugfix 1413 or 1500.
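
A minimal sketch of how a script could check for this before reading the file. The 3.1.0 threshold comes from the answer above; the helper names are hypothetical, and the version parsing is deliberately simple (it only looks at the leading digits of each dotted component).

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_OPENPYXL = "3.1.0"  # version reported above as fixing the problem


def version_tuple(text):
    """Turn '3.0.3' into (3, 0, 3), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def needs_upgrade(installed, minimum=MIN_OPENPYXL):
    """Return True when the installed version is older than the minimum."""
    return version_tuple(installed) < version_tuple(minimum)


def installed_openpyxl():
    """Return the installed openpyxl version string, or None if it is missing."""
    try:
        return version("openpyxl")
    except PackageNotFoundError:
        return None


def read_workbook(path):
    """Read the workbook with the openpyxl engine selected explicitly."""
    import pandas  # imported lazily so the checks above run without pandas

    return pandas.read_excel(path, engine="openpyxl")


current = installed_openpyxl()
if current is None:
    print("openpyxl is not installed")
elif needs_upgrade(current):
    print(f"openpyxl {current} is older than {MIN_OPENPYXL}; consider upgrading")
else:
    print(f"openpyxl {current} should include the fix")
```

If the check reports an old version, pip install --upgrade openpyxl is the usual way to update it.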

Answer 2

Score: 0

You could pass the dtype argument to read_excel, along the lines of

import numpy as np
import pandas
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC': np.float64})

but that comes with some issues. In particular, your source data contains characters that cannot be cast to a float. Your next best options are the object or string dtypes. So instead of np.float64, you would pass "string", resulting in

df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC': "string"})

After that, you need to extract the numbers from the column. Here's a resource that could help you get an idea of the overall process, although the exact method of doing so is up to you.

Finally, you would want to convert the now numbers-only column to floats. You can do that with the built-in casting:

df["numbers_only"] = df["numbers_only"].astype(np.float64)
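
One possible sketch of the extraction step, assuming the strings look something like "10,018.235101 USDC". That sample format, the regular expression, and the helper names are assumptions for illustration; they are not taken from the actual workbook. If the cells arrive as datetime strings instead, this approach will not recover the original amounts.

```python
import math
import re

# Optional sign, digits with optional thousands separators, optional decimals.
NUMBER_PATTERN = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")


def extract_number(text):
    """Return the first number found in text as a float, or NaN if there is none."""
    if not isinstance(text, str):
        return math.nan  # covers missing values such as None or pandas.NA
    match = NUMBER_PATTERN.search(text)
    if match is None:
        return math.nan
    return float(match.group().replace(",", ""))


def add_numbers_only(df, column="USDC Amount USDC"):
    """Add a float 'numbers_only' column derived from a string column."""
    df["numbers_only"] = df[column].map(extract_number).astype("float64")
    return df


print(extract_number("10,018.235101 USDC"))
print(extract_number("-10018.235101"))
```

Returning NaN rather than raising keeps empty or non-numeric cells as missing values, which matches how the blank rows appear in the float output shown in the question.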
