Pandas将.xlsx列读取为日期时间而不是浮点数。

huangapple go评论61阅读模式
英文:

Pandas is Reading .xlsx Column as Datetime rather than float

问题

以下是您要翻译的内容:

"I obtained an Excel file with complicated formatting for some cells. Here is a sample:

The "USDC Amount USDC" column has formatting of "General" for the header cell, and the following for cells C2 through C6:

I need to read this column into pandas as a float value. However, when I use

import pandas
df = pandas.read_excel('Book1.xlsx')
print(['USDC Amount USDC'])
print(df['USDC Amount USDC'])

I get

['USDC Amount USDC']
0                          NaT
1   1927-06-05 05:38:32.726400
2   1872-07-25 18:21:27.273600
3                          NaT
4                          NaT
Name: USDC Amount USDC, dtype: datetime64[ns]

I do not want these as datetimes, I want them as floats! If I remove the complicated formatting in the Excel document (change it to "general" in column C), they are read in as float values, like this, which is what I want:

['USDC Amount USDC']
0             NaN
1    10018.235101
2   -10018.235101
3             NaN
4             NaN
Name: USDC Amount USDC, dtype: float64

The problem is that I have to download these Excel documents on a regular basis, and cannot modify them from the source. I have to get Pandas to understand (or ignore) this formatting and interpret the value as a float on its own.

I'm on Pandas 1.4.4, Windows 10, and Python 3.8. Any idea how to fix this? I cannot change the source Excel file, all the processing must be done in the Python script.

EDIT:

I added the sample Excel document in my comment below to download for reference. Also, here are some other package versions in case these matter:

openpyxl==3.0.3
xlrd==1.2.0
XlsxWriter==1.2.8
英文:

I obtained an Excel file with complicated formatting for some cells. Here is a sample:

Pandas将.xlsx列读取为日期时间而不是浮点数。

The "USDC Amount USDC" column has formatting of "General" for the header cell, and the following for cells C2 through C6:

Pandas将.xlsx列读取为日期时间而不是浮点数。

I need to read this column into pandas as a float value. However, when I use

import pandas
df = pandas.read_excel('Book1.xlsx')
print(['USDC Amount USDC'])
print(df['USDC Amount USDC'])

I get

['USDC Amount USDC']
0                          NaT
1   1927-06-05 05:38:32.726400
2   1872-07-25 18:21:27.273600
3                          NaT
4                          NaT
Name: USDC Amount USDC, dtype: datetime64[ns]

I do not want these as datetimes, I want them as floats! If I remove the complicated formatting in the Excel document (change it to "general" in column C), they are read in as float values, like this, which is what I want:

['USDC Amount USDC']
0             NaN
1    10018.235101
2   -10018.235101
3             NaN
4             NaN
Name: USDC Amount USDC, dtype: float64

The problem is that I have to download these Excel documents on a regular basis, and cannot modify them from the source. I have to get Pandas to understand (or ignore) this formatting and interpret the value as a float on its own.

I'm on Pandas 1.4.4, Windows 10, and Python 3.8. Any idea how to fix this? I cannot change the source Excel file, all the processing must be done in the Python script.

EDIT:

I added the sample Excel document in my comment below to download for reference. Also, here are some other package versions in case these matter:

openpyxl==3.0.3
xlrd==1.2.0
XlsxWriter==1.2.8

答案1

得分: 1

3.0.3更新到3.1.0解决了这个问题。快速查看更改日志似乎与 bugfix 1413 或 1500 有关。

英文:

It appears updating OpenPyXL from 3.0.3 to 3.1.0 resolved this issue. A quick glance at the changelog (https://openpyxl.readthedocs.io/en/stable/changes.html) suggests it appears to be related to bugfix 1413 or 1500.

答案2

得分: 0

你可以在read_excel中使用dtype参数,如下所示:

import numpy as np
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC': np.float64})

但这可能会带来一些问题。特别是,你的源数据包含无法转换为浮点数的字符。你的下一个最佳选择是使用objectstring数据类型。因此,你可以这样做:

df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC': "string"})

之后,你需要从列中提取数字。你可以参考这个资源,了解整个过程的基本思路,但具体的方法由你决定。

最后,你可能想将现在只包含数字的列转换为浮点数。你可以使用内置的类型转换方法,如下所示:

df["numbers_only"] = df["numbers_only"].astype(np.float64)
英文:

You could use the dtype input in read_excel to be along the lines of

import numpy as np
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC':np.float64})

but that comes with some issues. Particularly, your source data contains characters that can't be casted into a float. Your next best options are the object or string dtypes. So instead of :np.float64, you would do something like :"string" instead, resulting in

df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC':"string"})

After that, you need to extract the numbers from the column. Here's a resource that could help you get an idea of the overall process, although the exact method of doing so is up to you.

Finally, you would want to convert the now numbers-only column to floats. You can do it with the inbuilt casting which is

df["numbers_only"] = df["numbers_only"].astype(np.float64)

huangapple
  • 本文由 发表于 2023年2月10日 11:03:27
  • 转载请务必保留本文链接:https://go.coder-hub.com/75406542.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定