英文:
Pandas is Reading .xlsx Column as Datetime rather than float
问题
以下是您要翻译的内容:
"I obtained an Excel file with complicated formatting for some cells. Here is a sample:
The "USDC Amount USDC" column has formatting of "General" for the header cell, and the following for cells C2 through C6:
I need to read this column into pandas as a float value. However, when I use
import pandas
df = pandas.read_excel('Book1.xlsx')
print(['USDC Amount USDC'])
print(df['USDC Amount USDC'])
I get
['USDC Amount USDC']
0 NaT
1 1927-06-05 05:38:32.726400
2 1872-07-25 18:21:27.273600
3 NaT
4 NaT
Name: USDC Amount USDC, dtype: datetime64[ns]
I do not want these as datetimes, I want them as floats! If I remove the complicated formatting in the Excel document (change it to "general" in column C), they are read in as float values, like this, which is what I want:
['USDC Amount USDC']
0 NaN
1 10018.235101
2 -10018.235101
3 NaN
4 NaN
Name: USDC Amount USDC, dtype: float64
The problem is that I have to download these Excel documents on a regular basis, and cannot modify them from the source. I have to get Pandas to understand (or ignore) this formatting and interpret the value as a float on its own.
I'm on Pandas 1.4.4, Windows 10, and Python 3.8. Any idea how to fix this? I cannot change the source Excel file, all the processing must be done in the Python script.
EDIT:
I added the sample Excel document in my comment below to download for reference. Also, here are some other package versions in case these matter:
openpyxl==3.0.3
xlrd==1.2.0
XlsxWriter==1.2.8
英文:
I obtained an Excel file with complicated formatting for some cells. Here is a sample:
The "USDC Amount USDC" column has formatting of "General" for the header cell, and the following for cells C2 through C6:
I need to read this column into pandas as a float value. However, when I use
import pandas
df = pandas.read_excel('Book1.xlsx')
print(['USDC Amount USDC'])
print(df['USDC Amount USDC'])
I get
['USDC Amount USDC']
0 NaT
1 1927-06-05 05:38:32.726400
2 1872-07-25 18:21:27.273600
3 NaT
4 NaT
Name: USDC Amount USDC, dtype: datetime64[ns]
I do not want these as datetimes, I want them as floats! If I remove the complicated formatting in the Excel document (change it to "general" in column C), they are read in as float values, like this, which is what I want:
['USDC Amount USDC']
0 NaN
1 10018.235101
2 -10018.235101
3 NaN
4 NaN
Name: USDC Amount USDC, dtype: float64
The problem is that I have to download these Excel documents on a regular basis, and cannot modify them from the source. I have to get Pandas to understand (or ignore) this formatting and interpret the value as a float on its own.
I'm on Pandas 1.4.4, Windows 10, and Python 3.8. Any idea how to fix this? I cannot change the source Excel file, all the processing must be done in the Python script.
EDIT:
I added the sample Excel document in my comment below to download for reference. Also, here are some other package versions in case these matter:
openpyxl==3.0.3
xlrd==1.2.0
XlsxWriter==1.2.8
答案1
得分: 1
从3.0.3
更新到3.1.0
解决了这个问题。快速查看更改日志似乎与 bugfix 1413 或 1500 有关。
英文:
It appears updating OpenPyXL from 3.0.3
to 3.1.0
resolved this issue. A quick glance at the changelog (https://openpyxl.readthedocs.io/en/stable/changes.html) suggests it appears to be related to bugfix 1413 or 1500.
答案2
得分: 0
你可以在read_excel
中使用dtype
参数,如下所示:
import numpy as np
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC': np.float64})
但这可能会带来一些问题。特别是,你的源数据包含无法转换为浮点数的字符。你的下一个最佳选择是使用object
或string
数据类型。因此,你可以这样做:
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC': "string"})
之后,你需要从列中提取数字。你可以参考这个资源,了解整个过程的基本思路,但具体的方法由你决定。
最后,你可能想将现在只包含数字的列转换为浮点数。你可以使用内置的类型转换方法,如下所示:
df["numbers_only"] = df["numbers_only"].astype(np.float64)
英文:
You could use the dtype input in read_excel to be along the lines of
import numpy as np
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC':np.float64})
but that comes with some issues. Particularly, your source data contains characters that can't be casted into a float. Your next best options are the object or string dtypes. So instead of :np.float64, you would do something like :"string" instead, resulting in
df = pandas.read_excel('Book1.xlsx', dtype={'USDC Amount USDC':"string"})
After that, you need to extract the numbers from the column. Here's a resource that could help you get an idea of the overall process, although the exact method of doing so is up to you.
Finally, you would want to convert the now numbers-only column to floats. You can do it with the inbuilt casting which is
df["numbers_only"] = df["numbers_only"].astype(np.float64)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论