2023-04-19 18:33:36

Is pd.Timestamp treated as dtype('O')?

Question

I developed a simple function which returns an empty dataframe with correct column names and dtypes from a dictionary:

import numpy as np
import pandas as pd

def return_empty_dataframe(schema: dict[str, np.dtype]):
    # Create the columns first, then cast each one to the requested dtype.
    return pd.DataFrame(columns=schema.keys()).astype(schema)

It is used like this:

import numpy as np 

schema = {'time': np.datetime64, 'power': float, 'profile': str}
empty_df = return_empty_dataframe(schema)

I wanted to make it possible to declare a column as pd.Timestamp. Since pandas does not accept its own pd.Timestamp type here and requires a timestamp column to be of type np.datetime64, I added the following snippet to my function (it converts pd.Timestamp to np.datetime64 in the schema used to build the dataframe):

def return_empty_dataframe(schema: dict[str, np.dtype]): 
    dict_col_types_no_timestamp = {key: val for key, val in schema.items() if val != pd.Timestamp} 
    dict_col_types_just_timestamp = {key: np.datetime64 for key, val in schema.items() if val == pd.Timestamp} 
    dict_col_types = dict_col_types_no_timestamp | dict_col_types_just_timestamp

    return pd.DataFrame(columns=dict_col_types.keys()).astype(dict_col_types)

So far so good: I can now declare my columns as pd.Timestamp.

schema = {'time': pd.Timestamp, 'power': float, 'profile': str}
empty_df = return_empty_dataframe(schema)  # works

However, I have a problem when I use this function with some automatic column type detection, as it seems columns of dtype object (dtype('O')) are interpreted as pd.Timestamp.

To check that:

pd.Timestamp == np.dtype('O')  # usually I have dtype('O') for string, or mixed types
> True

Is this expected behaviour?

This is a problem for me because, for instance,

schema = {'string_data': np.dtype('O'), 'power': float, 'profile': str}
empty_df = return_empty_dataframe(schema)  # works

turns the column string_data into an np.datetime64 column.

Answer 1

Score: 1

What you are seeing is a long-standing limitation of NumPy dtypes that pandas has had to work around for a long time: essentially, anything that isn't a number is treated as an object.
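
For context, the True in the question seems to come from how NumPy compares dtypes rather than from pandas itself: when a dtype is compared with something that is not a dtype, NumPy first tries to convert it with np.dtype(...), and a Python class it has no specific mapping for becomes object. A minimal sketch of that behaviour (the Dummy class is hypothetical and only there for illustration):

```python
import numpy as np
import pandas as pd


class Dummy:
    """Hypothetical class, used only to show the fallback to object."""


# Classes without a dedicated NumPy mapping are converted to the object dtype.
ts_as_dtype = np.dtype(pd.Timestamp)
dummy_as_dtype = np.dtype(Dummy)

# Equality against a dtype applies the same conversion to the other operand,
# so pd.Timestamp (and any other plain class) compares equal to dtype('O').
eq_timestamp = pd.Timestamp == np.dtype("O")
eq_dummy = Dummy == np.dtype("O")

# An identity check performs no conversion at all.
same_object = pd.Timestamp is np.dtype("O")

print(ts_as_dtype, dummy_as_dtype, eq_timestamp, eq_dummy, same_object)
```

So the equality test says nothing specific about pd.Timestamp; an identity check (is) or isinstance(val, type) distinguishes the class from an actual dtype instance.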

pandas 2.0 adds support for additional data types, notably those provided by PyArrow, so you can now be much more explicit about column types because more of them are available to you.

With pyarrow this works without any problems:

import pandas as pd
import numpy as np
import pyarrow as pa  # must be installed for the '[pyarrow]' dtype strings

def return_empty_dataframe(schema: dict[str, np.dtype]):
    return pd.DataFrame(columns=schema.keys()).astype(schema)

# 'timestamp[us][pyarrow]' is a date-time type; 'time64[us][pyarrow]' would be a time-of-day type
schema = {'time': 'timestamp[us][pyarrow]', 'power': 'float32[pyarrow]', 'profile': 'string[pyarrow]'}
empty_df = return_empty_dataframe(schema)

I don't know whether this will solve your problem, and the pyarrow backend is still at an early stage of development, but it is worth keeping in mind because it gives you better access to specific data types.
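
If you would rather stay with NumPy dtypes, one possible workaround is to compare by identity instead of equality, so that dtype('O') is no longer mistaken for pd.Timestamp. This is only a sketch under a few assumptions: the function name return_empty_dataframe_by_identity is made up for this example, and the explicit 'datetime64[ns]' unit is my choice rather than something from the question.

```python
import numpy as np
import pandas as pd


def return_empty_dataframe_by_identity(schema: dict) -> pd.DataFrame:
    """Hypothetical variant of the question's function.

    `val is pd.Timestamp` is only True for the pd.Timestamp class itself,
    whereas `val == pd.Timestamp` is also True for dtype('O').
    """
    col_types = {
        # Assumption: nanosecond resolution is what the caller wants.
        key: "datetime64[ns]" if val is pd.Timestamp else val
        for key, val in schema.items()
    }
    return pd.DataFrame(columns=list(col_types)).astype(col_types)


schema = {"time": pd.Timestamp, "string_data": np.dtype("O"), "power": float}
empty_df = return_empty_dataframe_by_identity(schema)
print(empty_df.dtypes)
```

Building the mapping in a single pass also keeps the columns in the order given by the schema, which the two-dictionary merge in the question does not guarantee for timestamp columns.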
