Is pd.Timestamp treated as equal to dtype('O')?

Question

I developed a simple function which returns an empty dataframe with correct column names and dtypes from a dictionary:

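import numpy as np  # needed for the np.dtype annotation below
import pandas as pd

def return_empty_dataframe(schema: dict[str, np.dtype]):
    # Create the columns first, then cast each one to its requested dtype
    return pd.DataFrame(columns=schema.keys()).astype(schema)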

which is used like this:

import numpy as np 

schema = {'time': np.datetime64, 'power': float, 'profile': str}  
empty_df = return_empty_dataframe(schema)

I wanted to add the possibility of declaring a column as pd.Timestamp. Since pandas does not interpret pd.Timestamp as a datetime dtype and needs a timestamp column to be np.datetime64, I added the following snippet to my function, which replaces pd.Timestamp with np.datetime64 in the schema used to build the dataframe:

def return_empty_dataframe(schema: dict[str, np.dtype]): 
    dict_col_types_no_timestamp = {key: val for key, val in schema.items() if val != pd.Timestamp} 
    dict_col_types_just_timestamp = {key: np.datetime64 for key, val in schema.items() if val == pd.Timestamp} 
    dict_col_types = dict_col_types_no_timestamp | dict_col_types_just_timestamp

    return pd.DataFrame(columns=dict_col_types.keys()).astype(dict_col_types)

So far so good: I can now declare columns as pd.Timestamp:

schema = {'time': pd.Timestamp, 'power': float, 'profile': str}  
empty_df = return_empty_dataframe(schema)  # works

However, I have a problem when I use this function with some automatic column type detection, as it seems columns of dtype object (dtype('O')) are interpreted as pd.Timestamp.

To check that:

pd.Timestamp == np.dtype('O')  # usually I have dtype('O') for string, or mixed types
> True

Is this expected behaviour?

This is a problem for me because, for instance, with

schema = {'string_data': np.dtype('O'), 'power': float, 'profile': str}  
empty_df = return_empty_dataframe(schema)  # runs, but with the wrong dtype

the column string_data is turned into an np.datetime64 column.

Answer 1

Score: 1

Generally, what you are experiencing is a long-standing limitation of NumPy dtypes that pandas has had to work around for a long time. Basically, everything that isn't a number ends up as an object, and that includes arbitrary Python classes such as pd.Timestamp.
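To make that concrete, here is a small sketch of why the comparison in the question comes out True. It relies on my understanding that NumPy coerces the other operand with np.dtype() before comparing dtypes. The variable names are only illustrative, and the behaviour may differ between NumPy versions.

```python
import numpy as np
import pandas as pd

# np.dtype() falls back to the object dtype for a Python class it does not
# recognise, and pd.Timestamp is such a class.
coerced = np.dtype(pd.Timestamp)
print(coerced)

# Comparing a dtype with "==" first coerces the other operand the same way,
# so this is effectively dtype('O') == dtype('O').
equal_to_object = pd.Timestamp == np.dtype("O")
print(equal_to_object)

# An identity check does no coercion: only the class itself matches.
same_object = np.dtype("O") is pd.Timestamp
print(same_object)

# Plain Python types are not affected either way.
float_matches = float == pd.Timestamp
print(float_matches)
```

So the equality test in the question cannot tell pd.Timestamp apart from an object dtype, whereas an `is` comparison can.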

Since pandas 2.0 there is support for additional data types, provided by PyArrow. You can therefore be much more explicit about data types, because more of them are available to you.

With pyarrow this works without any problems:

import pandas as pd
import numpy as np
import pyarrow as pa

def return_empty_dataframe(schema: dict[str, np.dtype]): 
    return pd.DataFrame(columns=schema.keys()).astype(schema)

schema = {'time': 'time64[us][pyarrow]', 'power': 'float32[pyarrow]', 'profile': 'string[pyarrow]'}  
empty_df = return_empty_dataframe(schema)

I don't know whether this will solve your problem, and the pyarrow backend is still at an early stage of development, but it is worth keeping in mind because it gives you better access to specific data types.
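Independently of the backend, the false match from the question can probably be avoided by testing for pd.Timestamp with `is` instead of `==`. The following is only a minimal sketch of that idea, not a definitive implementation. The constant TIMESTAMP_DTYPE is a name I made up, and I use 'datetime64[ns]' on the assumption that recent pandas versions reject a unit-less np.datetime64 in astype.

```python
import numpy as np
import pandas as pd

# Hypothetical constant: the dtype substituted for pd.Timestamp (an assumption).
TIMESTAMP_DTYPE = "datetime64[ns]"

def return_empty_dataframe(schema: dict) -> pd.DataFrame:
    # "is" compares identity, so np.dtype('O') no longer matches pd.Timestamp
    dtypes = {
        col: TIMESTAMP_DTYPE if typ is pd.Timestamp else typ
        for col, typ in schema.items()
    }
    return pd.DataFrame(columns=list(dtypes)).astype(dtypes)

schema = {
    "string_data": np.dtype("O"),
    "time": pd.Timestamp,
    "power": float,
    "profile": str,
}
empty_df = return_empty_dataframe(schema)
print(empty_df.dtypes)
```

With this check, string_data should stay an object column and only time is converted to a datetime column.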

huangapple
  • Posted on April 19, 2023 at 18:33:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76053482.html