How do I prevent a numpy ndarray column from being converted to string when saving a Pandas DataFrame to csv?

Question

I have a DataFrame with an "ID" column and a "Vector" column that holds arrays of shape (1, 500). I have to save the DF as CSV. When I load the saved CSV back into a DF, each array has become a string and I can no longer pass it to my functions.

For example, before saving the DF, the vector column looks like this:

>>DataFrame_example["Vector"][0]
Out:

array([[-4.51561287e-02, -5.02060959e-03,  1.01038935e-02,
        -3.24810972e-03,  8.50208327e-02, -3.12430300e-02,
        -3.06447037e-02, -6.82420060e-02,  4.08798642e-02
             ...........................................
        -6.08731210e-02,  4.24617827e-02,  2.90670991e-02,
         1.87119041e-02,  5.67540973e-02,  4.65381369e-02,
         3.42479758e-02,  9.88676678e-03, -1.62497200e-02,
         1.46159781e-02, -6.39008060e-02]], dtype=float32)


>>type(DataFrame_example["Vector"][0])
Out: numpy.ndarray


But after saving it as CSV and reading it back, the same call returns:

>>DataFrame_example["Vector"][0]

'[[-4.51561287e-02 -5.02060959e-03  1.01038935e-02 -3.24810972e-03\n   8.50208327e-02 -3.12430300e-02 -3.06447037e-02 -6.82420060e-02\n   4.08798642e-02  2.49120360e-03 -6.40684515e-02  
 ............................................................................................
-5.22072986e-02\n   6.16791770e-02 -8.88353493e-03  1.65628344e-02 -5.95084354e-02\n  -8.45786110e-02 -8.65871832e-03  3.98499370e-02 -3.41838486e-02\n  -2.02250257e-02  5.18149361e-02 -5.80132604e-02  7.66506651e-03\n  -5.49656115e-02 -6.08731210e-02  4.24617827e-02  2.90670991e-02\n   1.87119041e-02  5.67540973e-02  4.65381369e-02  3.42479758e-02\n   9.88676678e-03 -1.62497200e-02  1.46159781e-02 -6.39008060e-02]]'

How can I keep the original format? Any help would be appreciated.

I save the DF in CSV format like this:

compression_opts = dict(method='zip',
                        archive_name=save_name+'.csv')
DataFrame_example.to_csv(save_name+'.zip', index=False,
          compression=compression_opts)  
 

I read it back with:

DataFrame_example=read_csv("example.csv")

I have also tried reading it with delimiter="," or sep=",".

Answer 1

Score: 0

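You need to strip the brackets from that string and split the result on whitespace (the string contains newlines and runs of spaces, not just single spaces). This gives you a list of strings that you can cast to floats.

import numpy as np

# Translation table that deletes the square brackets
bracket_strip = str.maketrans("", "", "[]")

new_column = []
for vector in DataFrame_example["Vector"]:
    # split() with no argument splits on any run of whitespace, including "\n"
    values = vector.translate(bracket_strip).split()

    # Cast every token to float and restore the (1, N) float32 shape
    new_vector = np.array([float(val) for val in values], dtype=np.float32)
    new_column.append(new_vector.reshape(1, -1))

DataFrame_example["Vector"] = new_column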

Something like that should do the trick. I just changed the variable names to yours.
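
The same idea can be wrapped in a small function and checked with a round trip through CSV. This is only a sketch: `parse_vector`, the two-row frame, and the in-memory buffer are made up for illustration and stand in for the real data and file.

```python
import io

import numpy as np
import pandas as pd


def parse_vector(text: str, dtype=np.float32) -> np.ndarray:
    """Turn the printed form of a (1, N) array back into an ndarray."""
    # Drop the brackets, then split on any run of whitespace (spaces or "\n").
    tokens = text.replace("[", "").replace("]", "").split()
    return np.array([float(tok) for tok in tokens], dtype=dtype).reshape(1, -1)


# Hypothetical stand-in for the frame in the question.
rng = np.random.default_rng(0)
original = pd.DataFrame({
    "ID": [1, 2],
    "Vector": [rng.random((1, 500), dtype=np.float32) for _ in range(2)],
})

# Write to CSV (an in-memory buffer here) and read it back: "Vector" is now str.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Rebuild the arrays from their printed form.
restored["Vector"] = restored["Vector"].apply(parse_vector)
print(type(restored["Vector"][0]), restored["Vector"][0].shape)
```

One caveat: NumPy abbreviates the printed form of arrays with more than 1000 elements using "...", so this kind of parsing only works while the vectors stay below that size (500 values per row here) or the print threshold has been raised before saving.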

Answer 2

Score: 0

You can use pandas.DataFrame.to_pickle instead:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.array([5, 6, 7, 8]) for _ in range(4)],
})

#   a             b
#0  1  [5, 6, 7, 8]
#1  2  [5, 6, 7, 8]
#2  3  [5, 6, 7, 8]
#3  4  [5, 6, 7, 8]

type(df.b[0])
#<class 'numpy.ndarray'>

df.to_pickle("out.txt")

new = pd.read_pickle("out.txt")
type(new.b[0])
#<class 'numpy.ndarray'>
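
To see whether this also holds for arrays shaped like the ones in the question, here is a minimal sketch with (1, 500) float32 vectors. The frame contents and the file name `frame.pkl` are made up for illustration. Keep in mind that pickle files should only be loaded when they come from a source you trust.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame shaped like the one in the question.
frame = pd.DataFrame({
    "ID": [1, 2],
    "Vector": [np.full((1, 500), i, dtype=np.float32) for i in range(2)],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")  # hypothetical file name
    frame.to_pickle(path)
    loaded = pd.read_pickle(path)

# The cell is still an ndarray with its original shape and dtype.
first = loaded["Vector"][0]
print(type(first), first.shape, first.dtype)
```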

Answer 3

Score: 0

If you want to keep using a .csv file, I would suggest converting the column to the corresponding data type on read, using the dtype argument:

> dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
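
As far as I know, dtype on its own has no type that can parse the printed form of an array, so in practice the converters parameter mentioned in the quoted text is the piece that does the work here. A minimal sketch, assuming the "Vector" column from the question; `to_array` and the inline CSV text are made up for illustration.

```python
import io

import numpy as np
import pandas as pd


def to_array(cell: str) -> np.ndarray:
    """Converter: rebuild a (1, N) float32 array from its printed form."""
    tokens = cell.replace("[", "").replace("]", "").split()
    return np.array([float(tok) for tok in tokens], dtype=np.float32).reshape(1, -1)


# Hypothetical CSV text with the same layout as the file in the question:
# the array cell is quoted and may contain line breaks.
csv_text = 'ID,Vector\n1,"[[ 0.5 -1.25\n   2.  ]]"\n2,"[[1. 2. 3.]]"\n'

# The converter runs on the raw text of every "Vector" cell while parsing.
frame = pd.read_csv(io.StringIO(csv_text), converters={"Vector": to_array})
print(frame["Vector"][0])
```

With a real file, the same `converters` dict can be passed together with the path of the saved file.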

Otherwise, you should save your data in another file format (parquet, pickle...). This can be done with pandas: pandas to_parquet.

df.to_parquet('df.parquet.gzip',
              compression='gzip')  
pd.read_parquet('df.parquet.gzip')

The latter is often a better option performance-wise!
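
If the file has to stay a CSV, one more option (not taken from the answers above, so treat it as a sketch) is to serialize each array explicitly before writing instead of relying on its printed form. The example below uses JSON for that. The frame and the in-memory buffer are made up for illustration.

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical frame with small (1, 3) float32 vectors.
frame = pd.DataFrame({
    "ID": [1, 2],
    "Vector": [np.arange(3, dtype=np.float32).reshape(1, 3) + i for i in range(2)],
})

# Write: replace each array by a JSON string so the CSV cell is unambiguous.
out = frame.copy()
out["Vector"] = out["Vector"].apply(lambda a: json.dumps(a.tolist()))
buffer = io.StringIO()
out.to_csv(buffer, index=False)

# Read: decode the JSON string back into a float32 array while parsing.
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    converters={"Vector": lambda s: np.array(json.loads(s), dtype=np.float32)},
)
print(restored["Vector"][1])
```

Because the nested list is written out in full, the shape comes back as it was and nothing is abbreviated for long vectors, at the cost of a larger file.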

huangapple
  • Published on May 26, 2023, 17:05:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76339292.html