How do I prevent a numpy ndarray column from being converted to string when saving a Pandas DataFrame to csv?
Question
I have a DataFrame that contains an "ID" column and a "Vector" column, where each entry is an array of shape (1, 500). I have to save the DataFrame as a csv file. When I read the saved csv back into a DataFrame, the arrays become strings and I can no longer use them with my functions.
For example, before saving, the vector column looks like this:
>>DataFrame_example["Vector"][0]
Out:
array([[-4.51561287e-02, -5.02060959e-03, 1.01038935e-02,
-3.24810972e-03, 8.50208327e-02, -3.12430300e-02,
-3.06447037e-02, -6.82420060e-02, 4.08798642e-02
...........................................
-6.08731210e-02, 4.24617827e-02, 2.90670991e-02,
1.87119041e-02, 5.67540973e-02, 4.65381369e-02,
3.42479758e-02, 9.88676678e-03, -1.62497200e-02,
1.46159781e-02, -6.39008060e-02]], dtype=float32)
>>type(DataFrame_example["Vector"][0])
Out: numpy.ndarray
But after saving as csv and reading it back, the same expression returns:
>>DataFrame_example["Vector"][0]
'[[-4.51561287e-02 -5.02060959e-03 1.01038935e-02 -3.24810972e-03\n 8.50208327e-02 -3.12430300e-02 -3.06447037e-02 -6.82420060e-02\n 4.08798642e-02 2.49120360e-03 -6.40684515e-02
............................................................................................
-5.22072986e-02\n 6.16791770e-02 -8.88353493e-03 1.65628344e-02 -5.95084354e-02\n -8.45786110e-02 -8.65871832e-03 3.98499370e-02 -3.41838486e-02\n -2.02250257e-02 5.18149361e-02 -5.80132604e-02 7.66506651e-03\n -5.49656115e-02 -6.08731210e-02 4.24617827e-02 2.90670991e-02\n 1.87119041e-02 5.67540973e-02 4.65381369e-02 3.42479758e-02\n 9.88676678e-03 -1.62497200e-02 1.46159781e-02 -6.39008060e-02]]'
How can I keep the format? Any help would be appreciated.
I am saving the DataFrame in csv format like this:
compression_opts = dict(method='zip',
archive_name=save_name+'.csv')
DataFrame_example.to_csv(save_name+'.zip', index=False,
compression=compression_opts)
I am reading it with:
DataFrame_example=read_csv("example.csv")
I have also tried reading it with delimiter="," or sep=",".
Answer 1
Score: 0
You need to strip the brackets from that string and split the result on whitespace. This gives you a list of strings that you can cast to floats.
bracket_strip = str.maketrans("", "", "[]")
new_column = []
# The column is named "Vector" in the question, so index it by that name.
for vector in DataFrame_example["Vector"]:
    # split() with no argument splits on any whitespace, including the "\n"
    # characters in the saved string, and never yields empty tokens.
    tokens = vector.translate(bracket_strip).split()
    new_vector = [float(val) for val in tokens]
    new_column.append(new_vector)
DataFrame_example["Vector"] = new_column
Something like that should do the trick. I just changed the variable names to yours.
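If the goal is to get back a numpy.ndarray with the original (1, n) shape rather than a list of floats, the same idea can be wrapped in a small helper. This is only a sketch: the name parse_vector and the sample string are made up for illustration, and it assumes the string was produced by NumPy's default printing of an array with at most 1000 elements (larger arrays are abbreviated with "..." by default, and the omitted values cannot be recovered from the text).

```python
import numpy as np

def parse_vector(text):
    """Turn a string such as '[[1.0 2.0\n 3.0]]' back into a (1, n) float32 array."""
    # Drop the brackets, then split on any run of whitespace (spaces or newlines).
    tokens = text.replace("[", "").replace("]", "").split()
    return np.array([float(t) for t in tokens], dtype=np.float32).reshape(1, -1)

# Hypothetical sample in the same layout as the string shown in the question.
example = "[[-4.51561287e-02 -5.02060959e-03\n 1.01038935e-02]]"
restored = parse_vector(example)
print(type(restored), restored.shape)
```

On the DataFrame from the question this would presumably be applied as DataFrame_example["Vector"] = DataFrame_example["Vector"].apply(parse_vector).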
Answer 2
Score: 0
You can use pandas.DataFrame.to_pickle instead:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.array([5, 6, 7, 8]) for _ in range(4)]})
# a b
#0 1 [5, 6, 7, 8]
#1 2 [5, 6, 7, 8]
#2 3 [5, 6, 7, 8]
#3 4 [5, 6, 7, 8]
type(df.b[0])
#<class 'numpy.ndarray'>
df.to_pickle("out.txt")
new = pd.read_pickle("out.txt")
type(new.b[0])
#<class 'numpy.ndarray'>
Answer 3
Score: 0
If you want to use a .csv file, then I would suggest converting to the corresponding datatype on read using the dtype argument:
> dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
Otherwise, you could save your data in another file format (parquet, pickle, ...). This can be done with pandas, for example with pandas to_parquet:
df.to_parquet('df.parquet.gzip',
compression='gzip')
pd.read_parquet('df.parquet.gzip')
The latter is often the better option performance-wise!
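For the .csv route, note that dtype only describes scalar column types, so on its own it will not turn a cell back into an array. The converters argument mentioned in the quoted documentation is one way to do that. The sketch below is an assumption-laden illustration, not the asker's actual setup: it uses a tiny made-up DataFrame, encodes each array as a JSON list before writing (so the cell text is unambiguous), and writes to an in-memory buffer instead of a file.

```python
import io
import json

import numpy as np
import pandas as pd

# Small stand-in for the DataFrame in the question (hypothetical data).
df = pd.DataFrame({
    "ID": [1, 2],
    "Vector": [np.arange(3, dtype=np.float32).reshape(1, 3),
               np.ones((1, 3), dtype=np.float32)],
})

# Write: encode each array as a JSON list so the CSV cell can be parsed back.
out = df.copy()
out["Vector"] = out["Vector"].apply(lambda a: json.dumps(a.tolist()))
buffer = io.StringIO()
out.to_csv(buffer, index=False)

# Read: the converter receives the raw cell text of the "Vector" column.
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    converters={"Vector": lambda s: np.array(json.loads(s), dtype=np.float32)},
)
print(type(restored["Vector"][0]), restored["Vector"][0].shape)
```

The same converters argument should work when reading from a real file path; the JSON encoding step is a design choice that avoids having to parse NumPy's printed representation.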