pandas: convert string column to array of float
Question
I have the following DataFrame:
import pandas as pd

df = pd.DataFrame(
    {'feature_idx': ['(4,)', '(4, 15)', '(1, 4, 15)',
                     '(1, 4, 15, 176)', '(1, 4, 15, 89, 176)'],
     'cv_scores': ['[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]',
                   '[0.79227296 0.82675175 0.83723801 0.92502134 0.82625185]',
                   '[0.82134069 0.84987581 0.86420576 0.93398567 0.84150328]',
                   '[0.83244816 0.86689598 0.87095624 0.9445071 0.85839512]',
                   '[0.84192526 0.87788764 0.87939774 0.95181742 0.86563099]']}
)
df.head(2)
feature_idx cv_scores
0 (4,) [0.71936929 0.75262699 0.77660679 0.85333625 0...
1 (4, 15) [0.79227296 0.82675175 0.83723801 0.92502134 0...
The column cv_scores contains the 5-fold cross-validation scores as a single string per row, for example:
>>> df.iloc[0]['cv_scores']
'[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]'
>>> type(df.iloc[0]['cv_scores'])
str
I would like to add a column avg_score holding the average score for each row (the sum of the five values in cv_scores divided by 5).
Since cv_scores is a string, I need a way to convert this column to an array of floats in order to derive the new column.
Expected results
>>> df_result
feature_idx cv_scores avg_score
0 (4,) [0.71936929 0.75262699 0.77660679 0.85333625 0... 0.773186
1 (4, 15) [0.79227296 0.82675175 0.83723801 0.92502134 0... 0.841507
2 (1, 4, 15) [0.82134069 0.84987581 0.86420576 0.93398567 0... 0.862182
3 (1, 4, 15, 176) [0.83244816 0.86689598 0.87095624 0.9445071 0... 0.874641
4 (1, 4, 15, 89, 176) [0.84192526 0.87788764 0.87939774 0.95181742 0... 0.883332
Answer 1
Score: 3
You can use numpy.fromstring to convert your strings:
import numpy as np

def str_to_mean(s):
    # Drop the surrounding brackets, parse the space-separated floats, then average
    return np.fromstring(s[1:-1], sep=' ').mean()
df['avg_score'] = df['cv_scores'].apply(str_to_mean)
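If the parsed arrays themselves are needed, as the question title suggests, they can be kept in their own column before averaging. The following is only a minimal sketch: the helper parse_scores and the column name cv_array are hypothetical names introduced for illustration, and it uses str.split with np.array rather than numpy.fromstring. Only the first two rows of the question's data are reproduced.

```python
import numpy as np
import pandas as pd

# First two rows of the cv_scores column from the question
df = pd.DataFrame({
    'cv_scores': ['[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]',
                  '[0.79227296 0.82675175 0.83723801 0.92502134 0.82625185]'],
})

def parse_scores(s):
    # Hypothetical helper: strip the brackets, split on whitespace,
    # and convert the pieces to a float array
    return np.array(s.strip('[]').split(), dtype=float)

# Hypothetical column name: each cell holds a NumPy array of floats
df['cv_array'] = df['cv_scores'].apply(parse_scores)
df['avg_score'] = df['cv_array'].apply(np.mean)
```

Storing arrays in a column gives it object dtype, so this is mainly useful when the individual fold scores are needed later; for the mean alone, the one-step function above is enough.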
Alternative with str.extractall, pandas.to_numeric, and groupby.mean:
df['avg_score'] = (
    # \. matches a literal decimal point; an unescaped . would match any character
    pd.to_numeric(df['cv_scores'].str.extractall(r'(\d+\.?\d*)')[0],
                  errors='coerce')
      .groupby(level=0).mean()
)
Output:
feature_idx cv_scores avg_score
0 (4,) [0.71936929 0.75262699 0.77660679 0.85333625 0... 0.773186
1 (4, 15) [0.79227296 0.82675175 0.83723801 0.92502134 0... 0.841507
2 (1, 4, 15) [0.82134069 0.84987581 0.86420576 0.93398567 0... 0.862182
3 (1, 4, 15, 176) [0.83244816 0.86689598 0.87095624 0.9445071 0... 0.874641
4 (1, 4, 15, 89, 176) [0.84192526 0.87788764 0.87939774 0.95181742 0... 0.883332
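A further variation, not part of the original answer, is to split each string into one column per fold and average across the columns. This is a sketch that assumes every row holds whitespace-separated numbers wrapped in square brackets; the intermediate name folds is hypothetical, and only the first two rows of the question's data are reproduced.

```python
import pandas as pd

# First two rows of the cv_scores column from the question
df = pd.DataFrame({
    'cv_scores': ['[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]',
                  '[0.79227296 0.82675175 0.83723801 0.92502134 0.82625185]'],
})

# Strip the brackets, split on whitespace into one column per fold,
# and convert every cell to float
folds = (df['cv_scores']
         .str.strip('[]')
         .str.split(expand=True)
         .astype(float))

# Row-wise mean over the fold columns
df['avg_score'] = folds.mean(axis=1)
```

The wide folds frame also makes other per-row statistics, such as folds.std(axis=1), available without parsing the strings again.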