pandas: convert string column to array of float

Question

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame(
    {'feature_idx': ['(4,)', '(4, 15)', '(1, 4, 15)', '(1, 4, 15, 176)', '(1, 4, 15, 89, 176)'],
     'cv_scores': ['[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]',
                   '[0.79227296 0.82675175 0.83723801 0.92502134 0.82625185]',
                   '[0.82134069 0.84987581 0.86420576 0.93398567 0.84150328]',
                   '[0.83244816 0.86689598 0.87095624 0.9445071  0.85839512]',
                   '[0.84192526 0.87788764 0.87939774 0.95181742 0.86563099]']}
)

>>> df.head(2)
  feature_idx                                          cv_scores
0        (4,)  [0.71936929 0.75262699 0.77660679 0.85333625 0...
1     (4, 15)  [0.79227296 0.82675175 0.83723801 0.92502134 0...

The cv_scores column holds the scores of a 5-fold cross-validation as a single string, for example:

>>> df.iloc[0]['cv_scores']
'[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]'

>>> type(df.iloc[0]['cv_scores'])
str

I would like to add a column avg_score holding the average score for each feature set (the sum of its cv_scores divided by 5).

Since each cv_scores value is a string, I need a way to convert the column to arrays of floats so that I can compute the new column.

Expected results

>>> df_result
           feature_idx                                          cv_scores  avg_score
0                 (4,)  [0.71936929 0.75262699 0.77660679 0.85333625 0...   0.773186
1              (4, 15)  [0.79227296 0.82675175 0.83723801 0.92502134 0...   0.841507
2           (1, 4, 15)  [0.82134069 0.84987581 0.86420576 0.93398567 0...   0.862182
3      (1, 4, 15, 176)  [0.83244816 0.86689598 0.87095624 0.9445071  0...   0.874641
4  (1, 4, 15, 89, 176)  [0.84192526 0.87788764 0.87939774 0.95181742 0...   0.883332

Answer 1

Score: 3

You can use numpy.fromstring to convert your strings:

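import numpy as np

def str_to_mean(s):
    # s[1:-1] drops the surrounding brackets; sep=' ' parses whitespace-separated numbers
    return np.fromstring(s[1:-1], sep=' ').mean()

df['avg_score'] = df['cv_scores'].apply(str_to_mean)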
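
The question also asks for the column as arrays of floats, not only the mean. A minimal sketch of that variant follows, assuming every string has the bracketed, whitespace-separated form shown in the question. The helper str_to_array and the column name cv_array are hypothetical names chosen for illustration, and only two of the question's rows are reproduced.

```python
import numpy as np
import pandas as pd

# Two rows copied from the question; the second one contains a double space.
df = pd.DataFrame({
    'cv_scores': [
        '[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]',
        '[0.83244816 0.86689598 0.87095624 0.9445071  0.85839512]',
    ]
})

def str_to_array(s):
    # Strip the surrounding brackets, then parse the whitespace-separated numbers.
    return np.fromstring(s.strip('[]'), sep=' ')

# 'cv_array' (hypothetical name) holds one NumPy float array per row.
df['cv_array'] = df['cv_scores'].apply(str_to_array)

# With real arrays available, the mean is computed per row.
df['avg_score'] = df['cv_array'].apply(lambda a: a.mean())
```

Keeping the parsed arrays around is useful if other statistics (standard deviation, minimum, and so on) are needed later, since the strings only have to be parsed once.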

Alternative with str.extractall, pandas.to_numeric, and groupby.mean:

# \. is escaped so that it matches only a literal decimal point
df['avg_score'] = (pd.to_numeric(df['cv_scores'].str.extractall(r'(\d+\.?\d*)')[0],
                                 errors='coerce')
                     .groupby(level=0).mean()
                  )
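
Another pandas-only variant, not part of the original answer, avoids the regular expression by splitting on whitespace. This is a sketch under the assumption that every row contains the same number of scores (five here); the intermediate name scores is chosen for illustration, and only two of the question's rows are reproduced.

```python
import pandas as pd

# Two rows copied from the question; the second one contains a double space.
df = pd.DataFrame({
    'cv_scores': [
        '[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]',
        '[0.83244816 0.86689598 0.87095624 0.9445071  0.85839512]',
    ]
})

# Strip the brackets, split on runs of whitespace into one column per fold,
# then convert every cell to float.
scores = (df['cv_scores']
          .str.strip('[]')
          .str.split(expand=True)
          .astype(float))

# Row-wise mean across the fold columns.
df['avg_score'] = scores.mean(axis=1)
```

The intermediate scores frame has one column per fold, so per-fold statistics are also available from it if needed.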

Output:

           feature_idx                                          cv_scores  avg_score
0                 (4,)  [0.71936929 0.75262699 0.77660679 0.85333625 0...   0.773186
1              (4, 15)  [0.79227296 0.82675175 0.83723801 0.92502134 0...   0.841507
2           (1, 4, 15)  [0.82134069 0.84987581 0.86420576 0.93398567 0...   0.862182
3      (1, 4, 15, 176)  [0.83244816 0.86689598 0.87095624 0.9445071  0...   0.874641
4  (1, 4, 15, 89, 176)  [0.84192526 0.87788764 0.87939774 0.95181742 0...   0.883332

huangapple
  • Posted on June 19, 2023, 21:56:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76507337.html