pandas: convert string column to array of float

Question

I have the following DataFrame:

    import pandas as pd

    df = pd.DataFrame({
        'feature_idx': ['(4,)', '(4, 15)', '(1, 4, 15)',
                        '(1, 4, 15, 176)', '(1, 4, 15, 89, 176)'],
        'cv_scores': ['[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]',
                      '[0.79227296 0.82675175 0.83723801 0.92502134 0.82625185]',
                      '[0.82134069 0.84987581 0.86420576 0.93398567 0.84150328]',
                      '[0.83244816 0.86689598 0.87095624 0.9445071 0.85839512]',
                      '[0.84192526 0.87788764 0.87939774 0.95181742 0.86563099]'],
    })

    >>> df.head(2)
       feature_idx                                          cv_scores
    0         (4,)  [0.71936929 0.75262699 0.77660679 0.85333625 0...
    1      (4, 15)  [0.79227296 0.82675175 0.83723801 0.92502134 0...

The cv_scores column holds the 5-fold cross-validation scores as a single string per row. For example:

    >>> sub_df.iloc[0]['cv_scores']
    '[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]'
    >>> type(sub_df.iloc[0]['cv_scores'])
    str

I would like to add a column avg_score holding the average score for each feature set, i.e. the sum of the five values in cv_scores divided by 5.

Since cv_scores is a string, I need a way to convert this column to an array of floats in order to derive the new column.

Expected results

    >>> df_result
               feature_idx                                          cv_scores  avg_score
    0                 (4,)  [0.71936929 0.75262699 0.77660679 0.85333625 0...   0.773186
    1              (4, 15)  [0.79227296 0.82675175 0.83723801 0.92502134 0...   0.841507
    2           (1, 4, 15)  [0.82134069 0.84987581 0.86420576 0.93398567 0...   0.862182
    3      (1, 4, 15, 176)   [0.83244816 0.86689598 0.87095624 0.9445071 0...   0.874641
    4  (1, 4, 15, 89, 176)  [0.84192526 0.87788764 0.87939774 0.95181742 0...   0.883332

Answer 1

Score: 3

You can use numpy.fromstring to convert your strings:

    import numpy as np

    def str_to_mean(s):
        # Drop the surrounding brackets, parse the space-separated floats, average them
        return np.fromstring(s[1:-1], sep=' ').mean()

    df['avg_score'] = df['cv_scores'].apply(str_to_mean)
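If the parsed arrays themselves are needed (the title asks for an array of floats, not only the mean), one possible variation is to store them in their own column first. This is only a sketch: it swaps np.fromstring for str.split plus np.array, uses a two-row subset of the data, and the names parse_scores, cv_scores_arr and std_score are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two-row subset of the question's data, enough to show the idea
df = pd.DataFrame({
    "feature_idx": ["(4,)", "(4, 15)"],
    "cv_scores": [
        "[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]",
        "[0.79227296 0.82675175 0.83723801 0.92502134 0.82625185]",
    ],
})

def parse_scores(s):
    """Turn a string like '[0.1 0.2 0.3]' into a 1-D float array."""
    # Strip the brackets, split on whitespace, let NumPy convert to float
    return np.array(s.strip("[]").split(), dtype=float)

# Hypothetical column name; each cell holds one NumPy array
df["cv_scores_arr"] = df["cv_scores"].apply(parse_scores)

# Any per-row statistic can then be derived from the stored arrays
df["avg_score"] = df["cv_scores_arr"].apply(lambda a: a.mean())
df["std_score"] = df["cv_scores_arr"].apply(lambda a: a.std())
```

Keeping the arrays around costs an object-dtype column, but it avoids re-parsing the strings for every additional statistic.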

Alternative with str.extractall, pandas.to_numeric, and groupby.mean:

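    # Extract every number (note the escaped dot), convert to float,
    # then average per original row (level 0 of the extractall index)
    df['avg_score'] = (
        pd.to_numeric(df['cv_scores'].str.extractall(r'(\d+\.?\d*)')[0],
                      errors='coerce')
        .groupby(level=0).mean()
    )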
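Another pandas-only route, not part of the original answer, is to split each string into one column per fold and take a row-wise mean. The sketch below assumes every row holds plain decimal numbers separated by whitespace. It uses a two-row subset of the data, and the name folds is made up for illustration.

```python
import pandas as pd

# Two-row subset of the question's data
df = pd.DataFrame({
    "feature_idx": ["(4,)", "(4, 15)"],
    "cv_scores": [
        "[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]",
        "[0.79227296 0.82675175 0.83723801 0.92502134 0.82625185]",
    ],
})

# One float column per fold: strip brackets, split on whitespace, cast
folds = (
    df["cv_scores"]
    .str.strip("[]")
    .str.split(expand=True)
    .astype(float)
)

# Row-wise mean across the fold columns
df["avg_score"] = folds.mean(axis=1)
```

The intermediate folds frame has one row per original row and one column per fold, which makes it easy to inspect individual fold scores before averaging.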

Output:

               feature_idx                                          cv_scores  avg_score
    0                 (4,)  [0.71936929 0.75262699 0.77660679 0.85333625 0...   0.773186
    1              (4, 15)  [0.79227296 0.82675175 0.83723801 0.92502134 0...   0.841507
    2           (1, 4, 15)  [0.82134069 0.84987581 0.86420576 0.93398567 0...   0.862182
    3      (1, 4, 15, 176)   [0.83244816 0.86689598 0.87095624 0.9445071 0...   0.874641
    4  (1, 4, 15, 89, 176)  [0.84192526 0.87788764 0.87939774 0.95181742 0...   0.883332
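As a quick sanity check on the first row of this output, the average can be recomputed by hand from the five values given in the question. The variable names here are illustrative only.

```python
# The five fold scores of the first row, copied from the question
first_row = [0.71936929, 0.75262699, 0.77660679, 0.85333625, 0.76398875]

# Plain arithmetic mean: sum of the scores divided by the number of folds
avg = sum(first_row) / len(first_row)

print(round(avg, 6))  # 0.773186
```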

huangapple
  • Posted on June 19, 2023 at 21:56:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76507337.html