June 19, 2023 21:56:33

pandas: convert string column to array of float

Question

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame(
    {'feature_idx': ['(4,)', '(4, 15)', '(1, 4, 15)', '(1, 4, 15, 176)', '(1, 4, 15, 89, 176)'],
     'cv_scores': ['[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]',
                   '[0.79227296 0.82675175 0.83723801 0.92502134 0.82625185]',
                   '[0.82134069 0.84987581 0.86420576 0.93398567 0.84150328]',
                   '[0.83244816 0.86689598 0.87095624 0.9445071  0.85839512]',
                   '[0.84192526 0.87788764 0.87939774 0.95181742 0.86563099]']}
)

df.head(2)
  feature_idx                                           cv_scores
0        (4,)  [0.71936929 0.75262699 0.77660679 0.85333625 0...
1     (4, 15)  [0.79227296 0.82675175 0.83723801 0.92502134 0...

The cv_scores column holds the 5-fold cross-validation scores as a string, for example:

>>> df.iloc[0]['cv_scores']
'[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]'

>>> type(df.iloc[0]['cv_scores'])
str

I would like to add an avg_score column holding the mean score for each feature set (the sum of the five values in cv_scores divided by 5).

Since cv_scores is a string, I need a way to convert the column to arrays of floats in order to derive that column.

Expected result

>>> df_result
           feature_idx                                          cv_scores  avg_score
0                 (4,)  [0.71936929 0.75262699 0.77660679 0.85333625 0...   0.773186
1              (4, 15)  [0.79227296 0.82675175 0.83723801 0.92502134 0...   0.841507
2           (1, 4, 15)  [0.82134069 0.84987581 0.86420576 0.93398567 0...   0.862182
3      (1, 4, 15, 176)  [0.83244816 0.86689598 0.87095624 0.9445071  0...   0.874641
4  (1, 4, 15, 89, 176)  [0.84192526 0.87788764 0.87939774 0.95181742 0...   0.883332

Answer 1

Score: 3

You can use numpy.fromstring to convert your strings:

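import numpy as np

def str_to_mean(s):
    # Strip the surrounding brackets, parse the space-separated floats, then average.
    return np.fromstring(s[1:-1], sep=' ').mean()

df['avg_score'] = df['cv_scores'].apply(str_to_mean)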
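
If the parsed values are needed beyond the mean, the arrays themselves can be kept in a column. The following is only a minimal sketch, assuming every string has the bracketed, whitespace-separated form shown in the question; `parse_scores` and `cv_array` are names made up here for illustration, and plain `str.split` is used in place of numpy.fromstring.

```python
import numpy as np
import pandas as pd

# Two rows copied from the question's data, enough to illustrate the idea.
df = pd.DataFrame({
    'cv_scores': ['[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]',
                  '[0.79227296 0.82675175 0.83723801 0.92502134 0.82625185]'],
})

def parse_scores(s):
    # Drop the surrounding brackets, split on any run of whitespace,
    # and convert the pieces to a float array.
    return np.array(s.strip('[]').split(), dtype=float)

# 'cv_array' is a hypothetical column name: one NumPy array per row.
df['cv_array'] = df['cv_scores'].apply(parse_scores)
df['avg_score'] = df['cv_array'].apply(np.mean)
```

Each cell of `cv_array` then holds a float array of length 5, so other statistics (standard deviation, minimum, and so on) can be derived without parsing the strings again.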

Alternatively, use str.extractall, pandas.to_numeric, and groupby.mean:

# The dot is escaped so that it matches only a literal decimal point.
df['avg_score'] = (pd.to_numeric(df['cv_scores'].str.extractall(r'(\d+\.?\d*)')[0],
                                 errors='coerce')
                     .groupby(level=0).mean()
                  )

Output:

           feature_idx                                          cv_scores  avg_score
0                 (4,)  [0.71936929 0.75262699 0.77660679 0.85333625 0...   0.773186
1              (4, 15)  [0.79227296 0.82675175 0.83723801 0.92502134 0...   0.841507
2           (1, 4, 15)  [0.82134069 0.84987581 0.86420576 0.93398567 0...   0.862182
3      (1, 4, 15, 176)  [0.83244816 0.86689598 0.87095624 0.9445071  0...   0.874641
4  (1, 4, 15, 89, 176)  [0.84192526 0.87788764 0.87939774 0.95181742 0...   0.883332
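
As a sanity check, the two approaches can be compared on the same data. This is a sketch rather than part of the original answer: it uses two rows from the question, writes the regex with the dot escaped, and the names `via_fromstring`, `via_regex`, and `agree` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cv_scores': ['[0.71936929 0.75262699 0.77660679 0.85333625 0.76398875]',
                  '[0.83244816 0.86689598 0.87095624 0.9445071  0.85839512]'],
})

# Approach 1: parse each string with numpy.fromstring in text mode.
via_fromstring = df['cv_scores'].apply(
    lambda s: np.fromstring(s[1:-1], sep=' ').mean()
)

# Approach 2: extract every number with a regex, convert, and average per row.
# extractall returns one row per match, indexed by (original row, match number),
# so grouping on level 0 averages the matches belonging to each original row.
matches = df['cv_scores'].str.extractall(r'(\d+\.?\d*)')[0]
via_regex = pd.to_numeric(matches, errors='coerce').groupby(level=0).mean()

# True when both approaches give the same means up to floating-point tolerance.
agree = np.allclose(via_fromstring, via_regex)
```

The second row contains a double space between values, which both approaches are expected to tolerate.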
