2023-02-08 23:20:22

Remove non numeric rows from dataframe

Question

I have a dataframe of patients and their gene expressions. It has this format:

Patient_ID | gene1 | gene2 | ... | gene10000
    p1       0.142   0.233   ...      bla
    p2       0.243   0.243   ...    -0.364
    ...
    p4000    1.423    bla    ...    -1.222

As you can see, the dataframe contains noise: some cells hold values other than floats.

I want to remove every row that has a non-numeric value in any column.

I've managed to do this using apply and pd.to_numeric like this:

cols = df.columns[1:]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df = df.dropna()

The problem is that it takes forever to run, and I need a better, more efficient way of achieving this.
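For reference, here is a minimal sketch of what the coerce-then-drop approach above does on a tiny frame. The toy data is made up for illustration and is not the real dataset: `errors='coerce'` turns every unparseable cell into NaN, and `dropna()` then removes any row containing one.

```python
import pandas as pd

# Hypothetical toy data: three patients, two gene columns, two noisy cells
df = pd.DataFrame({
    "Patient_ID": ["p1", "p2", "p3"],
    "gene1": [0.142, 0.243, "bla"],
    "gene2": ["bla", 0.243, -1.222],
})

cols = df.columns[1:]
# Unparseable cells become NaN instead of raising an error
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")
# Drop every row that now contains at least one NaN
df = df.dropna()
print(df)
```

Only `p2` survives here, since `p1` and `p3` each have one non-numeric cell.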

EDIT: To reproduce something like my data:

arr = np.random.random_sample((3000,10000))
df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(10000)])
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
df['gene0'][2] = 'bla'
df['gene9998'][4] = 'bla'

Answer 1

Score: 2

It turned out to be worth trying NumPy.

I got a version that is 30-60x faster (the bigger the array, the larger the improvement).

Convert the DataFrame to a NumPy array (.values)
Iterate through all rows
Try to convert each row to a row of floats
If the conversion fails (the row contains a non-numeric value), record this in a boolean array
Filter the DataFrame using that boolean array

Code:

import numpy as np
import pandas as pd
from line_profiler_pycharm import profile


def op_version(df):
    # Original approach: coerce every gene column, then drop rows containing NaN
    cols = df.columns[1:]
    df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
    return df.dropna()


def np_version(df):
    keep = np.full(len(df), True)
    for idx, row in enumerate(df.values[:, 1:]):
        try:
            # np.float was removed in NumPy 1.24; the builtin float is equivalent
            row.astype(float)
        except (ValueError, TypeError):
            # maybe it's better to store a to_remove list, depends on the data
            keep[idx] = False
    return df[keep]


@profile
def main():
    arr = np.random.random_sample((3000, 5000))
    df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(5000)])
    df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)],
                                 columns=['Patient_ID']), df], axis=1)
    # Cast to object first so the float columns can hold a string
    df = df.astype({'gene0': object, 'gene998': object})
    df.loc[2, 'gene0'] = 'bla'
    df.loc[4, 'gene998'] = 'bla'
    df2 = df.copy()
    df = op_version(df)
    df2 = np_version(df2)


if __name__ == '__main__':
    main()

Note that I decreased the number of columns to make the test more feasible.

I also fixed a small bug in your example. Instead of:
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
I think it should be:
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)], columns=['Patient_ID']),df],axis = 1)
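As a quick sanity check, the row-wise conversion can be tried on a tiny frame first. This is only a sketch: the function name `drop_non_numeric_rows` and the toy data are made up for illustration, and it assumes the first column is the ID column, as in the question.

```python
import numpy as np
import pandas as pd


def drop_non_numeric_rows(df):
    """Keep rows whose non-ID cells all convert to float (first column is the ID)."""
    values = df.to_numpy()[:, 1:]          # mixed columns give an object array
    keep = np.ones(len(df), dtype=bool)
    for idx, row in enumerate(values):
        try:
            row.astype(float)              # raises if any cell is not float-like
        except (ValueError, TypeError):
            keep[idx] = False
    return df[keep]


# Hypothetical toy data: only p1 has a noisy cell
toy = pd.DataFrame({
    "Patient_ID": ["p1", "p2", "p3"],
    "gene1": [0.142, 0.243, 1.423],
    "gene2": ["bla", -0.364, -1.222],
})
result = drop_non_numeric_rows(toy)
print(result["Patient_ID"].tolist())
```

Two differences from the `to_numeric` + `dropna` version are worth keeping in mind: a cell that is already a float NaN converts without error, so such rows are kept, and the surviving columns keep their original dtype rather than being converted to floats.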
