Remove non-numeric rows from a DataFrame

Question

I have a dataframe of patients and their gene expressions. It has this format:

Patient_ID | gene1 | gene2 | ... | gene10000
    p1       0.142   0.233   ...      bla
    p2       0.243   0.243   ...    -0.364
    ...
    p4000    1.423    bla    ...    -1.222

As you can see, the dataframe contains noise: some cells hold values other than floats.

I want to remove every row that has a non-numeric value in any column.

I've managed to do this using apply and pd.to_numeric like this:

cols = df.columns[1:]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df = df.dropna()

The problem is that it takes forever to run, and I need a more efficient way of achieving this.
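To make the behaviour of the snippet above concrete, here is a minimal sketch on a tiny frame. The three rows and their values are made up for illustration and are not the real data. `pd.to_numeric` with `errors='coerce'` turns cells it cannot parse into NaN, and `dropna()` then drops every row containing one.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this example.
df_small = pd.DataFrame({
    "Patient_ID": ["p1", "p2", "p3"],
    "gene1": [0.142, 0.243, 1.423],
    "gene2": [0.233, 0.243, "bla"],
    "gene3": ["bla", -0.364, -1.222],
})

# Same steps as in the question: coerce every gene column, then drop NaN rows.
cols = df_small.columns[1:]
df_small[cols] = df_small[cols].apply(pd.to_numeric, errors="coerce")
clean = df_small.dropna()

print(clean)  # only the row for p2 has no unparseable cell
```

Note that `dropna()` also removes rows that contained a genuine NaN before the conversion, which may or may not be what is wanted.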

EDIT: To reproduce something like my data:

arr = np.random.random_sample((3000,10000))
df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(10000)])
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
df['gene0'][2] = 'bla'
df['gene9998'][4] = 'bla'

Answer 1

Score: 2

It turned out to be worth trying NumPy.

I got a version that is 30-60x faster (the bigger the array, the larger the improvement).

  1. Convert to a NumPy array (.values)
  2. Iterate through all rows
  3. Try to convert each row to a row of floats
  4. If the conversion fails (the row holds a non-numeric value), mark the row in a boolean array
  5. Select the rows to keep using that boolean array

Code:

import pandas as pd
import numpy as np
from line_profiler_pycharm import profile


def op_version(df):
    cols = df.columns[1:]
    df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
    return df.dropna()


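def np_version(df):
    # Boolean mask: True means the row is kept
    keep = np.full(len(df), True)
    for idx, row in enumerate(df.values[:, 1:]):
        try:
            # np.float was removed in NumPy 1.24; the builtin float is equivalent
            row.astype(float)
        except (ValueError, TypeError):
            # maybe it's better to store a to_remove list, depends on data
            keep[idx] = False
    return df[keep]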


@profile
def main():
    arr = np.random.random_sample((3000, 5000))
    df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(5000)])
    df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)],
                                 columns=['Patient_ID']), df], axis=1)
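    # Cast the two target columns to object so they can hold a string,
    # and use .loc instead of chained indexing
    df = df.astype({'gene0': object, 'gene998': object})
    df.loc[2, 'gene0'] = 'bla'
    df.loc[4, 'gene998'] = 'bla'
    df2 = df.copy()

    df = op_version(df)
    df2 = np_version(df2)


if __name__ == '__main__':
    main()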

Note that I decreased the number of columns (5000 instead of 10000) to make testing more feasible.
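As a quick sanity check, the two approaches can be compared on a small frame. The sketch below is not the benchmark above: the function names `coerce_version` and `rowwise_version`, the 200 x 50 size, and the use of `time.perf_counter` instead of `line_profiler_pycharm` are all assumptions made so that it runs without extra packages.

```python
import time

import numpy as np
import pandas as pd


def coerce_version(df):
    # Question's approach: coerce column by column, then drop rows with NaN.
    out = df.copy()
    cols = out.columns[1:]
    out[cols] = out[cols].apply(pd.to_numeric, errors="coerce")
    return out.dropna()


def rowwise_version(df):
    # Answer's approach: try to cast each row to float, keep rows that succeed.
    keep = np.full(len(df), True)
    for idx, row in enumerate(df.to_numpy()[:, 1:]):
        try:
            row.astype(float)
        except (ValueError, TypeError):
            keep[idx] = False
    return df[keep]


# Small made-up frame; the sizes are arbitrary and chosen to run quickly.
rng = np.random.default_rng(0)
genes = pd.DataFrame(rng.random((200, 50)),
                     columns=[f"gene{i}" for i in range(50)]).astype(object)
genes.loc[2, "gene0"] = "bla"
genes.loc[4, "gene48"] = "bla"
ids = pd.DataFrame({"Patient_ID": [f"p{i}" for i in range(200)]})
df_check = pd.concat([ids, genes], axis=1)

for fn in (coerce_version, rowwise_version):
    start = time.perf_counter()
    result = fn(df_check)
    elapsed = time.perf_counter() - start
    print(fn.__name__, len(result), f"{elapsed:.4f}s")
```

Two differences are worth keeping in mind. The row-wise version only filters, so the columns it returns keep their original object dtype and may still need an `astype(float)` afterwards. It also keeps rows that contain a real NaN, because NaN converts to float without error, whereas `dropna()` removes them.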

I also fixed a small bug in your example. Since arr has only 3000 rows, the Patient_ID column should have 3000 entries as well, so instead of:
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
I think it should be:
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)], columns=['Patient_ID']),df],axis = 1)
