Remove non numeric rows from dataframe
Question
I have a dataframe of patients and their gene expressions. It has this format:
Patient_ID | gene1 | gene2 | ... | gene10000
p1         | 0.142 | 0.233 | ... | bla
p2         | 0.243 | 0.243 | ... | -0.364
...
p4000      | 1.423 | bla   | ... | -1.222
As you can see, the dataframe contains noise: cells whose values are not floats.
I want to remove every row that has any column with a non-numeric value.
I've managed to do this using apply
and pd.to_numeric
like this:
cols = df.columns[1:]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df = df.dropna()
The problem is that it takes forever to run, and I need a better, more efficient way of achieving this.
EDIT: To reproduce something like my data:
import numpy as np
import pandas as pd

arr = np.random.random_sample((3000, 10000))
df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(10000)])
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
df['gene0'][2] = 'bla'
df['gene9998'][4] = 'bla'
Answer 1
Score: 2
It was worth trying numpy.
I got a version that is 30-60x faster (the bigger the array, the larger the improvement).
- Convert to a numpy array (.values)
- Iterate through all rows
- Try to convert each row to a row of floats
- If the conversion fails (a non-numeric value is present), record this in a boolean mask
- Build the result by indexing the dataframe with the mask
Code:
import pandas as pd
import numpy as np
from line_profiler_pycharm import profile
def op_version(df):
    cols = df.columns[1:]
    df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
    return df.dropna()

def np_version(df):
    keep = np.full(len(df), True)
    for idx, row in enumerate(df.values[:, 1:]):
        try:
            row.astype(np.float64)  # np.float is removed in newer numpy versions
        except ValueError:
            keep[idx] = False  # maybe it's better to store a to_remove list, depends on data
    return df[keep]

@profile
def main():
    arr = np.random.random_sample((3000, 5000))
    df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(5000)])
    df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)],
                                 columns=['Patient_ID']), df], axis=1)
    df.loc[2, 'gene0'] = 'bla'     # .loc avoids chained-assignment issues
    df.loc[4, 'gene998'] = 'bla'

    df2 = df.copy()
    df = op_version(df)
    df2 = np_version(df2)
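As a quick sanity check (my own sketch, not part of the original answer), the two versions can be compared on a small frame to confirm they keep exactly the same rows:

```python
import numpy as np
import pandas as pd

def op_version(df):
    cols = df.columns[1:]
    df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
    return df.dropna()

def np_version(df):
    keep = np.full(len(df), True)
    for idx, row in enumerate(df.values[:, 1:]):
        try:
            row.astype(np.float64)
        except ValueError:
            keep[idx] = False
    return df[keep]

# Small frame with two corrupted cells; rows 2 and 4 should be dropped
arr = np.random.random_sample((10, 5)).astype(object)
df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(5)])
df.insert(0, 'Patient_ID', ['p' + str(i) for i in range(10)])
df.loc[2, 'gene0'] = 'bla'
df.loc[4, 'gene3'] = 'bla'

a = op_version(df.copy())
b = np_version(df.copy())
assert list(a['Patient_ID']) == list(b['Patient_ID'])
assert len(a) == 8
```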
Note that I decreased the number of columns so it is more feasible for tests.
Also, I fixed a small bug in your example. Instead of:
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
I think it should be:
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)], columns=['Patient_ID']),df],axis = 1)
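A further, hypothetical variant (my own sketch, not benchmarked against the answer above) avoids the per-row try/except entirely by coercing all gene cells in one pd.to_numeric call. Since pd.to_numeric only accepts 1-D input, the gene block is flattened first and reshaped afterwards:

```python
import numpy as np
import pandas as pd

def vectorized_version(df):
    # Flatten the gene block, coerce non-numeric cells to NaN in one call,
    # then reshape back to (rows, genes) and keep rows with no NaN
    flat = pd.to_numeric(df.iloc[:, 1:].to_numpy().ravel(), errors='coerce')
    keep = ~np.isnan(flat.reshape(len(df), -1)).any(axis=1)
    return df[keep]
```

Like np_version, this keeps a row only if every gene cell parses as a float.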