Reshaping and Cleaning a Tab-Delimited Data File with Pandas in Python
Question
I have a txt file in this format:
0.0 0 5 6.31000
5.29559 2.38176 0.51521 0.04454 0.00000
0 0 0 0 2
0.0 0 4 6.31000
4.32454 1.77600 0.04454 0.00000
0 0 0 2
0.0 0 2 6.31000
1.55590 0.00000
0 0
0.0 0 6 6.31000
5.37285 4.39339 3.56905 0.83230 0.04454 0.00000
0 0 0 0 0 2
0.0 0 3 6.31000
4.22062 1.60321 0.00000
0 0 0
I am trying to build a data frame by dropping lines 1, 4, 7, 10, and so on, and keeping from each line that follows only the values whose corresponding entry in the next line is 0.
For example, I want to keep 5.29559 because the corresponding value in the following line is 0, but I do not want 0.00000 because its corresponding value is 2, not 0. In short, I want a data frame that looks like this:
5.29559 2.38176 0.51521 0.04454
4.32454 1.77600 0.04454
1.55590 0.00000
5.37285 4.39339 3.56905 0.83230 0.04454
4.22062 1.60321 0.00000
So far I have tried:
import pandas as pd
filename = r"data.txt"
# This returns a dataframe with a single column
df = pd.read_table(filename, header=None)
# remove first row
df = df.drop(index=0)
df = df.drop(df.index[df.index % 3 == 0])
# remove spaces from beginning of all rows
df = df.applymap(lambda x: x.lstrip())
df.to_csv('file_with_quotes.txt', sep=' ', index=False)
with open('file_with_quotes.txt', 'r') as f1, open('file_without_quotes.txt', 'w') as f2:
    line_num = 0
    for line in f1:
        if line_num == 0:
            line_num += 1
            continue
        line = line.strip().replace('"', '')
        f2.write(line + '\n')
import os
#delete the file using os.remove()
os.remove('file_with_quotes.txt')
df_new = pd.read_table('file_without_quotes.txt', header=None)
df_new = df[0].str.split(" ", expand=True)
df_new = df_new.replace('', pd.np.nan)
# replace NaN values with empty string
df_new = df_new.fillna("")
# remove empty strings using applymap()
df_new = df_new.applymap(lambda x: x if x != "" else None)
(df_new.stack()
       .groupby(level=0)
       .apply(lambda df: df.reset_index(drop=True))
       .unstack())
But I still do not get the expected result and cannot figure out my mistake. Can someone please help solve the problem?
Answer 1
Score: 1
Assuming a data row never starts with 0 and the line after it acts as a mask:
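import pandas as pd

df = pd.read_table('data.txt', header=None)
m1 = df[0] > 0                   # data rows: the first value is positive
m2 = m1.shift(fill_value=False)  # mask rows: the line right after each data row
# keep a value only where the mask row holds 0 in the same column
out = df[m1].where(df[m2].eq(0).set_index(df[m1].index))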
Output:
>>> out
          0        1        2        3        4   5
1   5.29559  2.38176  0.51521  0.04454      NaN NaN
4   4.32454  1.77600  0.04454      NaN      NaN NaN
7   1.55590  0.00000      NaN      NaN      NaN NaN
10  5.37285  4.39339  3.56905  0.83230  0.04454 NaN
13  4.22062  1.60321  0.00000      NaN      NaN NaN
>>> out.fillna('')
          0        1        2        3        4 5
1   5.29559  2.38176  0.51521  0.04454
4   4.32454  1.77600  0.04454
7   1.55590  0.00000
10  5.37285  4.39339  3.56905   0.8323  0.04454
13  4.22062  1.60321      0.0
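To see what the two masks select, the steps can be traced on a shortened sample. This is a minimal sketch, assuming tab-separated input padded to the same number of fields on every line; the `sample` string and the names `data`, `flags`, and `keep` are illustrative and do not come from the answer itself.

```python
import io
import pandas as pd

# Two 3-line blocks (header, data, flags), tab-separated and padded to 5 fields.
sample = (
    "0.0\t0\t5\t6.31000\t\n"
    "5.29559\t2.38176\t0.51521\t0.04454\t0.00000\n"
    "0\t0\t0\t0\t2\n"
    "0.0\t0\t2\t6.31000\t\n"
    "1.55590\t0.00000\t\t\t\n"
    "0\t0\t\t\t\n"
)
df = pd.read_table(io.StringIO(sample), header=None)

m1 = df[0] > 0                   # True on rows 1 and 4 (the data rows)
m2 = m1.shift(fill_value=False)  # True on rows 2 and 5 (the flag rows)

data = df[m1]
flags = df[m2]
# Relabel the flag rows with the data rows' index so that where() aligns them.
keep = flags.eq(0).set_index(data.index)
out = data.where(keep)           # values with a non-zero or missing flag become NaN

print(m1.tolist())
print(out)
```

The relabeling step matters because `where` aligns on index labels: without it, the data rows (1, 4) and the flag rows (2, 5) would share no labels.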
Update
Since the file repeats in blocks of three lines and you skip the first line of each block, you can also select the rows by position:
out = df.iloc[1::3].where(df.iloc[2::3].eq(0).set_index(df.iloc[1::3].index))
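The positional version relies on the file being an exact repetition of three-line blocks, whereas the boolean version relies on data rows starting with a positive value. A quick way to gain confidence is to compare both on the same input. The sketch below assumes a shortened, tab-separated sample; the `sample` string and the names `by_position` and `by_mask` are illustrative only.

```python
import io
import pandas as pd

# Two 3-line blocks (header, data, flags), tab-separated and padded to 5 fields.
sample = (
    "0.0\t0\t5\t6.31000\t\n"
    "5.29559\t2.38176\t0.51521\t0.04454\t0.00000\n"
    "0\t0\t0\t0\t2\n"
    "0.0\t0\t2\t6.31000\t\n"
    "1.55590\t0.00000\t\t\t\n"
    "0\t0\t\t\t\n"
)
df = pd.read_table(io.StringIO(sample), header=None)

# Positional slicing only makes sense if the row count is a multiple of 3.
assert len(df) % 3 == 0, "expected complete (header, data, flags) blocks"

# Positional selection: rows 1, 4, ... are data; rows 2, 5, ... are flags.
data = df.iloc[1::3]
flags = df.iloc[2::3]
by_position = data.where(flags.eq(0).set_index(data.index))

# Boolean selection, as in the first version of the answer.
m1 = df[0] > 0
m2 = m1.shift(fill_value=False)
by_mask = df[m1].where(df[m2].eq(0).set_index(df[m1].index))

# DataFrame.equals treats NaNs in matching positions as equal.
same = by_position.equals(by_mask)
print(same)
```

If the two results ever differ on real data, that usually points to a missing or extra line somewhere in the file, which the positional version cannot detect on its own.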
To reproduce this, first write the sample data to a file:
s = '0.0\t0\t5\t6.31000\t\t\n5.29559\t2.38176\t0.51521\t0.04454\t0.00000\t\n0\t0\t0\t0\t2\t\n0.0\t0\t4\t6.31000\t\t\n4.32454\t1.77600\t0.04454\t0.00000\t\t\n0\t0\t0\t2\t\t\n0.0\t0\t2\t6.31000\t\t\n1.55590\t0.00000\t\t\t\t\n0\t0\t\t\t\t\n0.0\t0\t6\t6.31000\t\t\n5.37285\t4.39339\t3.56905\t0.83230\t0.04454\t0.00000\n0\t0\t0\t0\t0\t2\n0.0\t0\t3\t6.31000\t\t\n4.22062\t1.60321\t0.00000\t\t\t\n0\t0\t0\t\t\t'
with open('data.txt', 'w') as fp:
    print(s, file=fp)
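If the real file is separated by spaces rather than tabs (the question mentions that `read_table` returned a single column), a plain-Python pass over the lines is one alternative to the pandas approach. This is only a sketch under that assumption, not part of the answer above; `keep_unflagged` is a hypothetical helper name, and the values are kept as strings so their original formatting survives.

```python
def keep_unflagged(lines):
    """For each (header, data, flags) block, keep the data values whose flag is 0."""
    # split() with no argument handles any run of spaces or tabs.
    rows = [line.split() for line in lines if line.strip()]
    result = []
    for i in range(0, len(rows) - 2, 3):
        data, flags = rows[i + 1], rows[i + 2]
        result.append([value for value, flag in zip(data, flags) if float(flag) == 0])
    return result


# Shortened, space-separated sample taken from the question.
sample = """0.0 0 5 6.31000
5.29559 2.38176 0.51521 0.04454 0.00000
0 0 0 0 2
0.0 0 2 6.31000
1.55590 0.00000
0 0
""".splitlines()

kept = keep_unflagged(sample)
for row in kept:
    print(" ".join(row))
```

The result is a list of ragged rows, which matches the shape of the desired output in the question; passing it to `pd.DataFrame(kept)` would pad the shorter rows with missing values if a data frame is still needed.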
</details>