Reshaping and Cleaning a Tab-Delimited Data File with Pandas in Python

Question

I have a txt file in this format:

  1. 0.0 0 5 6.31000
  2. 5.29559 2.38176 0.51521 0.04454 0.00000
  3. 0 0 0 0 2
  4. 0.0 0 4 6.31000
  5. 4.32454 1.77600 0.04454 0.00000
  6. 0 0 0 2
  7. 0.0 0 2 6.31000
  8. 1.55590 0.00000
  9. 0 0
  10. 0.0 0 6 6.31000
  11. 5.37285 4.39339 3.56905 0.83230 0.04454 0.00000
  12. 0 0 0 0 0 2
  13. 0.0 0 3 6.31000
  14. 4.22062 1.60321 0.00000
  15. 0 0 0

I am trying to create a data frame by removing lines 1, 4, 7, 10, and so on, and keeping from each remaining line only the values whose corresponding entry in the line after it is 0.

For example, I want to keep 5.29559 because the corresponding value in the following line is 0, but I do not want the trailing 0.00000 because its corresponding value is 2, not 0. In short, I want to create a data frame that looks like this:

  1. 5.29559 2.38176 0.51521 0.04454
  2. 4.32454 1.77600 0.04454
  3. 1.55590 0.00000
  4. 5.37285 4.39339 3.56905 0.83230 0.04454
  5. 4.22062 1.60321 0.00000
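
To make the selection rule concrete, here is a minimal plain-Python sketch of it for a single pair of lines (the names `values` and `mask` are only for illustration, not part of my actual code):

    values = [5.29559, 2.38176, 0.51521, 0.04454, 0.00000]   # line 2 of the file
    mask = [0, 0, 0, 0, 2]                                   # line 3 of the file
    # keep each value whose entry at the same position in the mask line is 0
    kept = [v for v, m in zip(values, mask) if m == 0]
    print(kept)   # [5.29559, 2.38176, 0.51521, 0.04454]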

So far I have tried:

    import pandas as pd
    import numpy as np
    import os

    filename = r"data.txt"
    # This returns a dataframe with a single column
    df = pd.read_table(filename, header=None)
    # drop row 0 and the remaining rows whose index is a multiple of 3
    df = df.drop(index=0)
    df = df.drop(df.index[df.index % 3 == 0])
    # remove spaces from the beginning of all rows
    df = df.applymap(lambda x: x.lstrip())
    df.to_csv('file_with_quotes.txt', sep=' ', index=False)

    # rewrite the file without the quotes added by to_csv
    with open('file_with_quotes.txt', 'r') as f1, open('file_without_quotes.txt', 'w') as f2:
        line_num = 0
        for line in f1:
            if line_num == 0:  # skip the header line written by to_csv
                line_num += 1
                continue
            line = line.strip().replace('"', '')
            f2.write(line + '\n')

    # delete the intermediate file using os.remove()
    os.remove('file_with_quotes.txt')

    df_new = pd.read_table('file_without_quotes.txt', header=None)
    df_new = df[0].str.split(" ", expand=True)
    df_new = df_new.replace('', np.nan)
    # replace NaN values with empty strings
    df_new = df_new.fillna("")
    # remove empty strings using applymap()
    df_new = df_new.applymap(lambda x: x if x != "" else None)
    # left-align the remaining values in each row
    (df_new.stack()
           .groupby(level=0)
           .apply(lambda s: s.reset_index(drop=True))
           .unstack())

But I still do not see the expected result and cannot figure out my mistake. Can someone please help me solve this?

Answer 1

Score: 1

Assuming a wanted row doesn't start with 0 and the following line acts as a mask:

    import pandas as pd

    df = pd.read_table('data.txt', header=None)
    m1 = df[0] > 0                    # value rows: first column is positive
    m2 = m1.shift(fill_value=False)   # the rows right after them: the 0/2 mask rows
    out = df[m1].where(df[m2].eq(0).set_index(df[m1].index))

Output:

    >>> out
              0        1        2        3        4   5
    1   5.29559  2.38176  0.51521  0.04454      NaN NaN
    4   4.32454  1.77600  0.04454      NaN      NaN NaN
    7   1.55590  0.00000      NaN      NaN      NaN NaN
    10  5.37285  4.39339  3.56905  0.83230  0.04454 NaN
    13  4.22062  1.60321  0.00000      NaN      NaN NaN
    >>> out.fillna('')
              0        1        2        3        4  5
    1   5.29559  2.38176  0.51521  0.04454
    4   4.32454  1.77600  0.04454
    7   1.55590  0.00000
    10  5.37285  4.39339  3.56905   0.8323  0.04454
    13  4.22062  1.60321      0.0
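
As a quick sanity check (not part of the original answer), you can print which rows the two masks select on the 15-line sample above; `m1` marks the value rows and `m2`, its shifted copy, marks the 0/2 rows right after them:

    df = pd.read_table('data.txt', header=None)
    m1 = df[0] > 0
    m2 = m1.shift(fill_value=False)
    print(df[m1].index.tolist())   # [1, 4, 7, 10, 13] -> the value rows
    print(df[m2].index.tolist())   # [2, 5, 8, 11, 14] -> the 0/2 mask rows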

Update

Since you skip the first row, you can also use indexing by position:

    out = df.iloc[1::3].where(df.iloc[2::3].eq(0).set_index(df.iloc[1::3].index))
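
If you prefer to drop the header lines while reading instead of afterwards, `read_table` also accepts a callable `skiprows`; a sketch equivalent to the positional version above:

    # skip every 3rd line (0, 3, 6, ...), i.e. the "0.0 0 n 6.31000" header lines
    df = pd.read_table('data.txt', header=None, skiprows=lambda i: i % 3 == 0)
    values = df.iloc[0::2].reset_index(drop=True)   # the value rows
    masks = df.iloc[1::2].reset_index(drop=True)    # the 0/2 mask rows
    out = values.where(masks.eq(0))

The result is the same frame, only reindexed from 0.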

To make this reproducible:

    s = '0.0\t0\t5\t6.31000\t\t\n5.29559\t2.38176\t0.51521\t0.04454\t0.00000\t\n0\t0\t0\t0\t2\t\n0.0\t0\t4\t6.31000\t\t\n4.32454\t1.77600\t0.04454\t0.00000\t\t\n0\t0\t0\t2\t\t\n0.0\t0\t2\t6.31000\t\t\n1.55590\t0.00000\t\t\t\t\n0\t0\t\t\t\t\n0.0\t0\t6\t6.31000\t\t\n5.37285\t4.39339\t3.56905\t0.83230\t0.04454\t0.00000\n0\t0\t0\t0\t0\t2\n0.0\t0\t3\t6.31000\t\t\n4.22062\t1.60321\t0.00000\t\t\t\n0\t0\t0\t\t\t'
    with open('data.txt', 'w') as fp:
        print(s, file=fp)
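
Reading the file back should then confirm the shape used above (a quick check, assuming the snippet ran as shown):

    df = pd.read_table('data.txt', header=None)
    print(df.shape)   # expected: (15, 6)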
