用Pandas在Python中重塑和清理制表符分隔的数据文件

huangapple go评论75阅读模式
英文:

Reshaping and Cleaning a Tab-Delimited Data File with Pandas in Python

问题

我试过了但没有得到预期的结果,我无法找出我的错误,有人能帮忙解决这个问题吗?

英文:

I have a txt file which has this format-

0.0  0    5    6.31000
      5.29559    2.38176    0.51521    0.04454    0.00000
            0          0          0          0          2
  0.0  0    4    6.31000
      4.32454    1.77600    0.04454    0.00000
            0          0          0          2
  0.0  0    2    6.31000
      1.55590    0.00000
            0          0
  0.0  0    6    6.31000
      5.37285    4.39339    3.56905    0.83230    0.04454    0.00000
            0          0          0          0          0          2
  0.0  0    3    6.31000
      4.22062    1.60321    0.00000
            0          0          0

I am trying to create a data frame by removing lines 1,4,7,10 so on..., and only taking values from the following line which has the corresponding value 0 in the following line.

For example, I will want to store 5.29559 because in the following line, the value corresponding is 0 but I do not want 0.00000 because the corresponding value to it is 2 and no 0. In short, I want to create a data frame which will look like this-


  5.29559    2.38176    0.51521    0.04454    
  
  4.32454    1.77600    0.04454    
            
  1.55590    0.00000
   
  5.37285    4.39339    3.56905    0.83230    0.04454   
           
  4.22062    1.60321    0.00000
 

So far I have tried-

import pandas as pd

filename = r"data.txt"

# This returns a dataframe with a single column
df = pd.read_table(filename, header=None)
# remove first row
df = df.drop(index=0)                                                                                                                                                                                            df = df.drop(df.index[df.index % 3 == 0])
# remove spaces from beginning of all rows
df = df.applymap(lambda x: x.lstrip())
df.to_csv('file_with_quotes.txt', sep=' ', index=False)
with open('file_with_quotes.txt', 'r') as f1, open('file_without_quotes.txt', 'w') as f2:
    line_num = 0
    for line in f1:
        if line_num == 0:
            line_num += 1
            continue
        line = line.strip().replace('"', '')
        f2.write(line + '\n')
import os
#delete the file using os.remove()
os.remove('file_with_quotes.txt')
df_new = pd.read_table('file_without_quotes.txt', header=None)
df_new = df[0].str.split("  ", expand=True)

df_new = df_new.replace('', pd.np.nan)
# replace NaN values with empty string
df_new = df_new.fillna("")

# remove empty strings using applymap()
df_new = df_new.applymap(lambda x: x if x != "" else None)

(df_new.stack()
   .groupby(level=0)
   .apply(lambda df: df.reset_index(drop=True))
   .unstack())

But I still do not see the expected result and am not able to figure out my mistake, can someone please help solve the problem?

答案1

得分: 1

Assuming a right row doesn't start by 0 and the following line acts as a mask:

df = pd.read_table('data.txt', header=None)
m1 = df[0] > 0
m2 = m1.shift(fill_value=False)
out = df[m1].where(df[m2].eq(0).set_index(df[m1].index))

Output:

>>> out
          0        1        2        3        4   5
1   5.29559  2.38176  0.51521  0.04454      NaN NaN
4   4.32454  1.77600  0.04454      NaN      NaN NaN
7   1.55590  0.00000      NaN      NaN      NaN NaN
10  5.37285  4.39339  3.56905  0.83230  0.04454 NaN
13  4.22062  1.60321  0.00000      NaN      NaN NaN

>>> out.fillna('')
          0        1        2        3        4 5
1   5.29559  2.38176  0.51521  0.04454           
4   4.32454  1.77600  0.04454                    
7   1.55590  0.00000                             
10  5.37285  4.39339  3.56905   0.8323  0.04454  
13  4.22062  1.60321      0.0                    

Update

Since you skip the first row, you can also use indexing by position:

out = df.iloc[1::3].where(df.iloc[2::3].eq(0).set_index(df.iloc[1::3].index))

To be reproducible:

s = '0.0\t0\t5\t6.31000\t\t\n5.29559\t2.38176\t0.51521\t0.04454\t0.00000\t\n0\t0\t0\t0\t2\t\n0.0\t0\t4\t6.31000\t\t\n4.32454\t1.77600\t0.04454\t0.00000\t\t\n0\t0\t0\t2\t\t\n0.0\t0\t2\t6.31000\t\t\n1.55590\t0.00000\t\t\t\t\n0\t0\t\t\t\t\n0.0\t0\t6\t6.31000\t\t\n5.37285\t4.39339\t3.56905\t0.83230\t0.04454\t0.00000\n0\t0\t0\t0\t0\t2\n0.0\t0\t3\t6.31000\t\t\n4.22062\t1.60321\t0.00000\t\t\t\n0\t0\t0\t\t\t'

with open('data.txt', 'w') as fp:
    print(s, file=fp)
英文:

Assuming a right row doesn't start by 0 and the following line act as a mask:

df = pd.read_table('data.txt', header=None)
m1 = df[0] > 0
m2 = m1.shift(fill_value=False)
out = df[m1].where(df[m2].eq(0).set_index(df[m1].index))

Output:

>>> out
          0        1        2        3        4   5
1   5.29559  2.38176  0.51521  0.04454      NaN NaN
4   4.32454  1.77600  0.04454      NaN      NaN NaN
7   1.55590  0.00000      NaN      NaN      NaN NaN
10  5.37285  4.39339  3.56905  0.83230  0.04454 NaN
13  4.22062  1.60321  0.00000      NaN      NaN NaN

>>> out.fillna('')
          0        1        2        3        4 5
1   5.29559  2.38176  0.51521  0.04454           
4   4.32454  1.77600  0.04454                    
7   1.55590  0.00000                             
10  5.37285  4.39339  3.56905   0.8323  0.04454  
13  4.22062  1.60321      0.0                    

Update

Since you skip the first row, you can also use indexing by position:

out = df.iloc[1::3].where(df.iloc[2::3].eq(0).set_index(df.iloc[1::3].index))

To be reproducible:

s = '0.0\t0\t5\t6.31000\t\t\n5.29559\t2.38176\t0.51521\t0.04454\t0.00000\t\n0\t0\t0\t0\t2\t\n0.0\t0\t4\t6.31000\t\t\n4.32454\t1.77600\t0.04454\t0.00000\t\t\n0\t0\t0\t2\t\t\n0.0\t0\t2\t6.31000\t\t\n1.55590\t0.00000\t\t\t\t\n0\t0\t\t\t\t\n0.0\t0\t6\t6.31000\t\t\n5.37285\t4.39339\t3.56905\t0.83230\t0.04454\t0.00000\n0\t0\t0\t0\t0\t2\n0.0\t0\t3\t6.31000\t\t\n4.22062\t1.60321\t0.00000\t\t\t\n0\t0\t0\t\t\t'

with open('data.txt', 'w') as fp:
    print(s, file=fp)

</details>



huangapple
  • 本文由 发表于 2023年4月17日 20:01:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76034946.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定