Reshaping and Cleaning a Tab-Delimited Data File with Pandas in Python

Question

I have a txt file in this format:

  1. 0.0 0 5 6.31000
  2. 5.29559 2.38176 0.51521 0.04454 0.00000
  3. 0 0 0 0 2
  4. 0.0 0 4 6.31000
  5. 4.32454 1.77600 0.04454 0.00000
  6. 0 0 0 2
  7. 0.0 0 2 6.31000
  8. 1.55590 0.00000
  9. 0 0
  10. 0.0 0 6 6.31000
  11. 5.37285 4.39339 3.56905 0.83230 0.04454 0.00000
  12. 0 0 0 0 0 2
  13. 0.0 0 3 6.31000
  14. 4.22062 1.60321 0.00000
  15. 0 0 0

I am trying to create a data frame by removing lines 1, 4, 7, 10, and so on, and keeping from each remaining line only the values whose corresponding entry in the line after it is 0.

For example, I want to keep 5.29559 because the corresponding value in the following line is 0, but I do not want the trailing 0.00000 because its corresponding value is 2, not 0. In short, I want to create a data frame that looks like this:

  1. 5.29559 2.38176 0.51521 0.04454
  2. 4.32454 1.77600 0.04454
  3. 1.55590 0.00000
  4. 5.37285 4.39339 3.56905 0.83230 0.04454
  5. 4.22062 1.60321 0.00000
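
To make the selection rule concrete, here is a minimal plain-Python sketch of it for a single pair of lines (the names `values` and `mask` are only for illustration, not part of my actual code):

    values = [5.29559, 2.38176, 0.51521, 0.04454, 0.00000]   # line 2 of the file
    mask = [0, 0, 0, 0, 2]                                   # line 3 of the file
    # keep each value whose entry at the same position in the mask line is 0
    kept = [v for v, m in zip(values, mask) if m == 0]
    print(kept)   # [5.29559, 2.38176, 0.51521, 0.04454]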

So far I have tried:

    import pandas as pd
    import numpy as np
    import os

    filename = r"data.txt"
    # This returns a dataframe with a single column
    df = pd.read_table(filename, header=None)
    # drop row 0 and the remaining rows whose index is a multiple of 3
    df = df.drop(index=0)
    df = df.drop(df.index[df.index % 3 == 0])
    # remove spaces from the beginning of all rows
    df = df.applymap(lambda x: x.lstrip())
    df.to_csv('file_with_quotes.txt', sep=' ', index=False)

    # rewrite the file without the quotes added by to_csv
    with open('file_with_quotes.txt', 'r') as f1, open('file_without_quotes.txt', 'w') as f2:
        line_num = 0
        for line in f1:
            if line_num == 0:  # skip the header line written by to_csv
                line_num += 1
                continue
            line = line.strip().replace('"', '')
            f2.write(line + '\n')

    # delete the intermediate file using os.remove()
    os.remove('file_with_quotes.txt')

    df_new = pd.read_table('file_without_quotes.txt', header=None)
    df_new = df[0].str.split(" ", expand=True)
    df_new = df_new.replace('', np.nan)
    # replace NaN values with empty strings
    df_new = df_new.fillna("")
    # remove empty strings using applymap()
    df_new = df_new.applymap(lambda x: x if x != "" else None)
    # left-align the remaining values in each row
    (df_new.stack()
           .groupby(level=0)
           .apply(lambda s: s.reset_index(drop=True))
           .unstack())

But I still do not see the expected result and cannot figure out my mistake. Can someone please help me solve this?

Answer 1

Score: 1

Assuming a wanted row doesn't start with 0 and the following line acts as a mask:

    import pandas as pd

    df = pd.read_table('data.txt', header=None)
    m1 = df[0] > 0                    # value rows: first column is positive
    m2 = m1.shift(fill_value=False)   # the rows right after them: the 0/2 mask rows
    out = df[m1].where(df[m2].eq(0).set_index(df[m1].index))

Output:

    >>> out
              0        1        2        3        4   5
    1   5.29559  2.38176  0.51521  0.04454      NaN NaN
    4   4.32454  1.77600  0.04454      NaN      NaN NaN
    7   1.55590  0.00000      NaN      NaN      NaN NaN
    10  5.37285  4.39339  3.56905  0.83230  0.04454 NaN
    13  4.22062  1.60321  0.00000      NaN      NaN NaN
    >>> out.fillna('')
              0        1        2        3        4  5
    1   5.29559  2.38176  0.51521  0.04454
    4   4.32454  1.77600  0.04454
    7   1.55590  0.00000
    10  5.37285  4.39339  3.56905   0.8323  0.04454
    13  4.22062  1.60321      0.0
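
As a quick sanity check (not part of the original answer), you can print which rows the two masks select on the 15-line sample above; `m1` marks the value rows and `m2`, its shifted copy, marks the 0/2 rows right after them:

    df = pd.read_table('data.txt', header=None)
    m1 = df[0] > 0
    m2 = m1.shift(fill_value=False)
    print(df[m1].index.tolist())   # [1, 4, 7, 10, 13] -> the value rows
    print(df[m2].index.tolist())   # [2, 5, 8, 11, 14] -> the 0/2 mask rows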

Update

Since you skip the first row, you can also use indexing by position:

    out = df.iloc[1::3].where(df.iloc[2::3].eq(0).set_index(df.iloc[1::3].index))
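
If you prefer to drop the header lines while reading instead of afterwards, `read_table` also accepts a callable `skiprows`; a sketch equivalent to the positional version above:

    # skip every 3rd line (0, 3, 6, ...), i.e. the "0.0 0 n 6.31000" header lines
    df = pd.read_table('data.txt', header=None, skiprows=lambda i: i % 3 == 0)
    values = df.iloc[0::2].reset_index(drop=True)   # the value rows
    masks = df.iloc[1::2].reset_index(drop=True)    # the 0/2 mask rows
    out = values.where(masks.eq(0))

The result is the same frame, only reindexed from 0.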

To make this reproducible:

    s = '0.0\t0\t5\t6.31000\t\t\n5.29559\t2.38176\t0.51521\t0.04454\t0.00000\t\n0\t0\t0\t0\t2\t\n0.0\t0\t4\t6.31000\t\t\n4.32454\t1.77600\t0.04454\t0.00000\t\t\n0\t0\t0\t2\t\t\n0.0\t0\t2\t6.31000\t\t\n1.55590\t0.00000\t\t\t\t\n0\t0\t\t\t\t\n0.0\t0\t6\t6.31000\t\t\n5.37285\t4.39339\t3.56905\t0.83230\t0.04454\t0.00000\n0\t0\t0\t0\t0\t2\n0.0\t0\t3\t6.31000\t\t\n4.22062\t1.60321\t0.00000\t\t\t\n0\t0\t0\t\t\t'
    with open('data.txt', 'w') as fp:
        print(s, file=fp)
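
Reading the file back should then confirm the shape used above (a quick check, assuming the snippet ran as shown):

    df = pd.read_table('data.txt', header=None)
    print(df.shape)   # expected: (15, 6)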
