Pandas: Dropping rows and leaving 2 blank lines every time a run of 0s appears

Question


I have a big chunk of data with a lot of columns, and at some points column "two" contains 0. Every time a run of 0s appears in column "two", I want to drop those rows and leave 2 blank lines in their place.


one  two three 
1     4     4
3     5     5
5     7     5
666   0     6
785   0     8 
455   0     9 
454   0     9
12    2     8
23    5     9
2     3     7
1     5     5 
123   0     7 
123   0     7
3     5     5
 

Desired output:


one  two three 
1     4     4
3     5     5
5     7     5


12    2     8
23    5     9
2     3     7
1     5     5 


3     5     5






I tried different functions (split, groupby, drop with conditions, ...), but none of them did what I need (most probably because I suck at coding).

Answer 1

Score: 1

You can use a custom groupby:

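import pandas as pd

# True for the rows to keep (non-zero in column "two")
m = df['two'].ne(0)

out = (df[m]
        # a new group starts wherever m switches from False to True
        .groupby((m & ~m.shift(fill_value=False)).cumsum(), group_keys=False)
        # append two empty rows after each group of kept rows
        .apply(lambda g: pd.concat([g, pd.DataFrame('', columns=g.columns, index=[0, 1])]))
        .reset_index(drop=True)
      )

print(out.to_string(index=False))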
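
To see why that groupby key works, it can help to print the intermediate masks. The sketch below rebuilds the sample data from the question; the DataFrame literal and the names starts and run_id are my own additions for illustration, not part of the answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "one":   [1, 3, 5, 666, 785, 455, 454, 12, 23, 2, 1, 123, 123, 3],
    "two":   [4, 5, 7, 0, 0, 0, 0, 2, 5, 3, 5, 0, 0, 5],
    "three": [4, 5, 5, 6, 8, 9, 9, 8, 9, 7, 5, 7, 7, 5],
})

m = df["two"].ne(0)                       # True where the row is kept
starts = m & ~m.shift(fill_value=False)   # True on the first row of each kept run
run_id = starts.cumsum()                  # one integer label per kept run

# Side-by-side view of the masks
print(pd.DataFrame({"two": df["two"], "keep": m, "start": starts, "run_id": run_id}))

# Labels actually seen by groupby after filtering with df[m]
kept_ids = run_id[m].tolist()
print(kept_ids)
```

Rows inside a zero run also carry a label, but they are filtered out by df[m] before grouping, so each group contains exactly one run of kept rows.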

Or with a hacky repeat:

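N = 2  # number of blank rows to leave after each run

m1 = df['two'].ne(0)
# True on the last row of each run of non-zero rows
m2 = (m1 & ~m1.shift(-1, fill_value=False))
# repeat each run-ending row N extra times
idx = df.index[m1].repeat(m2[m1]*N+1)

# object dtype so the numeric columns can hold ''
out = df.loc[idx].astype(object)
# blank out the repeated copies (assumes a unique index)
out[out.index.duplicated()] = ''

print(out.to_string(index=False))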

Output:

one two three
  1   4     4
  3   5     5
  5   7     5
             
             
 12   2     8
 23   5     9
  2   3     7
  1   5     5
             
             
  3   5     5
             
             
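
If you need this in more than one place, the repeat idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name blank_out_zero_runs and its parameters are hypothetical, the index is assumed to be unique, and the frame is cast to object dtype so that empty strings can sit next to the numbers.

```python
import pandas as pd


def blank_out_zero_runs(df, col="two", n_blank=2):
    """Drop rows where df[col] == 0 and leave n_blank empty rows after each kept run.

    Hypothetical helper; assumes df has a unique index.
    """
    keep = df[col].ne(0)
    # True on the last row of every run of kept rows
    run_end = keep & ~keep.shift(-1, fill_value=False)
    # object dtype so '' can be stored in otherwise numeric columns
    kept = df[keep].astype(object)
    # run-ending rows are repeated n_blank extra times, all others once
    repeats = (run_end[keep].astype(int) * n_blank + 1).to_numpy()
    out = kept.loc[kept.index.repeat(repeats)]
    # every repeated copy (not the first occurrence) becomes a blank row
    out[out.index.duplicated()] = ""
    return out.reset_index(drop=True)


# Sample data copied from the question
df = pd.DataFrame({
    "one":   [1, 3, 5, 666, 785, 455, 454, 12, 23, 2, 1, 123, 123, 3],
    "two":   [4, 5, 7, 0, 0, 0, 0, 2, 5, 3, 5, 0, 0, 5],
    "three": [4, 5, 5, 6, 8, 9, 9, 8, 9, 7, 5, 7, 7, 5],
})

result = blank_out_zero_runs(df)
print(result.to_string(index=False))
```

Like the output above, this leaves blank rows after the last run as well; if those are not wanted, they can be trimmed afterwards.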

Answer 2

Score: 0

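Here is another way:

pd.Index.union() has a sort option and sorts its result by default, so after reindexing the rows should come out in the correct order. The gap rows are filled with NaN rather than empty strings, which is why the numeric columns turn into floats.

m = df['two'].ne(0)
# kept rows keep their labels; the first row of each zero run contributes two x.5 labels that reindex fills with NaN
df.reindex(df.loc[m].index.union(df.loc[m.diff().ne(0) & ~m].index.repeat(2) + .5)).reset_index(drop=True)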

Output:

     one  two  three
0    1.0  4.0    4.0
1    3.0  5.0    5.0
2    5.0  7.0    5.0
3    NaN  NaN    NaN
4    NaN  NaN    NaN
5   12.0  2.0    8.0
6   23.0  5.0    9.0
7    2.0  3.0    7.0
8    1.0  5.0    5.0
9    NaN  NaN    NaN
10   NaN  NaN    NaN
11   3.0  5.0    5.0
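
If the float conversion is unwanted, one possible variation (not part of the answer above) is to cast to pandas' nullable Int64 dtype before reindexing, so the gap rows hold <NA> and the numbers stay integers. The sample frame and the names zero_run_start and gap_labels are assumptions for illustration; the run starts are found with shift instead of diff, which should be equivalent for this data.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "one":   [1, 3, 5, 666, 785, 455, 454, 12, 23, 2, 1, 123, 123, 3],
    "two":   [4, 5, 7, 0, 0, 0, 0, 2, 5, 3, 5, 0, 0, 5],
    "three": [4, 5, 5, 6, 8, 9, 9, 8, 9, 7, 5, 7, 7, 5],
})

m = df["two"].ne(0)
# first row of each zero run: current row is 0, previous row was kept
zero_run_start = ~m & m.shift(fill_value=True)

# two fractional labels per zero run, e.g. 3.5, 3.5 for the run starting at row 3
gap_labels = df.index[zero_run_start.to_numpy()].repeat(2) + 0.5
# union sorts by default, so the gap labels land between the kept rows
new_index = df.index[m.to_numpy()].union(gap_labels)

# nullable integers: missing rows become <NA> without turning the data into floats
out = df.astype("Int64").reindex(new_index).reset_index(drop=True)
print(out)
```

As in the answer, no blank rows are added after the final run, because only zero runs produce gap labels.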

huangapple
  • Posted on May 11, 2023, 19:31:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76227165.html