从txt文件中使用pandas读取唯一值

huangapple go评论73阅读模式
英文:

Read unique values from txt file with pandas

问题

Sure, here's the translated code part:

import pandas as pd

df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')

If you have any more code or specific questions, please feel free to ask.

英文:

I have a txt file which is formatted in this way:

 thi is    a junk data    line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706    MARS        stu    base-1
AFKSDNG-UBI 20200706    JUPITER     uyt    base-2
AFKSDNG-ABI 20200706    MARS        stu    base-1
AFKSDNG-XBI 20200706    JUPITER     uyt    base-2
AFKSDNG-XBI 20200706    MARS        stx    base-1

Please note that I have only raw data in the txt file without any column name in the header indicating the context of each column.

Each column is separated from the other by one or more whitespaces.

So for example if I wanted to count the occurances of 'MARS' it would be 2 and not 3 because the last record has the 4th column different ('stx') from the previous ones.

I need to count all the unique occurences and produce an excel file like the following:

Column 1     Column 2     Column 3   Column 4   Column 5    Column 6 (occurences)
AFKSDNG-RBI   20200706      MARS        stu        base-1     2
AFKSDNG-UBI   20200706      JUPITER     uyt        base-2     2
AFKSDNG-ABI   20200706      MARS        stu        base-1     2
AFKSDNG-XBI   20200706      JUPITER     uyt        base-2     2
AFKSDNG-XBI   20200706      MARS        stx        base-1     1

EVEN BETTER OUTPUT WOULD BE TO REMOVE THE DUPLICATED RECORDS AFTER COUNTING THEM SO:

Column 1     Column 2     Column 3   Column 4   Column 5    Column 6 (occurences)
AFKSDNG-RBI   20200706      MARS        stu        base-1     2
AFKSDNG-UBI   20200706      JUPITER     uyt        base-2     2
AFKSDNG-XBI   20200706      MARS        stx        base-1     1

I tried writing this code in python for reading and producing an Excel:

import pandas as pd

df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')

But I cannot figure out how to count the occurences. I'm new to python and pandas so any help would be highly appreciated.

-------------------------------------UPDATE---------------------------------------

I noticed a little issue if we slightly change the source txt file.
As I stated before the last 'MARS' is different from the previous ones because the 4th column 'stx' is different. In order to be unique it only takes one column from the 3rd, 4th or the 5th one to be different.

EXAMPLE

thi is    a junk data    line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706    MARS        stu    base-1
AFKSDNG-UBI 20200706    JUPITER     uyt    base-2
AFKSDNG-ABI 20200706    MARS        stu    base-1
AFKSDNG-XBI 20200706    JUPITER     uyt    base-2
AFKSDNG-XBI 20200706    MARS        stx    base-1 // different cuz stx is different
AFKSDNG-XBI 20200706    PLUTO       stu    base-1 // even though here stu and base-1 is like 'MARS' we have 'PLUTO' so this is a new row

In the accepted answer of @jezrael 'PLUTO' is counted with 'MARS'

答案1

得分: 1

使用 GroupBy.transformDataFrame.drop_duplicates 进行计数:

df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
print (df)
                 0         1        2    3       4
0  AFKSDNG-RBI  20200706     MARS  stu  base-1
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2
2  AFKSDNG-ABI  20200706     MARS  stu  base-1
3  AFKSDNG-XBI  20200706  JUPITER  uyt  base-2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1

df['new'] = df.groupby([2,3,4])[2].transform('size')

df = df.drop_duplicates([2,3,4])
print (df)
                 0         1        2    3       4  new
0  AFKSDNG-RBI  20200706     MARS  stu  base-1    2
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2    2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1    1

df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')

如果需要设置列名:

df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)

f = lambda x: f'Column {x+1}'
df = df.rename(columns=f)
print (df)
      Column 1  Column 2 Column 3 Column 4 Column 5
0  AFKSDNG-RBI  20200706     MARS      stu   base-1
1  AFKSDNG-UBI  20200706  JUPITER      uyt   base-2
2  AFKSDNG-ABI  20200706     MARS      stu   base-1
3  AFKSDNG-XBI  20200706  JUPITER      uyt   base-2
4  AFKSDNG-XBI  20200706     MARS      stx   base-1

df['Column 6']=df.groupby(['Column 3','Column 4','Column 5'])['Column 3'].transform('size')

df = df.drop_duplicates(['Column 3','Column 4','Column 5'])
print (df)
      Column 1  Column 2 Column 3 Column 4 Column 5  Column 6
0  AFKSDNG-RBI  20200706     MARS      stu   base-1         2
1  AFKSDNG-UBI  20200706  JUPITER      uyt   base-2         2
4  AFKSDNG-XBI  20200706     MARS      stx   base-1         1

df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')

编辑:使用新数据进行测试:

df['new'] = df.groupby([2,3,4])[2].transform('size')

df = df.drop_duplicates([2,3,4])
print (df)
                 0         1        2    3       4  new
0  AFKSDNG-RBI  20200706     MARS  stu  base-1    2
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2    2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1    1
5  AFKSDNG-XBI  20200706    PLUTO  stu  base-1    1
英文:

For count use GroupBy.transform with DataFrame.drop_duplicates:

df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
print (df)
             0         1        2    3       4
0  AFKSDNG-RBI  20200706     MARS  stu  base-1
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2
2  AFKSDNG-ABI  20200706     MARS  stu  base-1
3  AFKSDNG-XBI  20200706  JUPITER  uyt  base-2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1

df['new'] = df.groupby([2,3,4])[2].transform('size')

df = df.drop_duplicates([2,3,4])
print (df)
             0         1        2    3       4  new
0  AFKSDNG-RBI  20200706     MARS  stu  base-1    2
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2    2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1    1

df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')

If need set columns names:

df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)

f = lambda x: f'Column {x+1}'
df = df.rename(columns=f)
print (df)
      Column 1  Column 2 Column 3 Column 4 Column 5
0  AFKSDNG-RBI  20200706     MARS      stu   base-1
1  AFKSDNG-UBI  20200706  JUPITER      uyt   base-2
2  AFKSDNG-ABI  20200706     MARS      stu   base-1
3  AFKSDNG-XBI  20200706  JUPITER      uyt   base-2
4  AFKSDNG-XBI  20200706     MARS      stx   base-1

df['Column 6']=df.groupby(['Column 3','Column 4','Column 5'])['Column 3'].transform('size')

df = df.drop_duplicates(['Column 3','Column 4','Column 5'])
print (df)
      Column 1  Column 2 Column 3 Column 4 Column 5  Column 6
0  AFKSDNG-RBI  20200706     MARS      stu   base-1         2
1  AFKSDNG-UBI  20200706  JUPITER      uyt   base-2         2
4  AFKSDNG-XBI  20200706     MARS      stx   base-1         1

df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')

EDIT: Test with new data:

df['new'] = df.groupby([2,3,4])[2].transform('size')

df = df.drop_duplicates([2,3,4])
print (df)
             0         1        2    3       4  new
0  AFKSDNG-RBI  20200706     MARS  stu  base-1    2
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2    2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1    1
5  AFKSDNG-XBI  20200706    PLUTO  stu  base-1    1

huangapple
  • 本文由 发表于 2023年5月25日 17:18:35
  • 转载请务必保留本文链接:https://go.coder-hub.com/76330686.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定