英文:
Read unique values from txt file with pandas
问题
Sure, here's the translated code part:
import pandas as pd
df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')
If you have any more code or specific questions, please feel free to ask.
英文:
I have a txt file which is formatted in this way:
thi is a junk data line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706 MARS stu base-1
AFKSDNG-UBI 20200706 JUPITER uyt base-2
AFKSDNG-ABI 20200706 MARS stu base-1
AFKSDNG-XBI 20200706 JUPITER uyt base-2
AFKSDNG-XBI 20200706 MARS stx base-1
Please note that I have only raw data in the txt file without any column name in the header indicating the context of each column.
Each column is separated from the other by one or more whitespaces.
So for example if I wanted to count the occurances of 'MARS' it would be 2 and not 3 because the last record has the 4th column different ('stx') from the previous ones.
I need to count all the unique occurences and produce an excel file like the following:
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 (occurences)
AFKSDNG-RBI 20200706 MARS stu base-1 2
AFKSDNG-UBI 20200706 JUPITER uyt base-2 2
AFKSDNG-ABI 20200706 MARS stu base-1 2
AFKSDNG-XBI 20200706 JUPITER uyt base-2 2
AFKSDNG-XBI 20200706 MARS stx base-1 1
EVEN BETTER OUTPUT WOULD BE TO REMOVE THE DUPLICATED RECORDS AFTER COUNTING THEM SO:
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 (occurences)
AFKSDNG-RBI 20200706 MARS stu base-1 2
AFKSDNG-UBI 20200706 JUPITER uyt base-2 2
AFKSDNG-XBI 20200706 MARS stx base-1 1
I tried writing this code in python for reading and producing an Excel:
import pandas as pd
df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')
But I cannot figure out how to count the occurences. I'm new to python and pandas so any help would be highly appreciated.
-------------------------------------UPDATE---------------------------------------
I noticed a little issue if we slightly change the source txt file.
As I stated before the last 'MARS' is different from the previous ones because the 4th column 'stx' is different. In order to be unique it only takes one column from the 3rd, 4th or the 5th one to be different.
EXAMPLE
thi is a junk data line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706 MARS stu base-1
AFKSDNG-UBI 20200706 JUPITER uyt base-2
AFKSDNG-ABI 20200706 MARS stu base-1
AFKSDNG-XBI 20200706 JUPITER uyt base-2
AFKSDNG-XBI 20200706 MARS stx base-1 // different cuz stx is different
AFKSDNG-XBI 20200706 PLUTO stu base-1 // even though here stu and base-1 is like 'MARS' we have 'PLUTO' so this is a new row
In the accepted answer of @jezrael 'PLUTO' is counted with 'MARS'
答案1
得分: 1
使用 GroupBy.transform
与 DataFrame.drop_duplicates
进行计数:
df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
print (df)
0 1 2 3 4
0 AFKSDNG-RBI 20200706 MARS stu base-1
1 AFKSDNG-UBI 20200706 JUPITER uyt base-2
2 AFKSDNG-ABI 20200706 MARS stu base-1
3 AFKSDNG-XBI 20200706 JUPITER uyt base-2
4 AFKSDNG-XBI 20200706 MARS stx base-1
df['new'] = df.groupby([2,3,4])[2].transform('size')
df = df.drop_duplicates([2,3,4])
print (df)
0 1 2 3 4 new
0 AFKSDNG-RBI 20200706 MARS stu base-1 2
1 AFKSDNG-UBI 20200706 JUPITER uyt base-2 2
4 AFKSDNG-XBI 20200706 MARS stx base-1 1
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')
如果需要设置列名:
df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
f = lambda x: f'Column {x+1}'
df = df.rename(columns=f)
print (df)
Column 1 Column 2 Column 3 Column 4 Column 5
0 AFKSDNG-RBI 20200706 MARS stu base-1
1 AFKSDNG-UBI 20200706 JUPITER uyt base-2
2 AFKSDNG-ABI 20200706 MARS stu base-1
3 AFKSDNG-XBI 20200706 JUPITER uyt base-2
4 AFKSDNG-XBI 20200706 MARS stx base-1
df['Column 6']=df.groupby(['Column 3','Column 4','Column 5'])['Column 3'].transform('size')
df = df.drop_duplicates(['Column 3','Column 4','Column 5'])
print (df)
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6
0 AFKSDNG-RBI 20200706 MARS stu base-1 2
1 AFKSDNG-UBI 20200706 JUPITER uyt base-2 2
4 AFKSDNG-XBI 20200706 MARS stx base-1 1
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')
编辑:使用新数据进行测试:
df['new'] = df.groupby([2,3,4])[2].transform('size')
df = df.drop_duplicates([2,3,4])
print (df)
0 1 2 3 4 new
0 AFKSDNG-RBI 20200706 MARS stu base-1 2
1 AFKSDNG-UBI 20200706 JUPITER uyt base-2 2
4 AFKSDNG-XBI 20200706 MARS stx base-1 1
5 AFKSDNG-XBI 20200706 PLUTO stu base-1 1
英文:
For count use GroupBy.transform
with DataFrame.drop_duplicates
:
df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
print (df)
0 1 2 3 4
0 AFKSDNG-RBI 20200706 MARS stu base-1
1 AFKSDNG-UBI 20200706 JUPITER uyt base-2
2 AFKSDNG-ABI 20200706 MARS stu base-1
3 AFKSDNG-XBI 20200706 JUPITER uyt base-2
4 AFKSDNG-XBI 20200706 MARS stx base-1
df['new'] = df.groupby([2,3,4])[2].transform('size')
df = df.drop_duplicates([2,3,4])
print (df)
0 1 2 3 4 new
0 AFKSDNG-RBI 20200706 MARS stu base-1 2
1 AFKSDNG-UBI 20200706 JUPITER uyt base-2 2
4 AFKSDNG-XBI 20200706 MARS stx base-1 1
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')
If need set columns names:
df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
f = lambda x: f'Column {x+1}'
df = df.rename(columns=f)
print (df)
Column 1 Column 2 Column 3 Column 4 Column 5
0 AFKSDNG-RBI 20200706 MARS stu base-1
1 AFKSDNG-UBI 20200706 JUPITER uyt base-2
2 AFKSDNG-ABI 20200706 MARS stu base-1
3 AFKSDNG-XBI 20200706 JUPITER uyt base-2
4 AFKSDNG-XBI 20200706 MARS stx base-1
df['Column 6']=df.groupby(['Column 3','Column 4','Column 5'])['Column 3'].transform('size')
df = df.drop_duplicates(['Column 3','Column 4','Column 5'])
print (df)
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6
0 AFKSDNG-RBI 20200706 MARS stu base-1 2
1 AFKSDNG-UBI 20200706 JUPITER uyt base-2 2
4 AFKSDNG-XBI 20200706 MARS stx base-1 1
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')
EDIT: Test with new data:
df['new'] = df.groupby([2,3,4])[2].transform('size')
df = df.drop_duplicates([2,3,4])
print (df)
0 1 2 3 4 new
0 AFKSDNG-RBI 20200706 MARS stu base-1 2
1 AFKSDNG-UBI 20200706 JUPITER uyt base-2 2
4 AFKSDNG-XBI 20200706 MARS stx base-1 1
5 AFKSDNG-XBI 20200706 PLUTO stu base-1 1
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论