May 25, 2023, 17:18:35

Read unique values from txt file with pandas

Question

I have a txt file which is formatted in this way:

 thi is    a junk data    line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706    MARS        stu    base-1
AFKSDNG-UBI 20200706    JUPITER     uyt    base-2
AFKSDNG-ABI 20200706    MARS        stu    base-1
AFKSDNG-XBI 20200706    JUPITER     uyt    base-2
AFKSDNG-XBI 20200706    MARS        stx    base-1

Note that the txt file contains only raw data; there is no header row naming the columns.

Columns are separated from one another by one or more whitespace characters.

So, for example, if I wanted to count the occurrences of 'MARS', the result would be 2 and not 3, because the 4th column of the last record ('stx') differs from the previous ones.

I need to count all the unique occurrences and produce an Excel file like the following:

Column 1      Column 2   Column 3   Column 4   Column 5   Column 6 (occurrences)
AFKSDNG-RBI   20200706   MARS       stu        base-1     2
AFKSDNG-UBI   20200706   JUPITER    uyt        base-2     2
AFKSDNG-ABI   20200706   MARS       stu        base-1     2
AFKSDNG-XBI   20200706   JUPITER    uyt        base-2     2
AFKSDNG-XBI   20200706   MARS       stx        base-1     1

An even better output would remove the duplicated records after counting them:

Column 1      Column 2   Column 3   Column 4   Column 5   Column 6 (occurrences)
AFKSDNG-RBI   20200706   MARS       stu        base-1     2
AFKSDNG-UBI   20200706   JUPITER    uyt        base-2     2
AFKSDNG-XBI   20200706   MARS       stx        base-1     1

I tried this Python code to read the file and produce an Excel file:

import pandas as pd

df = pd.read_csv('CD202205.txt', engine='python', sep=r'\s{3,}', header=None, skiprows=1)
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')

But I cannot figure out how to count the occurrences. I'm new to Python and pandas, so any help would be highly appreciated.

-------------------------------------UPDATE---------------------------------------

I noticed a small issue if the source txt file changes slightly.
As I stated before, the last 'MARS' row is different from the previous ones because its 4th column ('stx') is different. A row counts as unique as soon as any one of the 3rd, 4th or 5th columns differs.

EXAMPLE

thi is    a junk data    line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706    MARS        stu    base-1
AFKSDNG-UBI 20200706    JUPITER     uyt    base-2
AFKSDNG-ABI 20200706    MARS        stu    base-1
AFKSDNG-XBI 20200706    JUPITER     uyt    base-2
AFKSDNG-XBI 20200706    MARS        stx    base-1 // different because stx is different
AFKSDNG-XBI 20200706    PLUTO       stu    base-1 // stu and base-1 match the 'MARS' rows, but the 3rd column is 'PLUTO', so this is a new row

In @jezrael's accepted answer, 'PLUTO' is counted together with 'MARS'.

Answer 1

Score: 1

To count, use GroupBy.transform together with DataFrame.drop_duplicates:

import pandas as pd

df = pd.read_csv('CD202205.txt', engine='python', sep=r'\s{3,}', header=None, skiprows=1)
print(df)
             0         1        2    3       4
0  AFKSDNG-RBI  20200706     MARS  stu  base-1
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2
2  AFKSDNG-ABI  20200706     MARS  stu  base-1
3  AFKSDNG-XBI  20200706  JUPITER  uyt  base-2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1

# count the rows in each combination of columns 2, 3 and 4
df['new'] = df.groupby([2, 3, 4])[2].transform('size')

# keep only the first row of each combination
df = df.drop_duplicates([2, 3, 4])
print(df)
             0         1        2    3       4  new
0  AFKSDNG-RBI  20200706     MARS  stu  base-1    2
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2    2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1    1

df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')
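
For anyone who wants to try this without the original file, here is a minimal self-contained sketch. It reads the sample rows from the question out of an in-memory string. Two things are assumptions made for the illustration and are not part of the answer above: the `SAMPLE` string stands in for `CD202205.txt`, and `sep=r"\s+"` is used because the question says the columns are separated by one or more whitespace characters. It produces both outputs the question asks for (all rows with a count, and the de-duplicated rows).

```python
import io

import pandas as pd

# Sample rows copied from the question; the first line is junk and is skipped.
SAMPLE = """\
 thi is    a junk data    line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706    MARS        stu    base-1
AFKSDNG-UBI 20200706    JUPITER     uyt    base-2
AFKSDNG-ABI 20200706    MARS        stu    base-1
AFKSDNG-XBI 20200706    JUPITER     uyt    base-2
AFKSDNG-XBI 20200706    MARS        stx    base-1
"""

# Assumption: fields are split on any run of whitespace (r"\s+").
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", header=None, skiprows=1)

key = [2, 3, 4]  # the columns that decide whether two records are "the same"

# Variant 1: keep every row and attach the size of the group it belongs to.
counted = df.assign(new=df.groupby(key)[2].transform("size"))

# Variant 2: keep only the first row of each group.
deduped = counted.drop_duplicates(key)

print(counted)
print(deduped)
```

Either frame can then be written out with `to_excel` exactly as in the answer.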

If you need to set column names:

df = pd.read_csv('CD202205.txt', engine='python', sep=r'\s{3,}', header=None, skiprows=1)

# rename the default labels 0..4 to 'Column 1'..'Column 5'
f = lambda x: f'Column {x+1}'
df = df.rename(columns=f)
print(df)
      Column 1  Column 2 Column 3 Column 4 Column 5
0  AFKSDNG-RBI  20200706     MARS      stu   base-1
1  AFKSDNG-UBI  20200706  JUPITER      uyt   base-2
2  AFKSDNG-ABI  20200706     MARS      stu   base-1
3  AFKSDNG-XBI  20200706  JUPITER      uyt   base-2
4  AFKSDNG-XBI  20200706     MARS      stx   base-1

df['Column 6'] = df.groupby(['Column 3', 'Column 4', 'Column 5'])['Column 3'].transform('size')

df = df.drop_duplicates(['Column 3', 'Column 4', 'Column 5'])
print(df)
      Column 1  Column 2 Column 3 Column 4 Column 5  Column 6
0  AFKSDNG-RBI  20200706     MARS      stu   base-1         2
1  AFKSDNG-UBI  20200706  JUPITER      uyt   base-2         2
4  AFKSDNG-XBI  20200706     MARS      stx   base-1         1

df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')
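
As a possible alternative to the rename-then-transform steps above, the columns can be named at read time with `names=`, and the count and de-duplication can be done in one `groupby(...).agg(...)` call. This is only a sketch under assumptions: the in-memory `SAMPLE` string replaces the real file, `sep=r"\s+"` is assumed from the question's "one or more whitespaces", and `sort=False` is used so the groups stay in order of first appearance.

```python
import io

import pandas as pd

# Sample rows copied from the question; the first line is junk and is skipped.
SAMPLE = """\
 thi is    a junk data    line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706    MARS        stu    base-1
AFKSDNG-UBI 20200706    JUPITER     uyt    base-2
AFKSDNG-ABI 20200706    MARS        stu    base-1
AFKSDNG-XBI 20200706    JUPITER     uyt    base-2
AFKSDNG-XBI 20200706    MARS        stx    base-1
"""

cols = [f"Column {i}" for i in range(1, 6)]

# Assumption: split on any whitespace run; `names=` labels the columns directly.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", header=None, skiprows=1, names=cols)

key = cols[2:]  # 'Column 3', 'Column 4', 'Column 5'

# One row per key combination: first values of the other columns plus the group size.
# The dict is unpacked because the output names contain spaces.
result = (
    df.groupby(key, sort=False)
    .agg(
        **{
            "Column 1": ("Column 1", "first"),
            "Column 2": ("Column 2", "first"),
            "Column 6": ("Column 1", "size"),
        }
    )
    .reset_index()[cols + ["Column 6"]]
)

print(result)
```

Unlike `drop_duplicates`, this resets the row index, which makes no difference once the frame is exported with `index=False`. Writing `.xlsx` files with `to_excel` needs an Excel writer package such as openpyxl to be installed.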

EDIT: Tested with the new data from the update:

df['new'] = df.groupby([2, 3, 4])[2].transform('size')

df = df.drop_duplicates([2, 3, 4])
print(df)
             0         1        2    3       4  new
0  AFKSDNG-RBI  20200706     MARS  stu  base-1    2
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2    2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1    1
5  AFKSDNG-XBI  20200706    PLUTO  stu  base-1    1
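
The result above depends entirely on which columns are passed as the grouping key. The sketch below runs the updated sample with two different keys to make that visible. The helper `count_unique` is a hypothetical name introduced only for this illustration, the in-memory `SAMPLE` string and `sep=r"\s+"` are assumptions, and the two-column key is just one guess at how 'PLUTO' could end up counted with 'MARS'; the question does not show the code that produced that result.

```python
import io

import pandas as pd

# Updated sample from the question (without the trailing // annotations).
SAMPLE = """\
thi is    a junk data    line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706    MARS        stu    base-1
AFKSDNG-UBI 20200706    JUPITER     uyt    base-2
AFKSDNG-ABI 20200706    MARS        stu    base-1
AFKSDNG-XBI 20200706    JUPITER     uyt    base-2
AFKSDNG-XBI 20200706    MARS        stx    base-1
AFKSDNG-XBI 20200706    PLUTO       stu    base-1
"""

# Assumption: fields are split on any run of whitespace (r"\s+").
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", header=None, skiprows=1)


def count_unique(frame, key):
    """Hypothetical helper: one row per distinct `key` combination, plus its count."""
    counted = frame.assign(new=frame.groupby(key)[key[0]].transform("size"))
    return counted.drop_duplicates(key)


# Key on columns 2, 3 and 4: PLUTO differs in column 2, so it gets its own row.
full_key = count_unique(df, [2, 3, 4])

# Key on columns 3 and 4 only: PLUTO shares 'stu'/'base-1' with MARS and is merged.
partial_key = count_unique(df, [3, 4])

print(full_key)
print(partial_key)
```

With the three-column key, the 'PLUTO' row is kept separately with a count of 1. With the two-column key, it is folded into the first 'MARS' group, which then reports 3.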
