How to remove everything after the last occurrence of a delimiter?

Question

I want to remove everything after the last occurrence of the _ delimiter in the HTAN Parent Biospecimen ID column.

import pandas as pd
df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [41], in <cell line: 3>()
      1 # BulkRNA-seqLevel1
      2 df_2 = pd.read_csv("syn39282161.csv", sep=",")
----> 3 df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)
      4 df_2.head()

File ~/.local/lib/python3.9/site-packages/pandas/core/strings/accessor.py:129, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    124     msg = (
    125         f"Cannot use .str.{func_name} with values of "
    126         f"inferred dtype '{self._inferred_dtype}'."
    127     )
    128     raise TypeError(msg)
--> 129 return func(self, *args, **kwargs)

TypeError: rsplit() takes from 1 to 2 positional arguments but 3 were given

Data:

pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
  1: 'BulkRNA-seqLevel1',
  2: 'BulkRNA-seqLevel1',
  3: 'BulkRNA-seqLevel1'},
 'Filename': {0: 'B001A001_1.fq.gz',
  1: 'B001A001_2.fq.gz',
  2: 'B001A006_1.fq.gz',
  3: 'B001A006_2.fq.gz'},
 'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
 'HTAN Parent Biospecimen ID': {0: 'HTA10_07_001',
  1: 'HTA10_07_001',
  2: 'HTA10_07_006',
  3: 'HTA10_07_006'}})

Expected output:

pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
  1: 'BulkRNA-seqLevel1',
  2: 'BulkRNA-seqLevel1',
  3: 'BulkRNA-seqLevel1'},
 'Filename': {0: 'B001A001_1.fq.gz',
  1: 'B001A001_2.fq.gz',
  2: 'B001A006_1.fq.gz',
  3: 'B001A006_2.fq.gz'},
 'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
 'HTAN Parent Biospecimen ID': {0: 'HTA10_07',
  1: 'HTA10_07',
  2: 'HTA10_07',
  3: 'HTA10_07'}})

Answer 1

Score: 1

Try this:

df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].apply(lambda x:"_".join(x.split("_")[:-1]))
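
One caveat, sketched below in plain Python: for a value that contains no _ at all, the split-and-join approach returns an empty string, whereas rsplit with a limit of 1 returns the value unchanged. The helper names and the third sample value are hypothetical, used only for illustration. The lambda above also assumes every value is a string; a missing value (NaN) in the column would raise an AttributeError.

```python
def strip_last_by_join(value: str, sep: str = "_") -> str:
    """Drop the last sep-delimited segment by splitting and re-joining."""
    return sep.join(value.split(sep)[:-1])


def strip_last_by_rsplit(value: str, sep: str = "_") -> str:
    """Drop the last sep-delimited segment by splitting once from the right."""
    return value.rsplit(sep, 1)[0]


# "HTA10" is a made-up value with no delimiter, to show the edge case.
samples = ["HTA10_07_001", "HTA10_07", "HTA10"]

joined = [strip_last_by_join(s) for s in samples]
rsplit_result = [strip_last_by_rsplit(s) for s in samples]

print(joined)         # the delimiter-free value becomes an empty string
print(rsplit_result)  # the delimiter-free value is kept as-is
```

Which behaviour is preferable depends on whether IDs without a delimiter can occur in the data and what should happen to them.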

Answer 2

Score: 0

Earlier versions of pandas accepted both pat and n as positional arguments, so .rsplit('_', 1) worked. For example, the pandas 1.0 documentation gives this signature for .str.rsplit:

> Series.str.rsplit(self, pat=None, n=-1, expand=False)

Newer versions make n a keyword-only argument, so you must now write n=1 explicitly instead of passing 1 positionally. The pandas 2.0 documentation for .str.rsplit gives:

> Series.str.rsplit(pat=None, *, n=-1, expand=False)

Notice the * after pat=None: every parameter that follows it, including n, can only be passed as a keyword argument.

In a nutshell, you have to change from

df_2[col].str.rsplit("_", 1).str.get(0)

to

df_2[col].str.rsplit("_", n=1).str.get(0)

and that way, it will work for all pandas versions.
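
As a minimal sketch, here is the keyword form applied to the ID column from the question. The DataFrame is rebuilt by hand with only two of the original columns, rather than read from the CSV file, so that the snippet is self-contained.

```python
import pandas as pd

# Hand-built stand-in for the question's data (two columns only).
df_2 = pd.DataFrame({
    "Filename": ["B001A001_1.fq.gz", "B001A001_2.fq.gz",
                 "B001A006_1.fq.gz", "B001A006_2.fq.gz"],
    "HTAN Parent Biospecimen ID": ["HTA10_07_001", "HTA10_07_001",
                                   "HTA10_07_006", "HTA10_07_006"],
})

col = "HTAN Parent Biospecimen ID"

# Split once from the right on "_" and keep the part before the last "_".
# n is passed by keyword, which older and newer pandas versions both accept.
df_2[col] = df_2[col].str.rsplit("_", n=1).str.get(0)

trimmed = df_2[col].tolist()
print(trimmed)
```

Values that contain no _ are left unchanged by this approach, because the split then yields a single-element list.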

Answer 3

Score: 0

You can use str.replace:

>>> df['HTAN Parent Biospecimen ID'].str.replace(r'_\d+$', '', regex=True)
0    HTA10_07
1    HTA10_07
2    HTA10_07
3    HTA10_07
Name: HTAN Parent Biospecimen ID, dtype: object

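Explanation of the regex: _\d+$ matches an underscore followed by one or more digits at the end of the string, so only that trailing segment is removed.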
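
If the final segment is not purely numeric, the digit-based pattern leaves the value untouched. One possible generalisation, shown here with the standard re module, is _[^_]*$, which matches the last underscore and whatever follows it. The sample values other than the first are made up for illustration, and the same pattern string could be passed to str.replace with regex=True.

```python
import re

# Pattern from the answer: trailing underscore followed by digits only.
DIGITS_ONLY = re.compile(r"_\d+$")
# Assumed generalisation: last underscore plus any non-underscore suffix.
ANY_SUFFIX = re.compile(r"_[^_]*$")

# Only the first value comes from the question; the others are hypothetical.
samples = ["HTA10_07_001", "HTA10_07_A1", "HTA10"]

digits_result = [DIGITS_ONLY.sub("", s) for s in samples]
any_result = [ANY_SUFFIX.sub("", s) for s in samples]

print(digits_result)  # "HTA10_07_A1" is not changed by the digit-only pattern
print(any_result)     # the general pattern trims the non-numeric suffix too
```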

Posted on May 10, 2023 at 21:17:16
Source: https://go.coder-hub.com/76218923.html