How to remove everything after the last occurrence of a delimiter?
Question
I want to remove everything after the last occurrence of the "_" delimiter in the "HTAN Parent Biospecimen ID" column.
import pandas as pd
df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)
Traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [41], in <cell line: 3>()
1 # BulkRNA-seqLevel1
2 df_2 = pd.read_csv("syn39282161.csv", sep=",")
----> 3 df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)
4 df_2.head()
File ~/.local/lib/python3.9/site-packages/pandas/core/strings/accessor.py:129, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
124 msg = (
125 f"Cannot use .str.{func_name} with values of "
126 f"inferred dtype '{self._inferred_dtype}'."
127 )
128 raise TypeError(msg)
--> 129 return func(self, *args, **kwargs)
TypeError: rsplit() takes from 1 to 2 positional arguments but 3 were given
Data:
pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
1: 'BulkRNA-seqLevel1',
2: 'BulkRNA-seqLevel1',
3: 'BulkRNA-seqLevel1'},
'Filename': {0: 'B001A001_1.fq.gz',
1: 'B001A001_2.fq.gz',
2: 'B001A006_1.fq.gz',
3: 'B001A006_2.fq.gz'},
'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
'HTAN Parent Biospecimen ID': {0: 'HTA10_07_001',
1: 'HTA10_07_001',
2: 'HTA10_07_006',
3: 'HTA10_07_006'}})
Expected output:
pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
1: 'BulkRNA-seqLevel1',
2: 'BulkRNA-seqLevel1',
3: 'BulkRNA-seqLevel1'},
'Filename': {0: 'B001A001_1.fq.gz',
1: 'B001A001_2.fq.gz',
2: 'B001A006_1.fq.gz',
3: 'B001A006_2.fq.gz'},
'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
'HTAN Parent Biospecimen ID': {0: 'HTA10_07',
1: 'HTA10_07',
2: 'HTA10_07',
3: 'HTA10_07'}})
Answer 1
Score: 1
Try this:
df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].apply(lambda x:"_".join(x.split("_")[:-1]))
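One caveat with the split-and-join approach: a value that contains no underscore becomes an empty string, and a missing value (NaN) raises an AttributeError because a float has no .split method. Below is a minimal sketch of a more defensive helper, assuming that values without a delimiter and non-string values should simply be left unchanged. The name strip_last_segment is hypothetical and not part of pandas, and the last two sample values are made up for illustration.

```python
def strip_last_segment(value, sep="_"):
    """Return value without its last sep-delimited segment.

    Assumed policy (not from the original answer): non-string values such as
    NaN, and strings that do not contain sep, are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    # rpartition splits at the LAST occurrence of sep; `found` is "" if absent
    head, found, _tail = value.rpartition(sep)
    return head if found else value


# Illustrative values: two real-looking IDs, one without a delimiter, one NaN
ids = ["HTA10_07_001", "HTA10_07_006", "HTA10", float("nan")]
cleaned = [strip_last_segment(v) for v in ids]
print(cleaned)
```

With a DataFrame, the helper could be applied as df_2["HTAN Parent Biospecimen ID"].map(strip_last_segment).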
Answer 2
Score: 0
Earlier versions of pandas accepted pat and n as positional arguments, so .rsplit('_', 1) worked. For example, here is the signature of .str.rsplit in the pandas 1.0 docs:
> Series.str.rsplit(self, pat=None, n=-1, expand=False)
Newer versions make n a keyword-only argument, so you now have to pass n=1 explicitly instead of a positional 1. Here is the signature of .str.rsplit in the pandas 2.0 docs:
> Series.str.rsplit(pat=None, *, n=-1, expand=False)
Notice the * after pat=None: it indicates that n can now only be passed as a keyword argument.
In a nutshell, you have to change from
df_2[col].str.rsplit("_", 1).str.get(0)
to
df_2[col].str.rsplit("_", n=1).str.get(0)
and that way it will work on both older and newer pandas versions.
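As a quick check, here is a minimal sketch that applies the keyword form to the sample IDs from the question. Only the one relevant column is reproduced, and the result variable is introduced just for illustration.

```python
import pandas as pd

col = "HTAN Parent Biospecimen ID"
df_2 = pd.DataFrame(
    {col: ["HTA10_07_001", "HTA10_07_001", "HTA10_07_006", "HTA10_07_006"]}
)

# n is passed by keyword: split at most once, starting from the right,
# then keep the part before the last "_"
df_2[col] = df_2[col].str.rsplit("_", n=1).str.get(0)

result = df_2[col].tolist()
print(result)  # ['HTA10_07', 'HTA10_07', 'HTA10_07', 'HTA10_07']
```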
Answer 3
Score: 0
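You can use str.replace with a regular expression (written as a raw string so that \d is not treated as a Python string escape):
>>> df['HTAN Parent Biospecimen ID'].str.replace(r'_\d+$', '', regex=True)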
0 HTA10_07
1 HTA10_07
2 HTA10_07
3 HTA10_07
Name: HTAN Parent Biospecimen ID, dtype: object
Explanation about regex: Regex 101
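Note that this pattern only removes a trailing segment made of digits. If the last segment may contain other characters, a more general pattern is one option. The sketch below checks such a pattern on its own with Python's re module; the name LAST_SEGMENT and the second and third sample values are made up for illustration. The same pattern string should also be usable with str.replace(..., regex=True).

```python
import re

# "_" followed by any run of non-underscore characters up to the end of the
# string, i.e. the last delimiter and everything after it
LAST_SEGMENT = re.compile(r"_[^_]*$")

# The first value comes from the question; the other two are illustrative
samples = ["HTA10_07_001", "HTA10_07_A1", "HTA10"]
stripped = [LAST_SEGMENT.sub("", s) for s in samples]
print(stripped)  # ['HTA10_07', 'HTA10_07', 'HTA10']
```

A value without any underscore is left unchanged, which matches the behavior of the rsplit approach.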