May 10, 2023 21:17:16

How to remove everything after the last occurrence of a delimiter?

Question

I want to remove everything after the last occurrence of the _ delimiter in the HTAN Parent Biospecimen ID column.

import pandas as pd
df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [41], in <cell line: 3>()
      1 # BulkRNA-seqLevel1
      2 df_2 = pd.read_csv("syn39282161.csv", sep=",")
----> 3 df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)
      4 df_2.head()

File ~/.local/lib/python3.9/site-packages/pandas/core/strings/accessor.py:129, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    124     msg = (
    125         f"Cannot use .str.{func_name} with values of "
    126         f"inferred dtype '{self._inferred_dtype}'."
    127     )
    128     raise TypeError(msg)
--> 129 return func(self, *args, **kwargs)

TypeError: rsplit() takes from 1 to 2 positional arguments but 3 were given

Data:

pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
  1: 'BulkRNA-seqLevel1',
  2: 'BulkRNA-seqLevel1',
  3: 'BulkRNA-seqLevel1'},
 'Filename': {0: 'B001A001_1.fq.gz',
  1: 'B001A001_2.fq.gz',
  2: 'B001A006_1.fq.gz',
  3: 'B001A006_2.fq.gz'},
 'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
 'HTAN Parent Biospecimen ID': {0: 'HTA10_07_001',
  1: 'HTA10_07_001',
  2: 'HTA10_07_006',
  3: 'HTA10_07_006'}})

Expected output:

pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
  1: 'BulkRNA-seqLevel1',
  2: 'BulkRNA-seqLevel1',
  3: 'BulkRNA-seqLevel1'},
 'Filename': {0: 'B001A001_1.fq.gz',
  1: 'B001A001_2.fq.gz',
  2: 'B001A006_1.fq.gz',
  3: 'B001A006_2.fq.gz'},
 'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
 'HTAN Parent Biospecimen ID': {0: 'HTA10_07',
  1: 'HTA10_07',
  2: 'HTA10_07',
  3: 'HTA10_07'}})

Answer 1

Score: 1

Try this:

df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].apply(lambda x: "_".join(x.split("_")[:-1]))
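
One caveat worth checking before adopting the split-and-join approach: a value that contains no `_` at all comes back as an empty string, because `split` returns a single piece and `[:-1]` drops it. A minimal sketch on made-up values (the `ids` Series below is an assumption for illustration, not the asker's file):

```python
import pandas as pd

# Hypothetical sample values; the last one has no underscore at all.
ids = pd.Series(["HTA10_07_001", "HTA10_07_006", "HTA10"])

# Split on every "_", drop the final piece, and join the rest back together.
trimmed = ids.apply(lambda x: "_".join(x.split("_")[:-1]))

print(trimmed.tolist())  # ['HTA10_07', 'HTA10_07', '']
```

Missing values (NaN) would also raise an AttributeError here, since a float has no `split` method, so this variant is probably best kept for columns known to hold only delimited strings.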

Answer 2

Score: 0

Earlier versions of pandas accepted pat and n as positional arguments, so .rsplit('_', 1) worked. For example, see the function signature for .str.rsplit in the pandas 1.0 docs:

> Series.str.rsplit(self, pat=None, n=-1, expand=False)

Newer versions make n a keyword-only argument, so you now have to write n=1 explicitly instead of passing 1 positionally. Compare the signature for .str.rsplit in the pandas 2.0 docs:

> Series.str.rsplit(pat=None, *, n=-1, expand=False)

Notice the * after pat=None: it indicates that every parameter after it, including n, can now only be passed as a keyword argument.

In a nutshell, you have to change from

df_2[col].str.rsplit("_", 1).str.get(0)

to

df_2[col].str.rsplit("_", n=1).str.get(0)

and that way it will work across pandas versions.
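
As a quick check of the keyword form, here is a minimal sketch that rebuilds only the ID column from the question's sample data (the other columns are left out for brevity) and applies the fix:

```python
import pandas as pd

col = "HTAN Parent Biospecimen ID"

# Same ID values as the question's sample data.
df_2 = pd.DataFrame(
    {col: ["HTA10_07_001", "HTA10_07_001", "HTA10_07_006", "HTA10_07_006"]}
)

# n=1, passed as a keyword, splits once starting from the right;
# .str.get(0) keeps the part before the last "_".
df_2[col] = df_2[col].str.rsplit("_", n=1).str.get(0)

print(df_2[col].tolist())  # ['HTA10_07', 'HTA10_07', 'HTA10_07', 'HTA10_07']
```

A value with no `_` is left unchanged by this approach, since the split then yields a single piece and `.str.get(0)` returns it.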

Answer 3

Score: 0

You can use str.replace:

>>> df['HTAN Parent Biospecimen ID'].str.replace(r'_\d+$', '', regex=True)
0    HTA10_07
1    HTA10_07
2    HTA10_07
3    HTA10_07
Name: HTAN Parent Biospecimen ID, dtype: object

Explanation about regex: Regex 101
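
Note that the pattern `_\d+$` only strips a trailing suffix made of digits. If the part after the last `_` may contain other characters, a more general pattern could be used instead. A minimal sketch, assuming made-up values (the `ids` Series is hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical values: a numeric suffix, a non-numeric suffix, and no underscore.
ids = pd.Series(["HTA10_07_001", "HTA10_07_A6", "HTA10"])

# "_[^_]*$" matches the last underscore and everything after it.
trimmed = ids.str.replace(r"_[^_]*$", "", regex=True)

print(trimmed.tolist())  # ['HTA10_07', 'HTA10_07', 'HTA10']
```

Values without any `_` are left as they are, because the pattern finds nothing to replace.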
