Question

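I understand that you want to modify the regular expression so that it matches the text after the third occurrence of "|" in the string, or after the last occurrence when there is no third one. The following modified pattern does this:

import re

# Modified pattern: consume up to three "...|" prefixes from the start of the
# string, then capture the text up to the next "," or "|"
searching_root = r'^(?:[^|]*\|){1,3}([^|,]+)'

def searching_taxonomy(text):
    """
    :param text: string to search
    :return: the captured text, stripped, or None if there is no match
    """
    # Search for the pattern
    match = re.search(searching_root, text)
    # If there is a match, return the captured group
    # with leading and trailing whitespace removed
    return match.group(1).strip() if match else None

The modified pattern captures the text after the third "|", or after the last "|" when the string contains fewer than three. The rest of your code can stay as it is and will still extract the information you need.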
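
As a quick check of the behaviour described above, the sketch below runs one pattern of this kind against the four sample values from the question. The pattern `^(?:[^|]*\|){1,3}([^|,]+)` and the helper name `third_or_last` are assumptions made for this sketch, not something taken from the original post. A leading greedy `.*` would push the capture to the last "|", so the sketch anchors at the start of the string and counts pipes with `{1,3}` instead.

```python
import re

# Hypothetical pattern: skip up to three "field|" prefixes from the start,
# then capture the text up to the next "," or "|".
THIRD_OR_LAST = re.compile(r'^(?:[^|]*\|){1,3}([^|,]+)')

def third_or_last(text):
    """Return the label after the third "|", or after the last "|" if there are fewer."""
    match = THIRD_OR_LAST.search(text)
    return match.group(1).strip() if match else None

# The four sample values from the question.
samples = [
    "COG3525@1|root,COG3525@2|Bacteria,46UY2@74201|Verrucomicrobia,2IU6A@203494|Verrucomicrobiae",
    "COG3525@1|root,COG3525@2|Bacteria",
    "COG3525@1|root,KOG2499@2759|Eukaryota,38D1Y@33154|Opisthokonta,3NUJ9@4751|Fungi,"
    "3QMST@4890|Ascomycota,216QI@147550|Sordariomycetes,3TDHM@5125|Hypocreales,"
    "3G4R2@34397|Clavicipitaceae",
    "COG3525@1|root,KOG2499@2759|Eukaryota,3ZBNG@5878|Ciliophora",
]

results = [third_or_last(s) for s in samples]
print(results)
```

Because `{1,3}` is greedy, the engine tries three pipes first and backtracks to two or one only when the string does not contain that many, which gives the "third or last" fallback without an explicit if statement.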

Original question (English):

I have a data frame that looks like this:

id	ec_number_clean	euclidean_clean	cluster	label	is_akker	eggNOG_OGs	COG_category	Description	GOs	EC	CAZy
04XYNu_00699	3.2.1.52	0.0968	49	non-singleton	akkermansia	COG3525@1|root,COG3525@2|Bacteria,46UY2@74201|Verrucomicrobia,2IU6A@203494|Verrucomicrobiae	G	Glycoside hydrolase, family 20, catalytic core	-	3.2.1.52	GH20

The column I'm interested in is eggNOG_OGs. Its format is not the same in every row. Here are some examples:

COG3525@1|root,COG3525@2|Bacteria,46UY2@74201|Verrucomicrobia,2IU6A@203494|Verrucomicrobiae

COG3525@1|root,COG3525@2|Bacteria

COG3525@1|root,KOG2499@2759|Eukaryota,38D1Y@33154|Opisthokonta,3NUJ9@4751|Fungi,3QMST@4890|Ascomycota,216QI@147550|Sordariomycetes,3TDHM@5125|Hypocreales,3G4R2@34397|Clavicipitaceae

COG3525@1|root,KOG2499@2759|Eukaryota,3ZBNG@5878|Ciliophora

As you can see, the delimiter to follow here is the "|" (pipe) in the string.
My code uses a regex to find the last occurrence of "|" and creates a new column holding the string that comes immediately after it.

Now I need to do something slightly different. Instead of the last occurrence, I need the string that comes right after the third occurrence of "|". For the four example lines above, the new column must contain the following values, one per row:

Verrucomicrobia
Bacteria
Opisthokonta
Ciliophora

There is one small detail: sometimes there is no third occurrence of "|". In that case, use the string after the last occurrence. That is why the second line gives Bacteria: it has no third "|".

Here is my code, which works perfectly for finding the string after the last occurrence of "|":

import re
import sys

import pandas as pd

# Read file paths from the command line
input_file_1 = sys.argv[1]
output_file_1 = sys.argv[2]

# .*: greedily match any characters (except newlines), so the \| that follows lands on the last "|"
# \|: match a literal "|"
# ([^|]+)$: capture everything after the last occurrence of "|".
# [^|]+ means one or more characters that are not "|". Finally, $ matches the end of the string.
searching_root = r'.*\|([^|]+)$'

def searching_taxonomy(text):
    """
    :param text: string to search
    :return: the captured text, stripped, or None if there is no match
    """
    # Search for the pattern
    match = re.search(searching_root, text)
    # If there is a match, return the captured group
    # with leading and trailing whitespace removed
    return match.group(1).strip() if match else None


# Define data frame
df_input = pd.read_csv(input_file_1, header=0, sep="\t")

# Create a new column and apply the function above to fill it with the matches
df_input['eggnog_taxonomy'] = df_input['eggNOG_OGs'].apply(searching_taxonomy)

I do not know whether the regex pattern I'm using has a particular name, but I know it has "greedy" behavior. My goal, however, is more like a stricter kind of greedy behavior: I need the string after the third occurrence of "|" and nothing more, and if there are fewer than three occurrences, the string after the last one.

Is there a way to do this by modifying only the pattern, perhaps by combining some regex techniques?
I could add an if statement based on the number of occurrences, but first I want to check whether it is possible by modifying the regex alone.
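
For comparison, the if-statement fallback mentioned above might look like the sketch below. It counts the pipes with `str.split` and picks the segment after the third one, or after the last one when there are fewer. The function name `taxonomy_by_count` and its parameter `n` are hypothetical names chosen for this illustration.

```python
def taxonomy_by_count(text, n=3):
    """Return the label after the n-th "|", or after the last "|" if there are fewer."""
    parts = text.split('|')
    pipe_count = len(parts) - 1
    if pipe_count == 0:
        # No "|" at all: nothing to extract.
        return None
    # Use the segment after the n-th pipe when it exists, else the last segment.
    if pipe_count >= n:
        segment = parts[n]
    else:
        segment = parts[pipe_count]
    # Each segment looks like "Label,NEXT_ID@taxid"; keep only the label.
    return segment.split(',')[0].strip()

# The four sample values from the question.
samples = [
    "COG3525@1|root,COG3525@2|Bacteria,46UY2@74201|Verrucomicrobia,2IU6A@203494|Verrucomicrobiae",
    "COG3525@1|root,COG3525@2|Bacteria",
    "COG3525@1|root,KOG2499@2759|Eukaryota,38D1Y@33154|Opisthokonta,3NUJ9@4751|Fungi,"
    "3QMST@4890|Ascomycota,216QI@147550|Sordariomycetes,3TDHM@5125|Hypocreales,"
    "3G4R2@34397|Clavicipitaceae",
    "COG3525@1|root,KOG2499@2759|Eukaryota,3ZBNG@5878|Ciliophora",
]

labels = [taxonomy_by_count(s) for s in samples]
print(labels)
```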

Answer 1

Score: 1


It can be achieved with a combination of split + replace:

df['eggNOG_OGs'].str.split('|', n=3).str[-1].replace(r',.*', '', regex=True)

Out[367]: 
0    Verrucomicrobia
1           Bacteria
2       Opisthokonta
3         Ciliophora
Name: eggNOG_OGs, dtype: object
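
To see why this works, the sketch below applies the same two steps to a plain Python string instead of a pandas Series: `str.split` with `maxsplit=3` (the counterpart of `n=3` above) followed by `re.sub`. The helper name `split_then_trim` is an assumption made for this illustration.

```python
import re

def split_then_trim(text):
    # Step 1: split on at most three "|", so the last piece starts right after
    # the third pipe (or after the last pipe when there are fewer than three).
    tail = text.split('|', 3)[-1]
    # Step 2: drop everything from the first comma onward, leaving only the label.
    return re.sub(r',.*', '', tail)

value = ("COG3525@1|root,COG3525@2|Bacteria,46UY2@74201|"
         "Verrucomicrobia,2IU6A@203494|Verrucomicrobiae")

pieces = value.split('|', 3)
print(pieces[-1])              # the tail that step 2 trims
print(split_then_trim(value))
```

One difference from the regex-only approach is worth noting: a value with no "|" at all is not turned into a missing value here. The split returns the whole string and only the part after the first comma is removed.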
