May 21, 2023, 05:09:43

Replace characters and extract substrings from pandas dataframe

Question

I have the following pandas dataframe (the original dataframe has more rows). I would like to replace some characters and extract substrings.

I am using the following regex, but it does not remove the '?' from some rows, such as rows 6, 7 and 8.

df[['label', 'id']] = df['name'].str.extract(r'\{?\??\|?[[{]?(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)')

You-Hoover-Fong syndrome, 616954 (3)
Yuan-Harel-Lupski syndrome (4)
Zaki syndrome, 619648 (3)
Zimmermann-Laband syndrome 2, 616455 (3)
Zimmermann-Laband syndrome 3, 618658 (3)
[?Birbeck granule deficiency], 613393 (3)
[?Homosexuality, male] (2)
[?Phosphohydroxylysinuria], 615011 (3)
[Acetylation, slow], 243400 (3)

The expected output is:

You-Hoover-Fong syndrome          616954  
Yuan-Harel-Lupski syndrome
Zaki syndrome                     619648
Zimmermann-Laband syndrome 2      616455 
Zimmermann-Laband syndrome 3      618658 
Birbeck granule deficiency        613393 
Homosexuality, male 
Phosphohydroxylysinuria           615011 
Acetylation, slow                 243400

How can I modify the current regex so that it also removes the '?' from the rows mentioned above?
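
To see why the '?' survives, it can help to run the pattern on a few rows with plain `re.search`, which is the kind of first-match search that `str.extract` is documented to perform per row. The following is a minimal sketch, not part of the original post: the names `QUESTION_PATTERN`, `rows` and `results` are made up for illustration, and the brackets inside the character classes are escaped (an equivalent spelling that avoids Python's "possible nested set" warning).

```python
import re

# The question's pattern, with "[" and "]" escaped inside the character classes.
# The optional prefix is tried in a fixed order: "{", then "?", then "|",
# then one of "[" or "{".
QUESTION_PATTERN = r"\{?\??\|?[\[{]?(.*?)[\]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)"

rows = [
    "[?Birbeck granule deficiency], 613393 (3)",
    "[?Homosexuality, male] (2)",
    "[Acetylation, slow], 243400 (3)",
]

# groups() returns (label, id); id is None when the row has no number.
results = [re.search(QUESTION_PATTERN, row).groups() for row in rows]

for row, groups in zip(rows, results):
    print(row, "->", groups)
```

In the rows that start with `[?`, the `?` comes after the `[`, so the `\??` step has already been skipped by the time the `[` is consumed, and the `?` ends up inside the first capture group.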

Answer 1

Score: 1

Try:

df['number'] = df['text'].str.extract(r'(\d{6})').fillna('')
df['text'] = df['text'].str.extract(r'^[^a-zA-Z]*(.*?(?:\s*(?<!\()\d{,2}))[^a-zA-Z]*$')
df[&#39;text&#39;] = df[&#39;text&#39;].str.strip()
print(df)

Prints:

                           text  number
0      You-Hoover-Fong syndrome  616954
1    Yuan-Harel-Lupski syndrome        
2                 Zaki syndrome  619648
3  Zimmermann-Laband syndrome 2  616455
4  Zimmermann-Laband syndrome 3  618658
5    Birbeck granule deficiency  613393
6           Homosexuality, male        
7       Phosphohydroxylysinuria  615011
8             Acetylation, slow  243400

Initial dataframe:

                                        text
0       You-Hoover-Fong syndrome, 616954 (3)
1             Yuan-Harel-Lupski syndrome (4)
2                  Zaki syndrome, 619648 (3)
3   Zimmermann-Laband syndrome 2, 616455 (3)
4   Zimmermann-Laband syndrome 3, 618658 (3)
5  [?Birbeck granule deficiency], 613393 (3)
6                 [?Homosexuality, male] (2)
7     [?Phosphohydroxylysinuria], 615011 (3)
8            [Acetylation, slow], 243400 (3)

Answer 2

Score: 1

You could just change the first part of your regex to match any number of `[`, `{`, `|` or `?` characters:
```regex
[[{?|]*(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)
```

In Python:

df[['label', 'id']] = df['name'].str.extract(r'[[{?|]*(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)').fillna('')

Output:

                                        name                         label      id
0       You-Hoover-Fong syndrome, 616954 (3)      You-Hoover-Fong syndrome  616954
1             Yuan-Harel-Lupski syndrome (4)    Yuan-Harel-Lupski syndrome
2                  Zaki syndrome, 619648 (3)                 Zaki syndrome  619648
3   Zimmermann-Laband syndrome 2, 616455 (3)  Zimmermann-Laband syndrome 2  616455
4   Zimmermann-Laband syndrome 3, 618658 (3)  Zimmermann-Laband syndrome 3  618658
5  [?Birbeck granule deficiency], 613393 (3)    Birbeck granule deficiency  613393
6                 [?Homosexuality, male] (2)           Homosexuality, male
7     [?Phosphohydroxylysinuria], 615011 (3)       Phosphohydroxylysinuria  615011
8            [Acetylation, slow], 243400 (3)             Acetylation, slow  243400
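
For anyone who wants to run this end to end, here is a self-contained sketch of the same idea on the sample rows from the question. A few things here are assumptions rather than part of the answer: the brackets inside the character classes are escaped (equivalent, but avoids Python's "possible nested set" warning), the names `PATTERN` and `extracted` are illustrative, and the result is attached with explicit column names and `join` instead of a direct multi-column assignment.

```python
import pandas as pd

# Leading run of "[", "{", "?" or "|" in any order, then the lazy label,
# an optional closing bracket, an optional ", <id>", and the trailing "(n)".
PATTERN = r"[\[{?|]*(.*?)[\]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)"

df = pd.DataFrame({"name": [
    "You-Hoover-Fong syndrome, 616954 (3)",
    "Yuan-Harel-Lupski syndrome (4)",
    "Zaki syndrome, 619648 (3)",
    "Zimmermann-Laband syndrome 2, 616455 (3)",
    "Zimmermann-Laband syndrome 3, 618658 (3)",
    "[?Birbeck granule deficiency], 613393 (3)",
    "[?Homosexuality, male] (2)",
    "[?Phosphohydroxylysinuria], 615011 (3)",
    "[Acetylation, slow], 243400 (3)",
]})

# One column per capture group; rows without an id get an empty string.
extracted = df["name"].str.extract(PATTERN).fillna("")
extracted.columns = ["label", "id"]

df = df.join(extracted)
print(df)
```

Because the leading character class is repeated with `*`, the order of the prefix characters no longer matters, which is what lets `[?` be stripped as a unit.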
