Replace characters and extract substrings from pandas dataframe
Question
I would like to replace some characters and extract substrings from a pandas dataframe (sample rows below; the original dataframe has more rows).
I am using the following regex, but it does not remove the '?' from some rows, such as rows 6, 7 and 8:
df[['label', 'id']] = df['name'].str.extract(r'\{?\??\|?[[{]?(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)')
Sample rows of df['name']:
You-Hoover-Fong syndrome, 616954 (3)
Yuan-Harel-Lupski syndrome (4)
Zaki syndrome, 619648 (3)
Zimmermann-Laband syndrome 2, 616455 (3)
Zimmermann-Laband syndrome 3, 618658 (3)
[?Birbeck granule deficiency], 613393 (3)
[?Homosexuality, male] (2)
[?Phosphohydroxylysinuria], 615011 (3)
[Acetylation, slow], 243400 (3)
The expected output is:
You-Hoover-Fong syndrome      616954
Yuan-Harel-Lupski syndrome
Zaki syndrome                 619648
Zimmermann-Laband syndrome 2  616455
Zimmermann-Laband syndrome 3  618658
Birbeck granule deficiency    613393
Homosexuality, male
Phosphohydroxylysinuria       615011
Acetylation, slow             243400
How can I modify the current regex so that it also removes the '?' from those rows?
Answer 1
Score: 1
Try:
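# pull out the 6-digit number first (empty string when a row has none)
df['number'] = df['text'].str.extract(r'(\d{6})').fillna('')
# keep the label: skip leading non-letters, allow a short digit suffix such as the "2" in "syndrome 2"
df['text'] = df['text'].str.extract(r'^[^a-zA-Z]*(.*?(?:\s*(?<!\()\d{,2}))[^a-zA-Z]*$')
# remove the trailing space captured when the label is followed directly by " (n)"
df['text'] = df['text'].str.strip()
print(df)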
Prints:
                           text  number
0      You-Hoover-Fong syndrome  616954
1    Yuan-Harel-Lupski syndrome
2                 Zaki syndrome  619648
3  Zimmermann-Laband syndrome 2  616455
4  Zimmermann-Laband syndrome 3  618658
5    Birbeck granule deficiency  613393
6           Homosexuality, male
7       Phosphohydroxylysinuria  615011
8             Acetylation, slow  243400
Initial dataframe:
text
0 You-Hoover-Fong syndrome, 616954 (3)
1 Yuan-Harel-Lupski syndrome (4)
2 Zaki syndrome, 619648 (3)
3 Zimmermann-Laband syndrome 2, 616455 (3)
4 Zimmermann-Laband syndrome 3, 618658 (3)
5 [?Birbeck granule deficiency], 613393 (3)
6 [?Homosexuality, male] (2)
7 [?Phosphohydroxylysinuria], 615011 (3)
8 [Acetylation, slow], 243400 (3)
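For anyone who wants to see what each part of the label pattern does, here is a minimal sketch that spells it out with `re.VERBOSE`, using only the standard library. The names `TEXT_RE` and `split_entry` are made up for this illustration, and it assumes `Series.str.extract` takes the first match of the pattern in each string, the same way `re.search` does.

```python
import re

# Same label pattern as above, spread over several lines (illustrative layout only).
TEXT_RE = re.compile(r"""
    ^[^a-zA-Z]*              # skip leading non-letters such as '[' and '?'
    (                        # group 1: the label
      .*?                    #   as little text as possible ...
      (?:\s*(?<!\()\d{,2})   #   ... plus an optional short digit suffix not preceded by '('
    )
    [^a-zA-Z]*$              # the rest of the string may not contain any letter
""", re.VERBOSE)

def split_entry(entry):
    """Hypothetical helper: return (label, number) for one raw entry."""
    number = re.search(r"\d{6}", entry)
    label = TEXT_RE.search(entry)
    return (
        label.group(1).strip() if label else "",
        number.group(0) if number else "",
    )

for raw in ["Zimmermann-Laband syndrome 2, 616455 (3)",
            "Yuan-Harel-Lupski syndrome (4)",
            "[?Birbeck granule deficiency], 613393 (3)"]:
    print(split_entry(raw))
```

The lazy `.*?` stops at the first point where everything left in the string is free of letters, which is why the comma, the number and the trailing `(n)` all fall outside the group.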
Answer 2
Score: 1
You could just change the first part of your regex to match any number of `[`, `{`, `|` or `?` characters:
```regex
[[{?|]*(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)
```
In Python:
df[['label', 'id']] = df['name'].str.extract(r'[[{?|]*(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)').fillna('')
Output:
                                        name                         label      id
0       You-Hoover-Fong syndrome, 616954 (3)      You-Hoover-Fong syndrome  616954
1             Yuan-Harel-Lupski syndrome (4)    Yuan-Harel-Lupski syndrome
2                  Zaki syndrome, 619648 (3)                 Zaki syndrome  619648
3   Zimmermann-Laband syndrome 2, 616455 (3)  Zimmermann-Laband syndrome 2  616455
4   Zimmermann-Laband syndrome 3, 618658 (3)  Zimmermann-Laband syndrome 3  618658
5  [?Birbeck granule deficiency], 613393 (3)    Birbeck granule deficiency  613393
6                 [?Homosexuality, male] (2)           Homosexuality, male
7     [?Phosphohydroxylysinuria], 615011 (3)       Phosphohydroxylysinuria  615011
8            [Acetylation, slow], 243400 (3)             Acetylation, slow  243400
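To see why changing only the prefix is enough, here is a small sketch that runs both prefixes against a few of the rows with the standard `re` module. It assumes `str.extract` follows the same first-match semantics as `re.search`. The names `TAIL`, `OLD`, `NEW` and `extract` are invented for this example, and the brackets inside the character classes are written escaped (`[\[{?|]`, `[\]}]`), which should match the same characters as the unescaped spelling.

```python
import re

# Shared tail of both patterns: the label, an optional closing bracket,
# an optional ", <id>", then the trailing " (n)".
TAIL = r"(.*?)[\]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)"

# Prefix from the question: optional '{', '?', '|' and one bracket, in that fixed order.
OLD = re.compile(r"\{?\??\|?[\[{]?" + TAIL)
# Prefix from this answer: any run of '[', '{', '?' or '|', in any order.
NEW = re.compile(r"[\[{?|]*" + TAIL)

def extract(pattern, name):
    """Return (label, id) from the first match, using '' for a missing id."""
    m = pattern.search(name)
    if m is None:
        return ("", "")
    return (m.group(1), m.group(2) or "")

for name in ["[?Birbeck granule deficiency], 613393 (3)",
             "[?Homosexuality, male] (2)",
             "Zimmermann-Laband syndrome 2, 616455 (3)"]:
    print(extract(OLD, name), extract(NEW, name))
```

With the original prefix the optional `\??` is tried before the bracket, so in `[?...` the `?` comes too late to be consumed and ends up inside the captured label. The character class with `*` does not care about the order.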