Replace characters and extract substrings from pandas dataframe

Question

I have the following pandas dataframe (the original dataframe has more rows). I would like to remove some characters and extract substrings.

You-Hoover-Fong syndrome, 616954 (3)
Yuan-Harel-Lupski syndrome (4)
Zaki syndrome, 619648 (3)
Zimmermann-Laband syndrome 2, 616455 (3)
Zimmermann-Laband syndrome 3, 618658 (3)
[?Birbeck granule deficiency], 613393 (3)
[?Homosexuality, male] (2)
[?Phosphohydroxylysinuria], 615011 (3)
[Acetylation, slow], 243400 (3)

I am using the following regex, but it does not remove the '?' from some rows, such as rows 6, 7, and 8:

df[['label', 'id']] = df['name'].str.extract(r'\{?\??\|?[[{]?(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)')

The expected output is:

You-Hoover-Fong syndrome          616954  
Yuan-Harel-Lupski syndrome
Zaki syndrome                     619648
Zimmermann-Laband syndrome 2      616455 
Zimmermann-Laband syndrome 3      618658 
Birbeck granule deficiency        613393 
Homosexuality, male 
Phosphohydroxylysinuria           615011 
Acetylation, slow                 243400 

How can I modify the current regex so that it also removes the '?' from the rows mentioned above?

Answer 1

Score: 1

Try:

df['number'] = df['text'].str.extract(r'(\d{6})').fillna('')
df['text'] = df['text'].str.extract(r'^[^a-zA-Z]*(.*?(?:\s*(?<!\()\d{,2}))[^a-zA-Z]*$')
df['text'] = df['text'].str.strip()
print(df)

Prints:

                           text  number
0      You-Hoover-Fong syndrome  616954
1    Yuan-Harel-Lupski syndrome        
2                 Zaki syndrome  619648
3  Zimmermann-Laband syndrome 2  616455
4  Zimmermann-Laband syndrome 3  618658
5    Birbeck granule deficiency  613393
6           Homosexuality, male        
7       Phosphohydroxylysinuria  615011
8             Acetylation, slow  243400

Initial dataframe:

                                        text
0       You-Hoover-Fong syndrome, 616954 (3)
1             Yuan-Harel-Lupski syndrome (4)
2                  Zaki syndrome, 619648 (3)
3   Zimmermann-Laband syndrome 2, 616455 (3)
4   Zimmermann-Laband syndrome 3, 618658 (3)
5  [?Birbeck granule deficiency], 613393 (3)
6                 [?Homosexuality, male] (2)
7     [?Phosphohydroxylysinuria], 615011 (3)
8            [Acetylation, slow], 243400 (3)
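
To experiment with the two patterns from this answer outside pandas, here is a minimal sketch using only the standard `re` module. It relies on `Series.str.extract` returning the capture groups of the first match in each string, which is also what `re.search` gives. The helper name `split_row` is hypothetical, and `\d{0,2}` is simply `\d{,2}` written out in full.

```python
import re

# Patterns taken from the answer above; \d{0,2} is the same quantifier as \d{,2}.
NUMBER_RE = re.compile(r'(\d{6})')
TEXT_RE = re.compile(r'^[^a-zA-Z]*(.*?(?:\s*(?<!\()\d{0,2}))[^a-zA-Z]*$')


def split_row(row):
    """Hypothetical helper: return (text, number) for one raw row."""
    number = NUMBER_RE.search(row)
    text = TEXT_RE.search(row)
    return (
        text.group(1).strip() if text else '',   # mirrors .str.strip()
        number.group(1) if number else '',       # mirrors .fillna('')
    )


rows = [
    'Yuan-Harel-Lupski syndrome (4)',
    'Zimmermann-Laband syndrome 2, 616455 (3)',
    '[?Birbeck granule deficiency], 613393 (3)',
    '[?Homosexuality, male] (2)',
]

for row in rows:
    print(split_row(row))
```

The leading and trailing `[^a-zA-Z]*` drop everything before the first letter and after the last one, so brackets, `?`, the six-digit id, and the `(n)` suffix all fall outside the group. The optional `\s*\d{0,2}` keeps a short trailing number such as the `2` in `syndrome 2`.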

Answer 2

Score: 1

You could just change the first part of your regex to match any number of `[`, `{`, `|` or `?` characters:
```regex
[[{?|]*(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)
```

In Python:

df[['label', 'id']] = df['name'].str.extract(r'[[{?|]*(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)').fillna('')

Output:

                                        name                         label      id
0       You-Hoover-Fong syndrome, 616954 (3)      You-Hoover-Fong syndrome  616954
1             Yuan-Harel-Lupski syndrome (4)    Yuan-Harel-Lupski syndrome
2                  Zaki syndrome, 619648 (3)                 Zaki syndrome  619648
3   Zimmermann-Laband syndrome 2, 616455 (3)  Zimmermann-Laband syndrome 2  616455
4   Zimmermann-Laband syndrome 3, 618658 (3)  Zimmermann-Laband syndrome 3  618658
5  [?Birbeck granule deficiency], 613393 (3)    Birbeck granule deficiency  613393
6                 [?Homosexuality, male] (2)           Homosexuality, male
7     [?Phosphohydroxylysinuria], 615011 (3)       Phosphohydroxylysinuria  615011
8            [Acetylation, slow], 243400 (3)             Acetylation, slow  243400
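
To see why the change matters, the sketch below runs the question's pattern and this answer's pattern side by side with the standard `re` module (chosen here so the snippet needs no pandas; `re.search` returns the same first-match groups that `str.extract` uses). The names `OLD`, `NEW`, and `extract` are hypothetical. The brackets inside the character classes are escaped, which matches the same characters and avoids the `FutureWarning` Python's `re` can raise for a set that starts with a literal `[`.

```python
import re

# Question's pattern: the optional prefix characters must appear in a fixed
# order ({ then ? then | then [ or {), so "[?" leaves the "?" in the label.
OLD = re.compile(r'\{?\??\|?[\[{]?(.*?)[\]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)')

# Answer's pattern: one character class repeated with *, so any mix of
# [, {, ? and | is consumed before the label starts.
NEW = re.compile(r'[\[{?|]*(.*?)[\]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)')


def extract(pattern, name):
    """Hypothetical helper: return (label, id), with '' for a missing group."""
    match = pattern.search(name)
    if match is None:
        return ('', '')
    return tuple(group or '' for group in match.groups())


rows = [
    'You-Hoover-Fong syndrome, 616954 (3)',
    '[?Phosphohydroxylysinuria], 615011 (3)',
    '[?Homosexuality, male] (2)',
    '[Acetylation, slow], 243400 (3)',
]

for row in rows:
    print(extract(OLD, row), extract(NEW, row))
```

With `OLD`, the `[` is consumed by `[\[{]?` but the `?` that follows it has nowhere to go except the lazy group, which is why rows starting with `[?` kept the question mark.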

huangapple
  • Posted on May 21, 2023, 05:09:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76297358.html