Replace characters and extract substrings from a pandas dataframe

Question

I have a pandas dataframe with a `name` column (sample rows are listed below; the original dataframe has more rows). I would like to strip some characters from that column and extract substrings from it.

I am using the following regex, but it does not remove the `?` from rows such as 6, 7 and 8:

    df[['label', 'id']] = df['name'].str.extract(r'\{?\??\|?[[{]?(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)')

The `name` column contains:

  1. You-Hoover-Fong syndrome, 616954 (3)
  2. Yuan-Harel-Lupski syndrome (4)
  3. Zaki syndrome, 619648 (3)
  4. Zimmermann-Laband syndrome 2, 616455 (3)
  5. Zimmermann-Laband syndrome 3, 618658 (3)
  6. [?Birbeck granule deficiency], 613393 (3)
  7. [?Homosexuality, male] (2)
  8. [?Phosphohydroxylysinuria], 615011 (3)
  9. [Acetylation, slow], 243400 (3)

The expected output is:

  1. You-Hoover-Fong syndrome 616954
  2. Yuan-Harel-Lupski syndrome
  3. Zaki syndrome 619648
  4. Zimmermann-Laband syndrome 2 616455
  5. Zimmermann-Laband syndrome 3 618658
  6. Birbeck granule deficiency 613393
  7. Homosexuality, male
  8. Phosphohydroxylysinuria 615011
  9. Acetylation, slow 243400

How can I modify the current regex so that it also removes the `?` from the rows mentioned above?
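To see where the `?` survives, the pattern from the question can be run against one of the bracketed rows with Python's `re` module. This is only a minimal sketch: `re.search` stands in for the first-match behaviour of `str.extract`, the variable names are illustrative, and the brackets inside the character classes are escaped here (an assumption that should not change what they match). The optional tokens are tried in a fixed order (`{`, then `?`, then `|`, then `[` or `{`), so in `[?Birbeck...` the `[` is consumed by the last of them and the `?` after it falls into the first capture group.

```python
import re

# Pattern from the question; only the brackets inside the character
# classes have been escaped for this sketch.
ORIGINAL = r'\{?\??\|?[\[{]?(.*?)[\]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)'

row = "[?Birbeck granule deficiency], 613393 (3)"
match = re.search(ORIGINAL, row)

# Group 1 is the label, group 2 the optional id.
label, id_ = match.groups()
print(repr(label))  # the leading '?' is still part of the label
print(repr(id_))
```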

Answer 1

Score: 1

Try:

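    # Take the six-digit id from the original string (empty string if there is none)
    df['number'] = df['text'].str.extract(r'(\d{6})').fillna('')
    # Keep only the label: drop leading non-letters and the trailing ", id (n)" part
    df['text'] = df['text'].str.extract(r'^[^a-zA-Z]*(.*?(?:\s*(?<!\()\d{,2}))[^a-zA-Z]*$')
    df['text'] = df['text'].str.strip()
    print(df)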

Prints:

       text                          number
    0  You-Hoover-Fong syndrome      616954
    1  Yuan-Harel-Lupski syndrome
    2  Zaki syndrome                 619648
    3  Zimmermann-Laband syndrome 2  616455
    4  Zimmermann-Laband syndrome 3  618658
    5  Birbeck granule deficiency    613393
    6  Homosexuality, male
    7  Phosphohydroxylysinuria       615011
    8  Acetylation, slow             243400

Initial dataframe:

       text
    0  You-Hoover-Fong syndrome, 616954 (3)
    1  Yuan-Harel-Lupski syndrome (4)
    2  Zaki syndrome, 619648 (3)
    3  Zimmermann-Laband syndrome 2, 616455 (3)
    4  Zimmermann-Laband syndrome 3, 618658 (3)
    5  [?Birbeck granule deficiency], 613393 (3)
    6  [?Homosexuality, male] (2)
    7  [?Phosphohydroxylysinuria], 615011 (3)
    8  [Acetylation, slow], 243400 (3)
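
For reference, the steps of this answer can be put together into a self-contained script. This is a sketch under a few assumptions: the column is called `text`, as in the initial dataframe above, the nine sample rows are rebuilt by hand, and `expand=False` is added so that each `str.extract` call returns a Series rather than a one-column dataframe.

```python
import pandas as pd

# Sample rows copied from the question; the column name `text` follows this answer.
df = pd.DataFrame({"text": [
    "You-Hoover-Fong syndrome, 616954 (3)",
    "Yuan-Harel-Lupski syndrome (4)",
    "Zaki syndrome, 619648 (3)",
    "Zimmermann-Laband syndrome 2, 616455 (3)",
    "Zimmermann-Laband syndrome 3, 618658 (3)",
    "[?Birbeck granule deficiency], 613393 (3)",
    "[?Homosexuality, male] (2)",
    "[?Phosphohydroxylysinuria], 615011 (3)",
    "[Acetylation, slow], 243400 (3)",
]})

# First run of six digits in the original string, or '' when there is none.
df["number"] = df["text"].str.extract(r"(\d{6})", expand=False).fillna("")

# Skip leading non-letters, lazily capture the label (optionally followed by
# up to two digits that do not come directly after an opening parenthesis,
# such as the "2" in "syndrome 2"), and require that only non-letters remain
# up to the end of the string. The final strip removes stray whitespace.
df["text"] = df["text"].str.extract(
    r"^[^a-zA-Z]*(.*?(?:\s*(?<!\()\d{,2}))[^a-zA-Z]*$", expand=False
).str.strip()

print(df)
```

Note that `number` has to be extracted before `text` is overwritten, because the second step discards the id.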

Answer 2

Score: 1

You could just change the first part of your regex to match any number of `[`, `{`, `|` or `?` characters:

```regex
[[{?|]*(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)
```

In Python:

    df[['label', 'id']] = df['name'].str.extract(r'[[{?|]*(.*?)[]}]?(?:,\s+(\d{3,100}))?\s+\(\d+\)').fillna('')

Output:

       name                                       label                         id
    0  You-Hoover-Fong syndrome, 616954 (3)       You-Hoover-Fong syndrome      616954
    1  Yuan-Harel-Lupski syndrome (4)             Yuan-Harel-Lupski syndrome
    2  Zaki syndrome, 619648 (3)                  Zaki syndrome                 619648
    3  Zimmermann-Laband syndrome 2, 616455 (3)   Zimmermann-Laband syndrome 2  616455
    4  Zimmermann-Laband syndrome 3, 618658 (3)   Zimmermann-Laband syndrome 3  618658
    5  [?Birbeck granule deficiency], 613393 (3)  Birbeck granule deficiency    613393
    6  [?Homosexuality, male] (2)                 Homosexuality, male
    7  [?Phosphohydroxylysinuria], 615011 (3)     Phosphohydroxylysinuria       615011
    8  [Acetylation, slow], 243400 (3)            Acetylation, slow             243400
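
The pattern from this answer can also be checked end to end with a small script. The sketch below rebuilds the sample rows by hand and assumes the column is named `name`, as in the question. The pattern is split into commented pieces for readability, and the brackets inside the two character classes are escaped, which should not change what they match but avoids the "possible nested set" warning that Python's `re` module can emit for `[[`.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({"name": [
    "You-Hoover-Fong syndrome, 616954 (3)",
    "Yuan-Harel-Lupski syndrome (4)",
    "Zaki syndrome, 619648 (3)",
    "Zimmermann-Laband syndrome 2, 616455 (3)",
    "Zimmermann-Laband syndrome 3, 618658 (3)",
    "[?Birbeck granule deficiency], 613393 (3)",
    "[?Homosexuality, male] (2)",
    "[?Phosphohydroxylysinuria], 615011 (3)",
    "[Acetylation, slow], 243400 (3)",
]})

PATTERN = (
    r"[\[{?|]*"               # any run of leading '[', '{', '?' or '|'
    r"(.*?)"                  # group 1: the label, matched lazily
    r"[\]}]?"                 # optional closing ']' or '}'
    r"(?:,\s+(\d{3,100}))?"   # optional ', <id>' with the id as group 2
    r"\s+\(\d+\)"             # trailing ' (n)'
)

# Two capture groups give two columns; rows without an id get '' instead of NaN.
df[["label", "id"]] = df["name"].str.extract(PATTERN).fillna("")
print(df[["label", "id"]])
```

Because the leading character class is repeated with `*`, the order in which `[` and `?` appear no longer matters, which is what the fixed sequence of optional tokens in the original pattern could not handle.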

huangapple
  • Posted on May 21, 2023, 05:09:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76297358.html