Polars .str.replace with expression or .str.split with regex

Question

I have this dataframe:

import polars as pl

sample = pl.DataFrame({"equip": ['AmuletsMedals', 'Guns, CrossbowsOff-Hands', 'Melee WeaponsShieldsOff-Hands',
     'All Armor', 'Chest Armor', 'Shields', 'All WeaponsShieldsOff-Hands']})
print(sample)

shape: (7, 1)
┌───────────────────────────────┐
│ equip                         │
│ ---                           │
│ str                           │
╞═══════════════════════════════╡
│ AmuletsMedals                 │
│ Guns, CrossbowsOff-Hands      │
│ Melee WeaponsShieldsOff-Hands │
│ All Armor                     │
│ Chest Armor                   │
│ Shields                       │
│ All WeaponsShieldsOff-Hands   │
└───────────────────────────────┘

My aim is to put a comma between words:

answer = pl.DataFrame({"equip": ['Amulets, Medals', 'Guns, Crossbows, Off-Hands', 'Melee Weapons, Shields, Off-Hands',
 'All Armor', 'Chest Armor', 'Shields', 'All Weapons, Shields, Off-Hands']})
print(answer)
shape: (7, 1)
┌─────────────────────────────────────┐
│ equip                               │
│ ---                                 │
│ str                                 │
╞═════════════════════════════════════╡
│ Amulets, Medals                     │
│ Guns, Crossbows, Off-Hands          │
│ Melee Weapons, Shields, Off-Hand... │
│ All Armor                           │
│ Chest Armor                         │
│ Shields                             │
│ All Weapons, Shields, Off-Hands     │
└─────────────────────────────────────┘

I tried replace, but it did not accept an expression as the replacement:

sample.with_columns(pl.col("equip").str.replace("[a-z][A-Z]", "[a-z], [A-Z]"))

I also tried a tip I found on the Polars GitHub, but at every match it cuts off the last letter of the preceding word and the first letter of the following word, as this does:

sample.with_columns(pl.col("equip").str.replace("[a-z][A-Z]", ", "))

Any ideas?

Bonus question:
I imagine the answer for the simple case would also solve the harder case, but in case it does not, here is the hard case:

I have another column that needs a slightly harder regex pattern than "[a-z][A-Z]", probably something like "[a-z][A-Z]|[a-z]+|[a-z][1-9]" (I have not worked out the exact regex yet). The aim is again just to put a comma between attributes:

sample2 = pl.DataFrame({"attributes": ['+10% Aether Damage+30 Defensive Ability16% Aether Resistance6% Less Damage from Aetherials6% Less Damage from Aether Corruptions',
     '4-6 Aether Damage+25% Aether Damage10% Physical Damage converted to Aether DamageAether Tendril (Granted by Item)',
     '2-8 Lightning Damage+25% Lightning Damage+25% Electrocute Damage10% Physical Damage converted to Lightning DamageEmpowered Lightning Nova (Granted by Item)',
     '+10 Health Regenerated per Second+24 Armor20% Poison & Acid Resistance',
     '+22 Defensive Ability10% Chance to Avoid Projectiles+18 Armor',
     '+15 Physique+10% Shield Block ChanceShield Slam (Granted by Item)',
     '+10% Chaos Damage+30 Defensive Ability16% Chaos Resistance6% Less Damage from Chthonics']})

Answer 1

Score: 1

You can use capture groups in your pattern:

sample.with_columns(pl.col("equip").str.replace_all(r"([a-z])([A-Z])", "$1, $2"))
shape: (7, 1)
┌─────────────────────────────────────┐
│ equip                               │
│ ---                                 │
│ str                                 │
╞═════════════════════════════════════╡
│ Amulets, Medals                     │
│ Guns, Crossbows, Off-Hands          │
│ Melee Weapons, Shields, Off-Hand... │
│ All Armor                           │
│ Chest Armor                         │
│ Shields                             │
│ All Weapons, Shields, Off-Hands     │
└─────────────────────────────────────┘

You may also want to use the Unicode classes \p{lower} and \p{upper} instead, so that letters outside ASCII are matched as well.

Polars uses the regex syntax of the Rust regex crate: https://docs.rs/regex/latest/regex/

Posted by huangapple on April 6, 2023, 22:00:54
Source: https://go.coder-hub.com/75950392.html