Regex that removes whitespaces between two specific characters

Question


In PySpark I have the following expression:

df.withColumn('new_descriptions',lower(regexp_replace('descriptions',r"\t+",'')))

This removes tab characters and lowercases my descriptions column.

Here are some sample values from my descriptions column:

['banha frimesa 450 gr','manteiga com sal tourinho pote 200 g','acucar refinado caravelas pacote 1kg',
'acucar refinado light uniao fit pacote 500g','farinha de trigo especial 101 5kg']

What I want to do is remove the whitespace between a value and its unit.
For example, I want banha frimesa 450 gr to become banha frimesa 450gr.

But I also need to avoid removing the whitespace between a number and a following number that carries a unit.

For example, farinha de trigo especial 101 5kg should stay the same.

What kind of regex should I use to remove only the whitespace between a value and its unit (kg, ml, l, g)?

Expected result:

['banha frimesa 450gr','manteiga com sal tourinho pote 200g','acucar refinado caravelas pacote 1kg',
    'acucar refinado light uniao fit pacote 500g','farinha de trigo especial 101 5kg']

Answer 1

Score: 2

You could replace whitespace that is preceded by a digit and followed by a letter. To be stricter, you could list the allowed units in the lookahead instead of matching any letter.

r'(?<=\d)\s+(?=[a-zA-Z])'
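A minimal sketch of both variants, using Python's re module on the sample data from the question. The stricter pattern, the names ANY_LETTER_GAP, UNIT_GAP and strip_gap, and the inclusion of "gr" as a unit are assumptions made for illustration; they are not part of the answer above.

```python
import re

# Pattern from the answer: whitespace between a digit and any letter.
ANY_LETTER_GAP = r"(?<=\d)\s+(?=[a-zA-Z])"

# Stricter variant (an assumption, not from the answer): the lookahead only
# accepts a known unit followed by a word boundary. "gr" is included because
# it appears in the sample data, although the question lists kg, ml, l, g.
UNIT_GAP = r"(?<=\d)\s+(?=(?:kg|gr|ml|g|l)\b)"


def strip_gap(text: str, pattern: str = UNIT_GAP) -> str:
    """Lowercase the text, then delete the whitespace matched by `pattern`."""
    return re.sub(pattern, "", text.lower())


samples = [
    "banha frimesa 450 gr",
    "manteiga com sal tourinho pote 200 g",
    "acucar refinado caravelas pacote 1kg",
    "acucar refinado light uniao fit pacote 500g",
    "farinha de trigo especial 101 5kg",
]

cleaned = [strip_gap(s) for s in samples]
for line in cleaned:
    print(line)
```

Both patterns leave 101 5kg alone, because the whitespace there is followed by a digit rather than a letter. They differ on a digit followed by an ordinary word: the any-letter pattern would also join a hypothetical value such as "leite 2 caixas", while the unit-specific one would not. Spark's regexp_replace uses Java regular expressions, which also support lookbehind and lookahead, so the same pattern string should be usable as its second argument.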

huangapple
  • Posted on May 14, 2023
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76243849.html