Regex that removes whitespaces between two specific characters
Question
In pyspark I have the following expression:
df.withColumn('new_descriptions', lower(regexp_replace('descriptions', r"\t+", '')))
It removes tab characters and lowercases my descriptions column.
Here are some sample values from my descriptions column:
['banha frimesa 450 gr','manteiga com sal tourinho pote 200 g','acucar refinado caravelas pacote 1kg',
'acucar refinado light uniao fit pacote 500g','farinha de trigo especial 101 5kg']
What I want is to remove the whitespace between a value and its unit.
For example, banha frimesa 450 gr should become banha frimesa 450gr.
But I also need to avoid removing whitespace between a number and a following number with a unit.
For example, farinha de trigo especial 101 5kg should stay the same.
What regex should I use to remove only the whitespace between a value and its kg, ml, l, or g unit?
Wanted result:
['banha frimesa 450gr','manteiga com sal tourinho pote 200g','acucar refinado caravelas pacote 1kg',
'acucar refinado light uniao fit pacote 500g','farinha de trigo especial 101 5kg']
Answer 1
Score: 2
You could replace whitespace preceded by a digit and followed by a letter (you could also specify all the possible units in the lookahead).
r'(?<=\d)\s+(?=[a-zA-Z])'
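As a rough check of the variant with the units spelled out in the lookahead, here is a minimal sketch using Python's re module on the sample descriptions from the question. The unit list is an assumption: kg, ml, l and g come from the question, and gr is added because it appears in the samples. The names UNIT_GAP and remove_unit_gap are made up for illustration.

```python
import re

# Assumed unit list: kg, ml, l, g from the question, plus "gr" from the samples.
# The trailing \b keeps "g" or "l" from matching the first letter of a longer word.
UNIT_GAP = re.compile(r"(?<=\d)\s+(?=(?:kg|ml|gr|g|l)\b)")


def remove_unit_gap(text: str) -> str:
    """Delete whitespace that sits between a digit and one of the listed units."""
    return UNIT_GAP.sub("", text)


samples = [
    "banha frimesa 450 gr",
    "manteiga com sal tourinho pote 200 g",
    "acucar refinado caravelas pacote 1kg",
    "acucar refinado light uniao fit pacote 500g",
    "farinha de trigo especial 101 5kg",
]

cleaned = [remove_unit_gap(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Spark's regexp_replace uses Java regular expressions, which also support lookbehind and lookahead, so the same pattern string should be usable there, for example regexp_replace('descriptions', r"(?<=\d)\s+(?=(?:kg|ml|gr|g|l)\b)", ''), though that call has not been run here. Compared with the [a-zA-Z] lookahead, listing the units leaves a value followed by an ordinary word untouched.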