How to remove punctuation within a string?

Question

I am cleaning the text in a pandas DataFrame.

This is a tokenized entry from my description column before punctuation is removed:

['dedicated', 'to', 'support', 'the', 'fast-paced', 'technology', 
'lifestyle', 'needs', 'of', 'today', '’', 's', 'modern', 'society', 
'.', 'gadget', 'mix', 'have', 'the', 'benefit', 'of', '“', 
'efficient', 'life', 'â€', 'tied', 'to', 'the', 'products', 'and', 
'services', 'they', 'provide', '.']

This is what the same entry looks like after I applied the code below:

['dedicated', 'to', 'support', 'the', 'fast-paced', 'technology', 
'lifestyle', 'needs', 'of', 'today', '’', 's', 'modern', 'society', 
'gadget', 'mix', 'have', 'the', 'benefit', 'of', '“', 'efficient', 
'life', 'â€', 'tied', 'to', 'the', 'products', 'and', 'services', 
'they', 'provide']

This is my code:

# removing punctuation
import string
punc = string.punctuation
updated_mall['Cleansed_description'] = updated_mall['Cleansed_description'].apply(lambda x: [word for word in x if word not in punc])
updated_mall.head(105)

This code removed the standalone punctuation tokens, but it left tokens such as "Fast-paced", "..." and "restaurant/catering" unchanged.

In addition, after punctuation removal and lowercasing, a word like "Asia's" became the two tokens 'asia' and 's'.

I was told that this code only checks whether an entire token is punctuation, instead of checking each token for punctuation characters inside it.
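
The behaviour described above can be reproduced without pandas. In the sketch below, the `tokens` list is a made-up sample rather than the real column. Because `punc` is a single string, `word not in punc` is a substring test, so a token is dropped only if it appears verbatim inside `string.punctuation`:

```python
import string

punc = string.punctuation  # one long string of ASCII punctuation characters

# Hypothetical sample tokens, chosen to mirror the cases in the question.
tokens = ['fast-paced', '.', '...', 'restaurant/catering', '’', 'society']

# `word not in punc` asks "is `word` a substring of `punc`?",
# so only tokens that occur verbatim inside `punc` are dropped.
kept = [word for word in tokens if word not in punc]
print(kept)
# ['fast-paced', '...', 'restaurant/catering', '’', 'society']
```

Only '.' is removed here. The curly apostrophe '’' survives as well, because `string.punctuation` contains ASCII characters only.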

Answer 1

Score: 1

You can try the code below, which uses a regular expression:

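import re

# lowercase each token and replace every character that is not a word character or whitespace with a space
updated_mall['Cleansed_description'] = updated_mall['Cleansed_description'].apply(lambda x: [re.sub(r'[^\w\d\s]', ' ', word.lower()) for word in x])

updated_mall.head(105)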
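
Note that substituting a space turns 'fast-paced' into 'fast paced' and a lone '.' into ' ', so whitespace-only tokens stay in the list. A possible variation, sketched here as an assumption rather than as part of the answer above, deletes the punctuation characters instead and then drops tokens that end up empty. The helper name `strip_punct` and the sample tokens are hypothetical:

```python
import re

# Matches any character that is neither a word character nor whitespace.
# For str patterns in Python 3, \w is Unicode-aware, so marks such as ’ match too.
PUNCT_RE = re.compile(r'[^\w\s]')

def strip_punct(tokens):
    """Lowercase each token, delete punctuation characters, drop empty results."""
    cleaned = (PUNCT_RE.sub('', tok.lower()) for tok in tokens)
    return [tok for tok in cleaned if tok]

# Hypothetical sample tokens mirroring the cases in the question.
sample = ['Fast-paced', '...', 'restaurant/catering', 'today', '’', 's', '.']
print(strip_punct(sample))
# ['fastpaced', 'restaurantcatering', 'today', 's']
```

With the DataFrame from the question, the same function could be passed to `.apply(strip_punct)` on the 'Cleansed_description' column. Underscores are kept, because `\w` treats them as word characters. Whether 'fast-paced' should become 'fastpaced' or be split into two words depends on the downstream analysis.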

huangapple
  • This article was posted on February 8, 2023, 20:02:08
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75385527.html