How to tokenize the list without getting extra spaces and commas (Python)
Question
import pandas as pd

df = pd.DataFrame({'id' : ['a','b','c','d','e'],
'title' : ['amd ryzen 7 5800x cpu processor', 'amd ryzen 8 5200x cpu processor','amd ryzen 5 2400x cpu processor',
'amd ryzen computer accessories for processor','amd ryzen cpu processor for gamers'],
'gen_key' : ['amd, ryzen, processor, cpu, gamer','amd, ryzen, processor, cpu, gamer','amd, ryzen, processor, cpu, gamer',
'amd, ryzen, processor, cpu, gamer','amd, ryzen, processor, cpu, gamer'],
'elas_key' : ['ryzen-7, best processor for processing, sale now for christmas gift',
'ryzen-8, GAMER, best processor for processing, sale now for christmas gift',
'ryzen-5, best processor for gamers, sale now for christmas gift, amd',
'ryzen accessories, gamers:, headsets, pro; players best, hurry up to avail promotion',
'processor, RYZEN, gamers best world, available on sale']})
So this is my dataframe. I am trying to preprocess it and get the final "elas_key" as a lowercase set without stopwords, specific punctuation marks, certain objective claims, plural nouns, duplicates from "gen_key" and "title", and org names that are not in the title. I have already processed some of these steps, but I am stuck at tokenization: I am getting extra spaces and commas when tokenizing the list:
import re
from nltk.corpus import stopwords

def lower_case(new_keys):
    lower = list(w.lower() for w in new_keys)
    return lower

stop = stopwords.words('english')
other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
stop += other_claims

def stopwords_removal(new_keys):
    stop_removed = [' '.join([word for word in x.split() if word not in stop]) for x in new_keys]
    return stop_removed

def remove_specific_punkt(new_keys):
    punkt = list(filter(None, [re.sub(r'[;:-]', r'', i) for i in new_keys]))
    return punkt

df['elas_key'] = df['elas_key'].apply(remove_specific_punkt)
df
After removing the punctuation marks, I get the following table (named List1).
But then, when I run the tokenization script, I get a list of lists with added commas and spaces. I have tried using strip() and replace() to remove them, but nothing gives me the expected result:
from nltk import word_tokenize

def word_tokenizing(new_keys):
    tokenized_words = [word_tokenize(i) for i in new_keys]
    return tokenized_words

df['elas_key'] = df['elas_key'].apply(word_tokenizing)
df
The table is as follows (named List2).
Can someone please help me out with this?
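Here is a minimal check of what I think is producing the commas (just a sketch on one raw string from my data, assuming NLTK's default tokenizer): word_tokenize keeps ',' as its own token, so any string that still contains commas ends up with ',' entries in the token list.

from nltk import word_tokenize

# the hyphen is already stripped by remove_specific_punkt, but the commas are not
text = 'ryzen7, best processor for processing, sale now for christmas gift'
print(word_tokenize(text))
# ['ryzen7', ',', 'best', 'processor', 'for', 'processing', ',', 'sale', 'now', 'for', 'christmas', 'gift']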
Also after removing stopwords, I am getting some of the rows like this:
[processor, ryzen, gamers world,]
The actual list was:
[processor, ryzen, gamers best world, available on sale]
The words "available", "on", and "sale" were either stopwords or other_claims, and even though those words are getting removed, I am still getting an additional "," at the end.
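For what it's worth, here is a small sketch of what I think happens on that row with my stopwords_removal logic (assuming the NLTK English stopword list plus my other_claims):

from nltk.corpus import stopwords

stop = stopwords.words('english')
stop += ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']

row = ['processor', 'ryzen', 'gamers best world', 'available on sale']
cleaned = [' '.join(word for word in x.split() if word not in stop) for x in row]
print(cleaned)  # ['processor', 'ryzen', 'gamers world', '']

Every word of "available on sale" is filtered out, so an empty string '' is left in the list, and that empty string is what shows up as the extra trailing comma.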
My expected output should look something like this after removing stopwords, punctuation and other_claims:
[[ryzen,7, processor,processing]]
[[ryzen,8, gamer, processor,processing]]
[[ryzen,5, processor,gamers, amd]]
[[ryzen,accessories, gamers, headsets, pro,players]]
[[processor, ryzen, gamers,world]]
For example, ryzen7 was one word and it became ryzen, 7.
I am able to do it if the keywords are in multiple rows like:
[ryzen, 7]
[processor, processing]
[gamers, world]
So that it will be easier for me to pos_tag them
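Just to show where I want to end up, here is roughly the tagging step I have in mind (a sketch; the exact tags depend on the tagger model that is installed):

from nltk import pos_tag

# assumes nltk's averaged_perceptron_tagger data is downloaded
print(pos_tag(['ryzen', '7']))               # e.g. [('ryzen', 'NN'), ('7', 'CD')]
print(pos_tag(['processor', 'processing']))  # e.g. [('processor', 'NN'), ('processing', 'NN')]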
Apologies if the question is too confusing; I am still in the learning stage.
Answer 1
Score: 0
You could try the following:
from nltk import word_tokenize
from nltk.corpus import stopwords

other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
STOPS = set(stopwords.words('english') + other_claims)

def remove_stops(words):
    # returns None (implicitly) when every word is a stopword, so .dropna() can discard the phrase
    if (words := [word for word in words if word not in STOPS]):
        return words

def word_tokenizing(words):
    return [token for word in words for token in word_tokenize(word)]

df['elas_key'] = (
    df['elas_key'].str.lower()
    .str.split(', ').explode()
    .str.replace(r'[;:-]', r' ', regex=True).str.strip()
    .str.split().map(remove_stops).dropna().str.join(' ')
    .groupby(level=0).agg(list)
    .map(word_tokenizing)
)
Result for your sample dataframe (only column elas_key):
elas_key
0 [ryzen, 7, processor, processing, christmas, gift]
1 [ryzen, 8, gamer, processor, processing, christmas, gift]
2 [ryzen, 5, processor, gamers, christmas, gift, amd]
3 [ryzen, accessories, gamers, headsets, pro, players]
4 [processor, ryzen, gamers, world]
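If it helps to see why the stray commas and empty strings disappear, here is a rough single-row walkthrough of the chain (a sketch that reuses remove_stops and word_tokenizing from above; the real code runs on the whole column at once):

import pandas as pd

s = pd.Series(['processor, RYZEN, gamers best world, available on sale'])
step = s.str.lower().str.split(', ').explode()                  # one keyword phrase per row
step = step.str.replace(r'[;:-]', ' ', regex=True).str.strip()  # drop the specific punctuation
step = step.str.split().map(remove_stops).dropna()              # 'available on sale' becomes None and is dropped,
                                                                # so no empty string (and no trailing comma) is left
step = step.str.join(' ').groupby(level=0).agg(list)            # collapse back to one list per original row
print(step.map(word_tokenizing).iloc[0])                        # ['processor', 'ryzen', 'gamers', 'world']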