How to tokenize a list in Python without getting extra spaces and commas


Question

import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'title': ['amd ryzen 7 5800x cpu processor', 'amd ryzen 8 5200x cpu processor',
                             'amd ryzen 5 2400x cpu processor', 'amd ryzen computer accessories for processor',
                             'amd ryzen cpu processor for gamers'],
                   'gen_key': ['amd, ryzen, processor, cpu, gamer', 'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer', 'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer'],
                   'elas_key': ['ryzen-7, best processor for processing, sale now for christmas gift',
                                'ryzen-8, GAMER, best processor for processing, sale now for christmas gift',
                                'ryzen-5, best processor for gamers, sale now for christmas gift, amd',
                                'ryzen accessories, gamers:, headsets, pro; players best, hurry up to avail promotion',
                                'processor, RYZEN, gamers best world, available on sale']})

So this is my dataframe. I am trying to preprocess it so that the final "elas_key" ends up as a lowercase set of keywords, with stopwords, specific punctuation marks, certain objective claims, plural nouns, duplicates from "gen_key" and "title", and org names not present in the title all removed. I have already processed some of these steps, but I am stuck at tokenization, because I keep getting extra spaces and commas when tokenizing the list:

import re

from nltk import word_tokenize
from nltk.corpus import stopwords

def lower_case(new_keys):
    lower = list(w.lower() for w in new_keys)
    return lower

stop = stopwords.words('english')
other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
stop += other_claims

def stopwords_removal(new_keys):
    stop_removed = [' '.join([word for word in x.split() if word not in stop]) for x in new_keys]
    return stop_removed

def remove_specific_punkt(new_keys):
    punkt = list(filter(None, [re.sub(r'[;:-]', r'', i) for i in new_keys]))
    return punkt

df['elas_key'] = df['elas_key'].apply(remove_specific_punkt)
df

After the removal of the punctuation marks, I get the following table (named List1):

[List1: screenshot of the dataframe after punctuation removal]

But when I then run the tokenization script, I get a list of lists with added commas and spaces. I have tried using strip() and replace() to remove them, but nothing gives me the expected result.

def word_tokenizing(new_keys):
  tokenized_words = [word_tokenize(i) for i in new_keys]
  return tokenized_words
df['elas_key'] = df['elas_key'].apply(word_tokenizing)
df

The table is as follows (named List2):

[List2: screenshot of the tokenized dataframe with the extra commas and spaces]
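For reference, the stray commas most likely come from word_tokenize itself: it treats punctuation such as "," as its own token, so any string that still contains commas will produce "," entries, which is probably what is happening here. A minimal standalone sketch (the phrase below is just an example):

from nltk import word_tokenize

phrase = 'processor, ryzen, gamers best world'
print(word_tokenize(phrase))
# ['processor', ',', 'ryzen', ',', 'gamers', 'best', 'world']

# Splitting on ', ' first keeps the commas out of the token list:
print([tok for part in phrase.split(', ') for tok in word_tokenize(part)])
# ['processor', 'ryzen', 'gamers', 'best', 'world']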

Can someone please help me out with this?
Also, after removing stopwords, I am getting some rows like this:

[processor, ryzen, gamers world,]

The actual list was:

[processor, ryzen, gamers best world, available on sale]

Words like "available", "on", and "sale" were either stopwords or other_claims; they do get removed, but I end up with an additional "," at the end.
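The trailing "," is most likely an empty string left behind: every word in the phrase "available on sale" gets removed, so that element becomes '' instead of disappearing, and the empty element shows up as a stray comma in the cell display. Filtering out empty strings after the stopword pass should fix it. A minimal sketch (the small stop set is just a stand-in for the real stopword and other_claims lists):

stop = {'best', 'available', 'on', 'sale'}  # stand-in for stopwords + other_claims

keys = ['processor', 'ryzen', 'gamers best world', 'available on sale']
cleaned = [' '.join(w for w in phrase.split() if w not in stop) for phrase in keys]
print(cleaned)  # ['processor', 'ryzen', 'gamers world', ''] -> the '' is the trailing comma

cleaned = [phrase for phrase in cleaned if phrase]  # drop the emptied phrases
print(cleaned)  # ['processor', 'ryzen', 'gamers world']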

My expected output should look something like this after removing stopwords, punctuation and other_claims:

[[ryzen,7, processor,processing]]
[[ryzen,8, gamer, processor,processing]]
[[ryzen,5, processor,gamers, amd]]
[[ryzen,accessories, gamers, headsets, pro,players]]
[[processor, ryzen, gamers,world]]

For example, "ryzen7" was a single word, and in the expected output it becomes "ryzen, 7" (a sketch of one way to get that split follows the example rows below).
I am able to do it if the keywords are in multiple rows like:

[ryzen, 7]
[processor, processing]
[gamers, world]

So that it will be easier for me to pos_tag them
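For what it's worth, one way to get that kind of split is to replace the punctuation with a space instead of deleting it, working from the original hyphenated form, so "ryzen-7" becomes "ryzen 7" and then tokenizes into two words. A rough sketch (the phrase below is just an example):

import re

from nltk import word_tokenize

phrase = 'ryzen-7, best processor for processing'
parts = [re.sub(r'[;:-]', ' ', p).strip() for p in phrase.split(', ')]
tokens = [tok for part in parts for tok in word_tokenize(part)]
print(tokens)  # ['ryzen', '7', 'best', 'processor', 'for', 'processing']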

Apologies if the question is confusing; I am still at the learning stage.

Answer 1

Score: 0


You could try the following:

from nltk import word_tokenize
from nltk.corpus import stopwords

other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
STOPS = set(stopwords.words('english') + other_claims)

def remove_stops(words):
    # Return the filtered list, or None if every word was a stopword/claim,
    # so that fully emptied phrases can be dropped with dropna() below.
    if (words := [word for word in words if word not in STOPS]):
        return words

def word_tokenizing(words):
    # Tokenize each phrase and flatten the result into one token list per row.
    return [token for word in words for token in word_tokenize(word)]

df['elas_key'] = (
    df['elas_key'].str.lower()                            # lowercase everything
    .str.split(', ').explode()                            # one comma-separated phrase per row
    .str.replace(r'[;:-]', r' ', regex=True).str.strip()  # punctuation -> space, e.g. 'ryzen-7' -> 'ryzen 7'
    .str.split().map(remove_stops).dropna().str.join(' ') # drop stopwords/claims and emptied phrases
    .groupby(level=0).agg(list)                           # fold the phrases back into one list per original row
    .map(word_tokenizing)                                 # final token list per row
)

Result for your sample dataframe (only column elas_key):

                                                    elas_key  
0         [ryzen, 7, processor, processing, christmas, gift]  
1  [ryzen, 8, gamer, processor, processing, christmas, gift]  
2        [ryzen, 5, processor, gamers, christmas, gift, amd]  
3       [ryzen, accessories, gamers, headsets, pro, players]  
4                          [processor, ryzen, gamers, world]
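Since the stated end goal is to pos_tag the keywords, here is a hedged follow-up sketch of how the resulting lists could be fed to nltk.pos_tag (the column name elas_pos is made up for illustration, and the tagger needs the averaged_perceptron_tagger NLTK data to be downloaded):

from nltk import pos_tag

# Tag each row's token list; each cell becomes a list of (token, POS-tag) pairs.
df['elas_pos'] = df['elas_key'].map(pos_tag)
print(df.loc[4, 'elas_pos'])
# Something like: [('processor', 'NN'), ('ryzen', 'NN'), ('gamers', 'NNS'), ('world', 'NN')]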

huangapple
  • Published on 2023-06-26 00:32:12
  • Please keep the original link when reposting: https://go.coder-hub.com/76551430.html