How to tokenize a list in Python without getting extra spaces and commas


Question

import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'title': ['amd ryzen 7 5800x cpu processor', 'amd ryzen 8 5200x cpu processor',
                             'amd ryzen 5 2400x cpu processor', 'amd ryzen computer accessories for processor',
                             'amd ryzen cpu processor for gamers'],
                   'gen_key': ['amd, ryzen, processor, cpu, gamer', 'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer', 'amd, ryzen, processor, cpu, gamer',
                               'amd, ryzen, processor, cpu, gamer'],
                   'elas_key': ['ryzen-7, best processor for processing, sale now for christmas gift',
                                'ryzen-8, GAMER, best processor for processing, sale now for christmas gift',
                                'ryzen-5, best processor for gamers, sale now for christmas gift, amd',
                                'ryzen accessories, gamers:, headsets, pro; players best, hurry up to avail promotion',
                                'processor, RYZEN, gamers best world, available on sale']})

So this is my dataframe. I am trying to preprocess it so that the final "elas_key" ends up as a lowercase set of keywords, with stopwords, specific punctuation marks, certain objective claims, plural nouns, duplicates from "gen_key" and "title", and org names not present in the title all removed. I have already processed some of these steps, but I am stuck at tokenization, because I keep getting extra spaces and commas when tokenizing the list:

import re

from nltk import word_tokenize
from nltk.corpus import stopwords

def lower_case(new_keys):
    lower = list(w.lower() for w in new_keys)
    return lower

stop = stopwords.words('english')
other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
stop += other_claims

def stopwords_removal(new_keys):
    stop_removed = [' '.join([word for word in x.split() if word not in stop]) for x in new_keys]
    return stop_removed

def remove_specific_punkt(new_keys):
    punkt = list(filter(None, [re.sub(r'[;:-]', r'', i) for i in new_keys]))
    return punkt

df['elas_key'] = df['elas_key'].apply(remove_specific_punkt)
df

After the removal of the punctuation marks, I get the following table (named List1):

[List1: screenshot of the dataframe after punctuation removal]

But when I then run the tokenization script, I get a list of lists with added commas and spaces. I have tried using strip() and replace() to remove them, but nothing gives me the expected result.

def word_tokenizing(new_keys):
  tokenized_words = [word_tokenize(i) for i in new_keys]
  return tokenized_words
df['elas_key'] = df['elas_key'].apply(word_tokenizing)
df

The table is as follows (named List2):

[List2: screenshot of the tokenized dataframe with the extra commas and spaces]
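For reference, the stray commas most likely come from word_tokenize itself: it treats punctuation such as "," as its own token, so any string that still contains commas will produce "," entries, which is probably what is happening here. A minimal standalone sketch (the phrase below is just an example):

from nltk import word_tokenize

phrase = 'processor, ryzen, gamers best world'
print(word_tokenize(phrase))
# ['processor', ',', 'ryzen', ',', 'gamers', 'best', 'world']

# Splitting on ', ' first keeps the commas out of the token list:
print([tok for part in phrase.split(', ') for tok in word_tokenize(part)])
# ['processor', 'ryzen', 'gamers', 'best', 'world']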

Can someone please help me out with this?
Also, after removing stopwords, I am getting some rows like this:

[processor, ryzen, gamers world,]

The actual list was:

[processor, ryzen, gamers best world, available on sale]

Words like "available", "on", and "sale" were either stopwords or other_claims; they do get removed, but I end up with an additional "," at the end.
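The trailing "," is most likely an empty string left behind: every word in the phrase "available on sale" gets removed, so that element becomes '' instead of disappearing, and the empty element shows up as a stray comma in the cell display. Filtering out empty strings after the stopword pass should fix it. A minimal sketch (the small stop set is just a stand-in for the real stopword and other_claims lists):

stop = {'best', 'available', 'on', 'sale'}  # stand-in for stopwords + other_claims

keys = ['processor', 'ryzen', 'gamers best world', 'available on sale']
cleaned = [' '.join(w for w in phrase.split() if w not in stop) for phrase in keys]
print(cleaned)  # ['processor', 'ryzen', 'gamers world', ''] -> the '' is the trailing comma

cleaned = [phrase for phrase in cleaned if phrase]  # drop the emptied phrases
print(cleaned)  # ['processor', 'ryzen', 'gamers world']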

My expected output should look something like this after removing stopwords, punctuation and other_claims:

[[ryzen,7, processor,processing]]
[[ryzen,8, gamer, processor,processing]]
[[ryzen,5, processor,gamers, amd]]
[[ryzen,accessories, gamers, headsets, pro,players]]
[[processor, ryzen, gamers,world]]

For example, "ryzen7" was a single word, and in the expected output it becomes "ryzen, 7" (a sketch of one way to get that split follows the example rows below).
I am able to do it if the keywords are in multiple rows like:

[ryzen, 7]
[processor, processing]
[gamers, world]

So that it will be easier for me to pos_tag them
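For what it's worth, one way to get that kind of split is to replace the punctuation with a space instead of deleting it, working from the original hyphenated form, so "ryzen-7" becomes "ryzen 7" and then tokenizes into two words. A rough sketch (the phrase below is just an example):

import re

from nltk import word_tokenize

phrase = 'ryzen-7, best processor for processing'
parts = [re.sub(r'[;:-]', ' ', p).strip() for p in phrase.split(', ')]
tokens = [tok for part in parts for tok in word_tokenize(part)]
print(tokens)  # ['ryzen', '7', 'best', 'processor', 'for', 'processing']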

Apologies if the question is confusing; I am still at the learning stage.

Answer 1

Score: 0


You could try the following:

from nltk import word_tokenize
from nltk.corpus import stopwords

other_claims = ['best', 'sale', 'available', 'avail', 'new', 'hurry', 'promotion']
STOPS = set(stopwords.words('english') + other_claims)

def remove_stops(words):
    # Return the filtered list, or None if every word was a stopword/claim,
    # so that fully emptied phrases can be dropped with dropna() below.
    if (words := [word for word in words if word not in STOPS]):
        return words

def word_tokenizing(words):
    # Tokenize each phrase and flatten the result into one token list per row.
    return [token for word in words for token in word_tokenize(word)]

df['elas_key'] = (
    df['elas_key'].str.lower()                            # lowercase everything
    .str.split(', ').explode()                            # one comma-separated phrase per row
    .str.replace(r'[;:-]', r' ', regex=True).str.strip()  # punctuation -> space, e.g. 'ryzen-7' -> 'ryzen 7'
    .str.split().map(remove_stops).dropna().str.join(' ') # drop stopwords/claims and emptied phrases
    .groupby(level=0).agg(list)                           # fold the phrases back into one list per original row
    .map(word_tokenizing)                                 # final token list per row
)

Result for your sample dataframe (only column elas_key):

                                                    elas_key  
0         [ryzen, 7, processor, processing, christmas, gift]  
1  [ryzen, 8, gamer, processor, processing, christmas, gift]  
2        [ryzen, 5, processor, gamers, christmas, gift, amd]  
3       [ryzen, accessories, gamers, headsets, pro, players]  
4                          [processor, ryzen, gamers, world]
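Since the stated end goal is to pos_tag the keywords, here is a hedged follow-up sketch of how the resulting lists could be fed to nltk.pos_tag (the column name elas_pos is made up for illustration, and the tagger needs the averaged_perceptron_tagger NLTK data to be downloaded):

from nltk import pos_tag

# Tag each row's token list; each cell becomes a list of (token, POS-tag) pairs.
df['elas_pos'] = df['elas_key'].map(pos_tag)
print(df.loc[4, 'elas_pos'])
# Something like: [('processor', 'NN'), ('ryzen', 'NN'), ('gamers', 'NNS'), ('world', 'NN')]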

huangapple
  • Published on 2023-06-26 00:32:12
  • Please keep the original link when reposting: https://go.coder-hub.com/76551430.html