How to get a name with random prefixes and suffixes

Question

I have a fairly unusual problem and no idea where to start. I'm using Python.

I'm trying to pull a lot of item data from two APIs, and the two APIs identify items in different ways: one by name, the other by ID.

A name looks something like: Helmet of Divan

An ID looks like: DIVAN_HELMET

Connecting the two in a dictionary is easy. My problem is that names sometimes carry prefixes and suffixes, such as:

Wise Helmet of Divan or Clean Helmet of Divan, or even Unicode characters, as in ✪ Helmet of Divan ✪.

I want to get the ID DIVAN_HELMET from these names, but I can't know in advance how long a prefix is, or even whether there is a prefix or suffix at all. I need to do this in bulk for over 3,000 items with dozens of different suffixes and prefixes.

Answer 1

Score: 0

So you want to get output like `DIVAN_HELMET`

from inputs like `Wise Helmet of Divan`, `Clean Helmet of Divan`, or `✪ Helmet of Divan ✪`.

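First, remove all non-ASCII characters, e.g. as in [this answer](https://stackoverflow.com/a/8689826/3706717):

import string
str_input = "✪ Helmet of Divan ✪"  # example input
printable = set(string.printable)  # printable ASCII characters only
str_input = ''.join(filter(lambda x: x in printable, str_input))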
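If some names contain accented letters that you would rather keep as plain letters than delete, one possible variant is Unicode normalization followed by an ASCII encode. This is a swapped-in technique, not what the linked answer does, and the helper name `to_ascii` is made up for illustration:

```python
import unicodedata

def to_ascii(text: str) -> str:
    """Hypothetical helper: fold accents, then drop anything still non-ASCII."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" silently drops the marks and symbols like ✪.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("✪ Helmet of Divan ✪").strip())  # Helmet of Divan
print(to_ascii("Café"))                          # Cafe
```

Unlike the `string.printable` filter, this keeps every ASCII character, including control characters, so it is only a sketch of the idea.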

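Then convert everything to lowercase: `str_input = str_input.lower()`

Next, [tokenize](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) the input. The simplest way is to split it on whitespace, e.g. `arr_str_input = str_input.split()`. Calling `split()` with no argument, rather than `split(" ")`, avoids empty tokens where a removed symbol left a leading or trailing space.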
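A quick check of how the two ways of splitting behave on a name whose ✪ characters were already stripped. The variable name `cleaned` is just for illustration:

```python
# What remains of "✪ Helmet of Divan ✪" after dropping ✪ and lowercasing.
cleaned = " helmet of divan "

# Splitting on a single space keeps empty strings at both ends.
print(cleaned.split(" "))  # ['', 'helmet', 'of', 'divan', '']

# Splitting with no argument collapses runs of whitespace and trims the ends.
print(cleaned.split())     # ['helmet', 'of', 'divan']
```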

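Then remove [stopwords](https://en.wikipedia.org/wiki/Stop_word) such as 'of' or 'the'. You can use a publicly available stopword list such as [this one](https://github.com/stopwords-iso/stopwords-en/blob/master/stopwords-en.txt), or, if 'of' is the only stopword in your input, hard-code its removal, e.g. `arr_str_input.remove("of")`. Note that `list.remove` raises `ValueError` when the word is absent and removes only the first occurrence, so guard it with `if "of" in arr_str_input` for names that contain no 'of'.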
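A minimal sketch of stopword removal that does not fail on names without a stopword. The set `STOPWORDS` and the function name `drop_stopwords` are assumptions for illustration, so swap in whatever list fits your data:

```python
# Assumed stopword set; extend it or load a published list instead.
STOPWORDS = {"of", "the"}

def drop_stopwords(tokens):
    """Return the tokens that are not stopwords, keeping their order."""
    return [t for t in tokens if t not in STOPWORDS]

print(drop_stopwords(["helmet", "of", "divan"]))  # ['helmet', 'divan']
print(drop_stopwords(["divan", "helmet"]))        # ['divan', 'helmet']
```

A set lookup also stays fast if you later replace the two-word set with a full stopword list.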

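Next, remove the prefixes and suffixes. For this step you can supply the list of all prefixes/suffixes yourself, or use a ready-made list such as [this one](https://github.com/Rayraegah/adjectives) (be careful, since that list can be very large).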
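One way this step could look, assuming you have collected the modifier words yourself. The words in `MODIFIERS` are only the two from the question, and `strip_affixes` is a made-up name. The sketch trims modifiers only from the two ends of the token list, so a word in the middle of a name is left alone:

```python
# Hypothetical modifier list; the real one has to be built from your item data.
MODIFIERS = {"wise", "clean"}

def strip_affixes(tokens, modifiers=MODIFIERS):
    """Drop modifier words from the start and the end of a token list."""
    start, end = 0, len(tokens)
    # Skip leading modifiers (prefixes).
    while start < end and tokens[start] in modifiers:
        start += 1
    # Skip trailing modifiers (suffixes).
    while end > start and tokens[end - 1] in modifiers:
        end -= 1
    return tokens[start:end]

print(strip_affixes(["wise", "helmet", "divan"]))  # ['helmet', 'divan']
print(strip_affixes(["helmet", "divan"]))          # ['helmet', 'divan']
```

With a large generic adjective list there is a risk that a word belonging to the real item name is also in the list, so a hand-built list is probably safer for a few dozen modifiers.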

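After all these steps you should be left with a list of just two words, e.g. `['helmet', 'divan']`. The last step is to put them in the right order and convert them to uppercase, e.g.:

result = ['helmet', 'divan']
result.reverse()  # 'Helmet of Divan' becomes DIVAN, then HELMET
print('_'.join(result).upper())
# prints DIVAN_HELMET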
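Putting the steps together, a rough end-to-end sketch might look like the following. It rests on assumptions: the function name `name_to_id` and the `MODIFIERS` set are invented for illustration, and instead of deleting 'of' and reversing the list, it moves the words after 'of' in front of the words before it, which gives the same result for names shaped like 'X of Y' and leaves names without 'of' in their original order:

```python
import string

PRINTABLE = set(string.printable)
# Hypothetical modifier list; fill it with the prefixes/suffixes in your data.
MODIFIERS = {"wise", "clean"}

def name_to_id(name: str) -> str:
    """Turn a display name such as 'Wise Helmet of Divan' into 'DIVAN_HELMET'."""
    # 1. Drop non-ASCII characters such as ✪.
    ascii_only = "".join(ch for ch in name if ch in PRINTABLE)
    # 2. Lowercase, tokenize on whitespace, and discard modifier words.
    tokens = [t for t in ascii_only.lower().split() if t not in MODIFIERS]
    # 3. For 'X of Y' names, put Y first and drop the 'of'.
    if "of" in tokens:
        i = tokens.index("of")
        tokens = tokens[i + 1:] + tokens[:i]
    # 4. Join with underscores and uppercase.
    return "_".join(tokens).upper()

for name in ["Wise Helmet of Divan", "Clean Helmet of Divan", "✪ Helmet of Divan ✪"]:
    print(name_to_id(name))  # DIVAN_HELMET each time
```

Since you already have the real IDs from the other API, it is probably worth checking every generated ID against that set and reviewing the misses by hand, because not all 3,000 names will necessarily follow the 'X of Y' pattern.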

huangapple
  • Posted on July 23, 2023, 14:48:32
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76746948.html