How to split a column of defined strings written without spaces in pandas, e.g. appleorange to apple orange?


Question


I am trying to write Python code that splits the values of a column in a pandas dataframe. The column holds values such as appleorangemango, which I want to split into apple orange mango. The split is driven by a large set of known unique words.

Assume that I have this dataframe called unique_fruits:

unique_fruits
mango
apple
orange
apricot
peach

And another dataframe of fruits written without spaces, called my_fruits:

my_fruits
mango
apricotapple
orangemango peach
banana

Please note that banana is not in the unique_fruits dataframe. Also, a value sometimes already contains spaces, as in orangemango peach. Finally, a value can be a single fruit or blank, as in the first and second rows of my_fruits.

I am planning to read an Excel file into a dataframe and then look for patterns I can split on. Whenever something new turns up, I will get a list of unknown words. I will manually add the split versions of those unknown words and repeat until the result is perfect or almost perfect.

An example of an unknown value is bananastrawberry. Both banana and strawberry are new unknown words that I will add to the unique_fruits dataframe before re-running the code.

If pine, apple and pineapple are all in unique_fruits, then I prefer the result pineapple. The value should be split into pine apple only if pineapple is not in unique_fruits.

Answer 1

Score: 2


You can build a regex from your unique_fruits, with the elements sorted by length in descending order so that longer fruits come first in the alternation (pineapple is then matched in preference to pine and apple). Use that regex to split the strings in my_fruits, then join the pieces back together with a space:

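import re
import pandas as pd

# Example data from the question: df1 holds the known words, df the strings to split
df1 = pd.DataFrame({'unique_fruits': ['mango', 'apple', 'orange', 'apricot', 'peach']})
df = pd.DataFrame({'my_fruits': ['mango', 'apricotapple', 'orangemango peach', 'banana']})

# Longest words first, so a longer word wins over any shorter word it starts with
uf = df1['unique_fruits'].to_list()
uf.sort(key=len, reverse=True)
# ['apricot', 'orange', 'mango', 'apple', 'peach']

# The capturing group keeps matched fruits in the split result; whitespace is dropped
regex = r'(' + '|'.join(map(re.escape, uf)) + r')|\s+'
# '(apricot|orange|mango|apple|peach)|\\s+'

# filter(None, ...) removes the empty strings and None entries that re.split produces
df['my_fruits'] = df['my_fruits'].apply(lambda s: ' '.join(filter(None, re.split(regex, s))))
#             my_fruits
# 0               mango
# 1       apricot apple
# 2  orange mango peach
# 3              banana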
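
The question also mentions blank cells and a list of unknown words to review by hand, neither of which the snippet above handles (re.split raises a TypeError on a non-string value such as NaN). The following is a minimal sketch of one way to cover both, not part of the original answer. The helper names build_splitter and split_value and the sample cells are hypothetical, and the word list assumes pine and pineapple have been added to the question's fruits.

```python
import re


def build_splitter(words):
    """Compile an alternation of the known words, longest first."""
    ordered = sorted(set(words), key=len, reverse=True)
    return re.compile('(' + '|'.join(map(re.escape, ordered)) + r')|\s+')


def split_value(value, splitter, known):
    """Return (spaced_string, unknown_tokens) for one cell.

    Non-string or blank cells (for example NaN from an empty Excel cell)
    are returned unchanged with no unknown tokens.
    """
    if not isinstance(value, str) or not value.strip():
        return value, []
    # Drop the empty strings and None entries produced by re.split
    tokens = [t for t in splitter.split(value) if t]
    # Any piece that is not a known word is left over for manual review
    unknown = [t for t in tokens if t not in known]
    return ' '.join(tokens), unknown


# Hypothetical word list: the question's fruits plus pine and pineapple
known = {'mango', 'apple', 'orange', 'apricot', 'peach', 'pine', 'pineapple'}
splitter = build_splitter(known)

# Hypothetical sample cells, including a blank one
cells = ['mango', None, 'pineapple', 'orangemango peach',
         'bananaapple', 'bananastrawberry']
results = [split_value(c, splitter, known) for c in cells]

# Unknown tokens to review by hand before extending the word list
all_unknown = sorted({t for _, toks in results for t in toks})
```

On a dataframe, split_value could be called per cell through the same Series.apply call used above, for example df['my_fruits'].apply(lambda v: split_value(v, splitter, known)). Anything that ends up in all_unknown is a candidate to split manually and add to unique_fruits before the next run.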

huangapple
  • This article was published on May 7, 2023 at 04:06:12
  • If you repost this article, please keep this link: https://go.coder-hub.com/76190897.html