How to split a column of defined strings written without spaces in pandas, e.g. appleorange to apple orange?
Question
I am trying to write Python code that splits the values of a column in a pandas DataFrame. The column holds values such as appleorangemango, which I want to split into apple orange mango. I have a large set of unique words to split against.
Assume I have this DataFrame, called unique_fruits:
unique_fruits |
---|
mango |
apple |
orange |
apricot |
peach |
And another DataFrame, called my_fruits, of fruits written without spaces:
my_fruits |
---|
mango |
apricotapple |
orangemango peach |
banana |
Note that banana is not in the unique_fruits DataFrame. Also, a value can sometimes contain spaces, as in orangemango peach. Finally, a value can be a single fruit or blank, as in the first and second rows of my_fruits.
I plan to read an Excel file into a DataFrame and then look for patterns I can split on. Whenever I find something new, I will get a list of unknown words. I will manually add the split versions of those unknown words and repeat until the result is perfect or nearly perfect.
An example of an unknown word is bananastrawberry. Both banana and strawberry are new unknown words that I will add to the unique_fruits DataFrame before re-running the code.
If pine, apple and pineapple have all been added to unique_fruits, I prefer to keep pineapple whole. I want it split only if pineapple is not in unique_fruits.
Answer 1
Score: 2
You can build a regex from unique_fruits, with the elements sorted by length in descending order so that longer fruits come first (the regex will then match pineapple in preference to pine and apple). Use it to split the strings in my_fruits, then join the pieces back together with a space:
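import re

# df1 holds the unique_fruits column; df holds the my_fruits column
uf = df1['unique_fruits'].to_list()
# longest names first, so the alternation prefers e.g. pineapple over pine
uf.sort(key=len, reverse=True)
# ['apricot', 'orange', 'mango', 'apple', 'peach']
regex = '(' + '|'.join(map(re.escape, uf)) + r')|\s+'
# '(apricot|orange|mango|apple|peach)|\\s+'
# re.split keeps the captured fruit names; filter(None, ...) drops '' and None pieces
df['my_fruits'] = df['my_fruits'].apply(lambda s: ' '.join(filter(None, re.split(regex, s))))
#             my_fruits
# 0               mango
# 1       apricot apple
# 2  orange mango peach
# 3              banana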
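As a rough, self-contained sketch of the same idea, the version below also covers two points raised in the question: blank cells and the list of unknown words to review. The sample frames, the extra `split` column, the `split_fruits` helper, and the choice to treat any non-string cell as blank are all assumptions made for illustration, not part of the original answer.

```python
import re

import pandas as pd

# Sample data modelled on the question; pine/pineapple and the None row are
# added here (as assumptions) to exercise the longest-match and blank cases.
unique_fruits = pd.DataFrame(
    {"unique_fruits": ["mango", "apple", "orange", "apricot", "peach", "pine", "pineapple"]}
)
my_fruits = pd.DataFrame(
    {"my_fruits": ["mango", "apricotapple", "orangemango peach", "banana", "pineapple", None]}
)

# Longest words first, so "pineapple" is tried before "pine" and "apple".
known = sorted(unique_fruits["unique_fruits"], key=len, reverse=True)
pattern = re.compile("(" + "|".join(map(re.escape, known)) + r")|\s+")


def split_fruits(value):
    """Return the pieces of one cell as a list (hypothetical helper)."""
    if not isinstance(value, str):  # blank cells usually arrive as NaN/None
        return []
    # split() returns the captured known words plus any leftover text;
    # empty strings and None (from whitespace matches) are dropped.
    return [piece for piece in pattern.split(value) if piece]


tokens = my_fruits["my_fruits"].apply(split_fruits)
my_fruits["split"] = tokens.apply(" ".join)

# Pieces that are not known words are candidates for manual review.
known_set = set(known)
unknown = sorted({tok for row in tokens for tok in row if tok not in known_set})
```

With this sample data, `unknown` contains only `banana`. A glued value such as bananastrawberry would show up there as a single piece, so it still has to be broken into banana and strawberry by hand before those words are added to unique_fruits and the code is re-run.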