May 7, 2023 04:06:12

How to split a column of defined strings written without spaces in pandas, e.g. appleorange to apple orange?

Question

I am trying to write Python code that splits the values of a column in a pandas dataframe. The column holds values like appleorangemango, which I want to split into apple orange mango. I will split against a large set of unique words.

Assume that I have this dataframe called unique_fruits:

unique_fruits
mango
apple
orange
apricot
peach

And another dataframe of fruits written without spaces, called my_fruits:

my_fruits
mango

apricotapple
orangemango peach
banana

Note that banana is not in the unique_fruits dataframe. Sometimes the column already contains spaces, as in orangemango peach. Finally, a value can be a single fruit or blank, as in the first and second rows of my_fruits.

I plan to read an Excel file into a dataframe and then look for patterns to split on. Whenever something new turns up, I will get a list of unknown words, manually add their split versions to the word list, and repeat until the result is perfect or nearly so.

An example of an unknown word is bananastrawberry. Both banana and strawberry are new words that I will add to the unique_fruits dataframe before re-running the code.

If pine, apple and pineapple are all in unique_fruits, I prefer to keep pineapple as one word. It should be split only if pineapple is not in unique_fruits.
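
To make the pineapple rule concrete, here is a minimal sketch of "longest known word first" in isolation, using only the standard `re` module. The helper name `split_longest_first` is hypothetical and not part of the question. Note that `findall` keeps only known words, so unknown text is silently dropped in this sketch.

```python
import re

def split_longest_first(text, words):
    """Return the known words found in `text`, trying longer words first."""
    # Regex alternation tries alternatives left to right, so sorting by
    # length (descending) makes 'pineapple' win over 'pine' + 'apple'.
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, ordered)))
    return pattern.findall(text)

# pineapple is known: keep it whole
print(split_longest_first("pineapple", ["pine", "apple", "pineapple"]))
# ['pineapple']

# pineapple is not known: fall back to the shorter words
print(split_longest_first("pineapple", ["pine", "apple"]))
# ['pine', 'apple']
```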

Answer 1

Score: 2

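You can build a regex from your unique_fruits, with the elements sorted by length in descending order so that longer fruits come first (this matches pineapple in preference to pine and apple). Use that regex to split the strings in my_fruits, then join the pieces back together with a space. Because the fruit alternation sits in a capture group, re.split keeps the matched fruits in its result, while whitespace matched by \s+ is discarded: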

import re

# df1 holds the known words, df holds the column to split
uf = df1['unique_fruits'].to_list()
# longest first, so longer words win over their shorter parts
uf.sort(key=lambda v: -len(v))
# ['apricot', 'orange', 'mango', 'apple', 'peach']
regex = r'(' + '|'.join(map(re.escape, uf)) + r')|\s+'
# '(apricot|orange|mango|apple|peach)|\\s+'
# filter(None, ...) drops the empty strings and None entries that re.split produces
df['my_fruits'] = df['my_fruits'].apply(lambda s: ' '.join(filter(None, re.split(regex, s))))
#             my_fruits
# 0               mango
# 1       apricot apple
# 2  orange mango peach
# 3              banana
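
The question also mentions blank cells and an iterative hunt for unknown words. The following is a rough sketch of how the same regex could cover both, not part of the original answer. The helper names `build_splitter`, `split_cell` and `unknown_words` are hypothetical, and it assumes a blank Excel cell arrives as a missing value (None or NaN) rather than an empty string.

```python
import re
import pandas as pd

def build_splitter(words):
    """Compile the longest-first pattern described in the answer."""
    ordered = sorted(words, key=len, reverse=True)
    return re.compile("(" + "|".join(map(re.escape, ordered)) + r")|\s+")

def split_cell(value, pattern):
    """Return the cell with its known words separated by single spaces."""
    if not isinstance(value, str):  # assumption: blank cells are None/NaN
        return ""
    return " ".join(filter(None, pattern.split(value)))

def unknown_words(series, known):
    """Collect tokens left over after splitting that are not known words."""
    known = set(known)
    found = set()
    for cell in series:
        found.update(tok for tok in cell.split() if tok not in known)
    return sorted(found)

df1 = pd.DataFrame({"unique_fruits": ["mango", "apple", "orange", "apricot", "peach"]})
df = pd.DataFrame({"my_fruits": ["mango", None, "apricotapple",
                                 "orangemango peach", "bananastrawberry"]})

pattern = build_splitter(df1["unique_fruits"])
df["my_fruits"] = df["my_fruits"].apply(lambda v: split_cell(v, pattern))

print(df["my_fruits"].tolist())
# ['mango', '', 'apricot apple', 'orange mango peach', 'bananastrawberry']
print(unknown_words(df["my_fruits"], df1["unique_fruits"]))
# ['bananastrawberry']
```

Each leftover token such as bananastrawberry can then be split by hand, its parts appended to unique_fruits, and the code re-run, which matches the loop described in the question.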
