June 16, 2023, 15:40:44

Splitting the elements of a list by some separator in the same list

Question

I have an array:

array([nan, 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea',
       'Ate late:Drank coffee', 'Drank coffee:Drank tea:Worked out',
       'Drank tea:Worked out', 'Drank coffee:Drank tea:Stressful day',
       'Drank coffee', 'Drank coffee:Drank tea:Stressful day:Worked out',
       'Drank coffee:Worked out', 'Ate late:Drank coffee:Drank tea',
       'Ate late:Drank coffee:Drank tea:Worked out',
       'Drank tea:Stressful day', 'Drank tea:Stressful day:Worked out',
       'Drank coffee:Stressful day:Worked out',
       'Drank coffee:Stressful day',
       'Ate late:Drank coffee:Drank tea:Stressful day', 'Worked out',
       'Ate late:Drank coffee:Worked out'], dtype=object)

These are the unique values from a column of a DataFrame.

As you can see, some of them are combinations of other values: 'Drank coffee:Drank tea' is a combination of 'Drank coffee' and 'Drank tea'. I want the unique individual elements of this list.

What's the quickest way to create that list? Is there a built-in function in a Python library for this sort of thing?

Expected output:

array([nan, 'Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
       'Worked out'], dtype=object)

Answer 1

Score: 3

Assuming a is the input array, you can use str.extractall:

out = pd.Series(a).str.extractall('([^:]+)')[0].unique()

From the original Series s:

out = s.drop_duplicates().str.extractall('([^:]+)')[0].unique()

Output:

array(['Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
       'Worked out'], dtype=object)
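
As a minimal, self-contained sketch of the extractall approach, the snippet below runs it on a shortened version of the array. The four-element sample and the intermediate name matches are assumptions made for illustration. The pattern ([^:]+) captures every run of characters that contains no colon, and NaN entries produce no matches.

```python
import numpy as np
import pandas as pd

# Shortened sample of the array from the question (illustrative only).
a = np.array([np.nan, 'Stressful day', 'Drank coffee:Drank tea',
              'Ate late:Drank coffee'], dtype=object)

# One row per regex match; the single capture group lands in column 0.
matches = pd.Series(a).str.extractall(r'([^:]+)')

# unique() keeps the order of first appearance.
out = matches[0].unique()
print(list(out))
```

Because NaN rows yield no match, the NaN is absent from out here.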

Other options (maybe less efficient):

out = set(x for s in a if isinstance(s, str) for x in s.split(':'))

out = pd.Series(a).str.split(':').explode().unique()
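
A set does not keep the elements in any particular order. If the order of first appearance matters, one possible pure-Python variant (a sketch, not part of the original answer) is to collect the pieces with dict.fromkeys. The names parts and ordered and the shortened sample list are illustrative assumptions.

```python
# Shortened sample of the data from the question (illustrative only).
a = [float('nan'), 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea',
     'Ate late:Drank coffee']

# Split every string entry on ':' and skip non-strings such as NaN.
parts = (x for s in a if isinstance(s, str) for x in s.split(':'))

# dict keys are unique and keep insertion order (Python 3.7+).
ordered = list(dict.fromkeys(parts))
print(ordered)
# ['Stressful day', 'Drank coffee', 'Drank tea', 'Ate late']
```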

Keeping NaNs:

s = pd.Series(a)
out = np.concatenate([s[s.isna()].unique(),
                      s.str.extractall('([^:]+)')[0].unique()])

Output:

array([nan, 'Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
       'Worked out'], dtype=object)

Or:

out = set(x for s in a for x in (s.split(':') if isinstance(s, str) else [s]))

Output:

{'Drank coffee', 'Drank tea', nan, 'Stressful day', 'Worked out', 'Ate late'}
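
One caveat when NaN goes into a set: NaN never compares equal to itself, so a set only collapses NaNs that are the very same object (as with repeated np.nan). If the data could contain separately created NaN floats, they would each survive. The sketch below illustrates this in pure Python; the helper normalise and the sample list are hypothetical and not part of the original answer.

```python
import math

nan = float('nan')
# Two references to the same NaN object plus one separately created NaN.
a = [nan, 'Drank coffee:Drank tea', nan, float('nan')]

naive = set(x for s in a for x in (s.split(':') if isinstance(s, str) else [s]))
# The repeated `nan` object is stored once (identity match), but the
# separately created NaN is kept as well, because nan != nan.
naive_nans = sum(1 for x in naive if isinstance(x, float) and math.isnan(x))

def normalise(x):
    """Map every float NaN to one shared object so the set can dedupe it."""
    return nan if isinstance(x, float) and math.isnan(x) else x

clean = set(normalise(x) for s in a
            for x in (s.split(':') if isinstance(s, str) else [s]))
clean_nans = sum(1 for x in clean if isinstance(x, float) and math.isnan(x))

print(naive_nans, clean_nans)
```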

Answer 2

Score: 2

Here's a Python plus NumPy solution.

Starting with a list rather than an object-dtype array is simpler (the array layer doesn't add anything to this code):

In [2]: alist = [np.nan, 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea',
   ...:          'Ate late:Drank coffee', 'Drank coffee:Drank tea:Worked out',
   ...:          'Drank tea:Worked out', 'Drank coffee:Drank tea:Stressful day',
   ...:          'Drank coffee', 'Drank coffee:Drank tea:Stressful day:Worked out',
   ...:          'Drank coffee:Worked out', 'Ate late:Drank coffee:Drank tea',
   ...:          'Ate late:Drank coffee:Drank tea:Worked out',
   ...:          'Drank tea:Stressful day', 'Drank tea:Stressful day:Worked out',
   ...:          'Drank coffee:Stressful day:Worked out',
   ...:          'Drank coffee:Stressful day',
   ...:          'Ate late:Drank coffee:Drank tea:Stressful day', 'Worked out',
   ...:          'Ate late:Drank coffee:Worked out']

Handling the nan is a problem, since it's a float, not a string:

In [3]: blist = ...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

In [4]: blist = ...
---------------------------------------------------------------------------
AttributeError: 'float' object has no attribute 'split'

A float can't split, and a string can't be tested for a float value. So let's make a utility function to catch the error:
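
The two failure modes can be reproduced with a short sketch. The exact expressions used in the answer are not shown above, so the two list comprehensions below are assumptions, as is the use of math.isnan for the float test and the shortened alist. They are chosen only because they raise the same two exception types.

```python
import math

# Shortened sample (illustrative only); the first element is a float NaN.
alist = [float('nan'), 'Stressful day', 'Drank coffee:Drank tea']

errors = []

# Attempt 1: filter out NaN with a float test. The NaN is skipped, but the
# test itself fails on the first string.
try:
    blist = [s.split(':') for s in alist if not math.isnan(s)]
except TypeError as exc:
    errors.append(type(exc).__name__)

# Attempt 2: split every element. This fails at once on the float NaN.
try:
    blist = [s.split(':') for s in alist]
except AttributeError as exc:
    errors.append(type(exc).__name__)

print(errors)
# ['TypeError', 'AttributeError']
```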

In [10]: def foo(astr):
    ...:     try:
    ...:         return astr.split(':')
    ...:     except AttributeError:
    ...:         return [astr]   # makes extend easier
    ...:

In [11]: blist = [foo(s) for s in alist]

In [12]: blist
Out[12]: 
[[nan],
 ['Stressful day'],
 ['Drank coffee', 'Drank tea'],
 ['Drank tea'],
 ['Ate late', 'Drank coffee'],
 ['Drank coffee', 'Drank tea', 'Worked out'],
 ['Drank tea', 'Worked out'],
 ['Drank coffee', 'Drank tea', 'Stressful day'],
 ['Drank coffee'],
 ['Drank coffee', 'Drank tea', 'Stressful day', 'Worked out'],
 ['Drank coffee', 'Worked out'],
 ...
 ['Worked out'],
 ['Ate late', 'Drank coffee', 'Worked out']]

Then flatten the list with extend. This step could also have been folded into the creation of blist:

In [13]: clist = []
    ...: for l in blist:
    ...:     clist.extend(l)
    ...:     

In [14]: clist
Out[14]: 
[nan,
 'Stressful day',
 'Drank coffee',
 'Drank tea',
 'Drank tea',
 ...
 'Worked out',
 'Ate late',
 'Drank coffee',
 'Worked out']
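
The split and the flatten can also be combined into one pass. The sketch below is one possible way to do so, using itertools.chain.from_iterable in place of the explicit extend loop. It repeats foo so that it runs on its own, and the shortened alist is an illustrative assumption.

```python
from itertools import chain

def foo(astr):
    """Split a string on ':'; wrap anything else (e.g. NaN) in a list."""
    try:
        return astr.split(':')
    except AttributeError:
        return [astr]

# Shortened sample (illustrative only).
alist = [float('nan'), 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea']

# chain.from_iterable concatenates the per-element lists lazily,
# so no intermediate blist is built.
clist = list(chain.from_iterable(foo(s) for s in alist))
print(clist)
```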

Then it's easy to apply np.unique:

In [15]: u = np.unique(clist)

In [16]: u
Out[16]: 
array(['Ate late', 'Drank coffee', 'Drank tea', 'Stressful day',
       'Worked out', 'nan'], dtype='<U32')
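
Note that the result holds the text 'nan' rather than a float NaN: when np.unique converts the mixed list to an array, every element becomes a string. The sketch below shows the effect on a shortened list and one possible way to keep NaN out, by filtering to strings first. The sample list and the name u_str are illustrative assumptions.

```python
import numpy as np

# Shortened sample (illustrative only).
clist = [np.nan, 'Stressful day', 'Drank coffee', 'Drank tea', 'Drank tea']

# The mixed list is coerced to a string array before deduplication.
u = np.unique(clist)
print(u.dtype.kind)          # 'U' means a unicode string dtype
print('nan' in u.tolist())   # the float NaN became the text 'nan'

# Filtering to strings first leaves NaN out of the result.
u_str = np.unique([x for x in clist if isinstance(x, str)])
print(u_str.tolist())
```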

Actually, we don't need NumPy at all; a Python set will do just as well:

In [17]: S = set(clist)
In [18]: S
Out[18]: {'Ate late', 'Drank coffee', 'Drank tea', 'Stressful day', 'Worked out', nan}
