Splitting the elements of a list by some separator in the same list

Question

I have an array:

array([nan, 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea',
       'Ate late:Drank coffee', 'Drank coffee:Drank tea:Worked out',
       'Drank tea:Worked out', 'Drank coffee:Drank tea:Stressful day',
       'Drank coffee', 'Drank coffee:Drank tea:Stressful day:Worked out',
       'Drank coffee:Worked out', 'Ate late:Drank coffee:Drank tea',
       'Ate late:Drank coffee:Drank tea:Worked out',
       'Drank tea:Stressful day', 'Drank tea:Stressful day:Worked out',
       'Drank coffee:Stressful day:Worked out',
       'Drank coffee:Stressful day',
       'Ate late:Drank coffee:Drank tea:Stressful day', 'Worked out',
       'Ate late:Drank coffee:Worked out'], dtype=object)

These are the unique values from one column of a dataframe.

As you can see, many of them are combinations of other values: 'Drank coffee:Drank tea' is a combination of 'Drank coffee' and 'Drank tea'. I want the unique individual elements of this list.

What's the quickest way to create that list? Is there a built-in function in a Python library for this sort of thing?

Expected output:

array([nan, 'Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
       'Worked out'], dtype=object)

Answer 1

Score: 3

Assuming a is the input array, you could use str.extractall:

out = pd.Series(a).str.extractall('([^:]+)')[0].unique()

From the original Series s:

out = s.drop_duplicates().str.extractall('([^:]+)')[0].unique()

Output:

array(['Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
       'Worked out'], dtype=object)
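
For readers who want to run the extractall approach end to end, here is a minimal sketch. The shortened array is an assumption made for brevity; it stands in for the full array from the question.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the array in the question (an assumption for brevity).
a = np.array([np.nan, 'Stressful day', 'Drank coffee:Drank tea',
              'Ate late:Drank coffee', 'Worked out'], dtype=object)

# Each run of non-colon characters becomes one match; NaN rows produce
# no matches at all, which is why NaN is absent from the result.
matches = pd.Series(a).str.extractall('([^:]+)')

# Column 0 holds the single capture group; unique() keeps first-seen order.
out = matches[0].unique()
print(list(out))
```

The result comes back in order of first appearance rather than sorted, which matches the expected output in the question apart from the missing NaN.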

Other options (maybe less efficient):

out = set(x for s in a if isinstance(s, str) for x in s.split(':'))

out = pd.Series(a).str.split(':').explode().unique()
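
The two alternatives above can be compared on a small input. This is only a sketch with a made-up three-element array; one detail it shows is that NaN passes through split and explode, so the pandas variant needs dropna() if NaN should be excluded.

```python
import numpy as np
import pandas as pd

# Tiny illustrative input (an assumption, not the question's full array).
a = np.array([np.nan, 'Drank tea', 'Drank coffee:Drank tea'], dtype=object)

# Pure-Python variant: skip non-strings, split the rest, deduplicate in a set.
as_set = {part for item in a if isinstance(item, str) for part in item.split(':')}

# pandas variant: split into lists, then explode to one row per part.
exploded = pd.Series(a).str.split(':').explode()

# NaN survives split/explode, so drop it explicitly when it is unwanted.
as_array = exploded.dropna().unique()

print(as_set)
print(list(as_array))
```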

Keeping NaNs:

s = pd.Series(a)
out = np.concatenate([s[s.isna()].unique(),
                      s.str.extractall('([^:]+)')[0].unique()])

Output:

array([nan, 'Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
       'Worked out'], dtype=object)
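
A self-contained sketch of the NaN-preserving idea follows. The mask s.isna() used to pick out the missing entries is an assumption about how the NaN part is selected, and the input is a shortened made-up Series.

```python
import numpy as np
import pandas as pd

# Small illustrative Series with two missing values (an assumption).
s = pd.Series(np.array([np.nan, 'Drank tea', 'Drank coffee:Drank tea', np.nan],
                       dtype=object))

# Select only the missing entries; unique() collapses them to a single NaN.
missing = s[s.isna()].unique()

# Extract the individual colon-separated parts; NaN rows yield no matches.
parts = s.str.extractall('([^:]+)')[0].unique()

# Put the single NaN in front of the unique parts.
out = np.concatenate([missing, parts])
print(out)
```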

Or:

out = set(x for s in a for x in (s.split(':') if isinstance(s, str) else [s]))

Output:

{'Drank coffee', 'Drank tea', nan, 'Stressful day', 'Worked out', 'Ate late'}

Answer 2

Score: 2

Here's a Python plus NumPy solution.

Starting with a list rather than an object dtype array is simpler (the array layer doesn't add anything to this code):

In [2]: alist =[np.nan, 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea',
   ...:        'Ate late:Drank coffee', 'Drank coffee:Drank tea:Worked out',
   ...:        'Drank tea:Worked out', 'Drank coffee:Drank tea:Stressful day',
   ...:        'Drank coffee', 'Drank coffee:Drank tea:Stressful day:Worked out',
   ...:        'Drank coffee:Worked out', 'Ate late:Drank coffee:Drank tea',
   ...:        'Ate late:Drank coffee:Drank tea:Worked out',
   ...:        'Drank tea:Stressful day', 'Drank tea:Stressful day:Worked out',
   ...:        'Drank coffee:Stressful day:Worked out',
   ...:        'Drank coffee:Stressful day',
   ...:        'Ate late:Drank coffee:Drank tea:Stressful day', 'Worked out',
   ...:        'Ate late:Drank coffee:Worked out']

Handling the nan is a problem, since it's a float, not a string:

In [3]: blist = […]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

In [4]: blist = […]
---------------------------------------------------------------------------
AttributeError: 'float' object has no attribute 'split'
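
The two error types can be reproduced with a short sketch. The list comprehensions below are assumptions chosen to trigger the same kinds of errors; they are not the expressions from the original session.

```python
import numpy as np

values = [np.nan, 'Drank coffee:Drank tea']
errors = []

# Hypothetical attempt 1: filter with np.isnan. The ufunc rejects str input,
# so the test itself raises TypeError when it reaches the string.
try:
    [v.split(':') for v in values if not np.isnan(v)]
except TypeError as exc:
    errors.append(type(exc).__name__)

# Hypothetical attempt 2: split every element. nan is a float, and float
# has no split method, so this raises AttributeError.
try:
    [v.split(':') for v in values]
except AttributeError as exc:
    errors.append(type(exc).__name__)

print(errors)
```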

A float has no split method, and a string can't be tested for a float value such as nan. So let's make a utility function to catch the error:

In [10]: def foo(astr):
    ...:     try:
    ...:         return astr.split(':')
    ...:     except AttributeError:
    ...:         return [astr]   # makes extend easier
    ...:         

In [11]: blist = [foo(s) for s in alist]

In [12]: blist
Out[12]: 
[[nan],
 ['Stressful day'],
 ['Drank coffee', 'Drank tea'],
 ['Drank tea'],
 ['Ate late', 'Drank coffee'],
 ['Drank coffee', 'Drank tea', 'Worked out'],
 ['Drank tea', 'Worked out'],
 ['Drank coffee', 'Drank tea', 'Stressful day'],
 ['Drank coffee'],
 ['Drank coffee', 'Drank tea', 'Stressful day', 'Worked out'],
 ['Drank coffee', 'Worked out'],
 ...
 ['Worked out'],
 ['Ate late', 'Drank coffee', 'Worked out']]

Then flatten the list with extend. This step could have been folded into the creation of blist:

In [13]: clist = []
    ...: for l in blist:
    ...:     clist.extend(l)
    ...:     

In [14]: clist
Out[14]: 
[nan,
 'Stressful day',
 'Drank coffee',
 'Drank tea',
 'Drank tea',
 ...
 'Worked out',
 'Ate late',
 'Drank coffee',
 'Worked out']
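
As a variation on the split-and-flatten steps, the standard library's itertools.chain.from_iterable can do the flattening in one expression. This is a sketch only; split_or_wrap is a hypothetical helper that mirrors foo, and the short list is made up for illustration.

```python
from itertools import chain


def split_or_wrap(value, sep=':'):
    """Hypothetical helper mirroring foo: split strings, wrap anything else."""
    try:
        return value.split(sep)
    except AttributeError:
        # Non-strings such as nan are returned as a one-element list.
        return [value]


nan = float('nan')
alist = [nan, 'Drank tea', 'Drank coffee:Drank tea']

# chain.from_iterable concatenates the per-item lists without an explicit loop.
clist = list(chain.from_iterable(split_or_wrap(item) for item in alist))
print(clist)
```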

Then it's easy to apply np.unique. Note that the mixed list is converted to a string array first, so nan comes back as the string 'nan':

In [15]: u = np.unique(clist)

In [16]: u
Out[16]: 
array(['Ate late', 'Drank coffee', 'Drank tea', 'Stressful day',
       'Worked out', 'nan'], dtype='<U32')
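
If the real nan (rather than the string 'nan') should survive deduplication, one option is dict.fromkeys, which also preserves first-seen order. This is a sketch on a made-up short list, and it assumes every missing value is the same np.nan object, as it is in this example.

```python
import numpy as np

clist = [np.nan, 'Drank tea', 'Drank coffee', 'Drank tea']

# np.unique converts the mixed list to a string array, so nan becomes 'nan'
# and the result is sorted alphabetically.
as_strings = np.unique(clist)

# dict.fromkeys deduplicates while keeping insertion order and element types.
# Assumption: all missing values are the very same np.nan object.
ordered = list(dict.fromkeys(clist))

print(as_strings)
print(ordered)
```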

Actually, we don't need numpy at all; a Python set will do just as well:

In [17]: S = set(clist)
In [18]: S
Out[18]: {'Ate late', 'Drank coffee', 'Drank tea', 'Stressful day', 'Worked out', nan}

huangapple
  • Published on June 16, 2023, 15:40:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76487970.html