Splitting the elements of a list by some separator in the same list
Question
I have an array:
array([nan, 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea',
'Ate late:Drank coffee', 'Drank coffee:Drank tea:Worked out',
'Drank tea:Worked out', 'Drank coffee:Drank tea:Stressful day',
'Drank coffee', 'Drank coffee:Drank tea:Stressful day:Worked out',
'Drank coffee:Worked out', 'Ate late:Drank coffee:Drank tea',
'Ate late:Drank coffee:Drank tea:Worked out',
'Drank tea:Stressful day', 'Drank tea:Stressful day:Worked out',
'Drank coffee:Stressful day:Worked out',
'Drank coffee:Stressful day',
'Ate late:Drank coffee:Drank tea:Stressful day', 'Worked out',
'Ate late:Drank coffee:Worked out'], dtype=object)
These are the unique values from one column of a dataframe.
As you can see, many of them are combinations of other values; for example, 'Drank coffee:Drank tea' is a combination of 'Drank coffee' and 'Drank tea'. I want the unique individual elements of this list.
What's the quickest way to create that list? Is there a built-in function in the Python libraries for this sort of thing?
Expected output:
array([nan, 'Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
'Worked out'], dtype=object)
Answer 1
Score: 3
Assuming `a` is the input array, you could use `str.extractall`:
out = pd.Series(a).str.extractall('([^:]+)')[0].unique()
From the original Series `s`:
out = s.drop_duplicates().str.extractall('([^:]+)')[0].unique()
Output:
array(['Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
'Worked out'], dtype=object)
Other options (maybe less efficient):
out = set(x for s in a if isinstance(s, str) for x in s.split(':'))
out = pd.Series(a).str.split(':').explode().unique()
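For a quick check of the `split`/`explode` variant, here is a minimal self-contained sketch on a shortened, hypothetical version of the array (the name `a_small` is not from the question). As far as I can tell, this variant also keeps the NaN entry, since `str.split` leaves missing values untouched and `explode` passes them through.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the question's array (assumed sample, not the full data)
a_small = np.array([np.nan, 'Stressful day', 'Drank coffee:Drank tea',
                    'Ate late:Drank coffee', 'Drank tea:Worked out'],
                   dtype=object)

# Split each string on ':', give every piece its own row, then de-duplicate
out = pd.Series(a_small).str.split(':').explode().unique()
print(out)
```

The values come back in order of first appearance, which is the order the question's expected output uses.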
Keeping NaNs:
s = pd.Series(a)
out = np.concatenate([s[s.isna()].unique(),
                      s.str.extractall('([^:]+)')[0].unique()])
Output:
array([nan, 'Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
'Worked out'], dtype=object)
Or:
out = set(x for s in a for x in (s.split(':') if isinstance(s, str) else [s]))
Output:
{'Drank coffee', 'Drank tea', nan, 'Stressful day', 'Worked out', 'Ate late'}
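A set does not preserve order, whereas the expected output in the question lists the values in order of first appearance. If that order matters, one possible pure-Python sketch uses `dict.fromkeys`, which keeps insertion order in Python 3.7+. The sample list `values` and the helper name `unique_parts` are made up for illustration.

```python
import math

def unique_parts(items, sep=':'):
    """Return the distinct separator-delimited parts in first-seen order.

    Non-string items (such as NaN) are kept as single values.
    """
    parts = (part
             for item in items
             for part in (item.split(sep) if isinstance(item, str) else [item]))
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(parts))

nan = math.nan  # one NaN object, reused so it is only counted once
values = [nan, 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea',
          'Ate late:Drank coffee', 'Drank tea:Worked out']
result = unique_parts(values)
print(result)
```

One caveat: NaN never compares equal to itself, so this only collapses NaNs that are the same object (as `np.nan` is). Distinct NaN objects would each show up in the result.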
Answer 2
Score: 2
Here's a Python plus NumPy solution.
Starting with a list rather than an object dtype array is simpler (the array layer doesn't add anything to this code):
In [2]: alist =[np.nan, 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea',
...: 'Ate late:Drank coffee', 'Drank coffee:Drank tea:Worked out',
...: 'Drank tea:Worked out', 'Drank coffee:Drank tea:Stressful day',
...: 'Drank coffee', 'Drank coffee:Drank tea:Stressful day:Worked out',
...: 'Drank coffee:Worked out', 'Ate late:Drank coffee:Drank tea',
...: 'Ate late:Drank coffee:Drank tea:Worked out',
...: 'Drank tea:Stressful day', 'Drank tea:Stressful day:Worked out',
...: 'Drank coffee:Stressful day:Worked out',
...: 'Drank coffee:Stressful day',
...: 'Ate late:Drank coffee:Drank tea:Stressful day', 'Worked out',
...: 'Ate late:Drank coffee:Worked out']
Handling the nan is a problem, since it's a float, not a string:
In [3]: blist = [s.split(':') for s in alist if not np.isnan(s)]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
In [4]: blist = [s.split(':') for s in alist]
---------------------------------------------------------------------------
AttributeError: 'float' object has no attribute 'split'
A float can't `split`, and a string can't be tested for a float value. So let's make a utility function to catch the error.
In [10]: def foo(astr):
...: try:
...: return astr.split(':')
...: except AttributeError:
...: return [astr] # makes extend easier
...:
In [11]: blist = [foo(s) for s in alist]
In [12]: blist
Out[12]:
[[nan],
['Stressful day'],
['Drank coffee', 'Drank tea'],
['Drank tea'],
['Ate late', 'Drank coffee'],
['Drank coffee', 'Drank tea', 'Worked out'],
['Drank tea', 'Worked out'],
['Drank coffee', 'Drank tea', 'Stressful day'],
['Drank coffee'],
['Drank coffee', 'Drank tea', 'Stressful day', 'Worked out'],
['Drank coffee', 'Worked out'],
...
['Worked out'],
['Ate late', 'Drank coffee', 'Worked out']]
And flatten the list with extend. I might have included this in the blist creation:
In [13]: clist = []
...: for l in blist:
...: clist.extend(l)
...:
In [14]: clist
Out[14]:
[nan,
'Stressful day',
'Drank coffee',
'Drank tea',
'Drank tea',
...
'Worked out',
'Ate late',
'Drank coffee',
'Worked out']
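The split and flatten steps can also be folded into a single pass. A possible sketch, reusing the same try/except idea under the hypothetical name `split_or_wrap` and a shortened sample list (neither is from the session above):

```python
from itertools import chain

def split_or_wrap(value, sep=':'):
    """Split strings on sep; wrap anything else (e.g. a float NaN) in a list."""
    try:
        return value.split(sep)
    except AttributeError:
        return [value]

# Shortened, assumed sample input
sample = [float('nan'), 'Stressful day', 'Drank coffee:Drank tea',
          'Drank tea:Worked out']

# chain.from_iterable concatenates the per-item lists without an explicit loop
flat = list(chain.from_iterable(split_or_wrap(v) for v in sample))
print(flat)
```

Duplicates are still present at this point; they are removed in the next step.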
Then it's easy to apply `np.unique`.
In [15]: u = np.unique(clist)
In [16]: u
Out[16]:
array(['Ate late', 'Drank coffee', 'Drank tea', 'Stressful day',
'Worked out', 'nan'], dtype='<U32')
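Note that in the output above the NaN has become the string 'nan', because NumPy converts the mixed list to a single string dtype before finding the unique values. If that is not wanted, one option is to filter out the non-strings first. A minimal sketch on an assumed sample list (`clist_sample` is a stand-in, not the full `clist`):

```python
import numpy as np

# Shortened stand-in for the flattened list
clist_sample = [np.nan, 'Stressful day', 'Drank coffee', 'Drank tea',
                'Drank tea', 'Worked out']

# Keep only real strings so the float NaN is never coerced to the text 'nan'
strings_only = [x for x in clist_sample if isinstance(x, str)]
u = np.unique(strings_only)  # sorted unique strings
print(u)
```

The NaN can then be handled separately, or appended back if it should appear in the result.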
Actually, we don't need NumPy at all; a Python set will do just as well:
In [17]: S = set(clist)
In [18]: S
Out[18]: {'Ate late', 'Drank coffee', 'Drank tea', 'Stressful day', 'Worked out', nan}