numpy: split the elements of a list on a separator, within the same list

l2osamch  posted on 2023-06-23 in Other

I have an array:

array([nan, 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea',
       'Ate late:Drank coffee', 'Drank coffee:Drank tea:Worked out',
       'Drank tea:Worked out', 'Drank coffee:Drank tea:Stressful day',
       'Drank coffee', 'Drank coffee:Drank tea:Stressful day:Worked out',
       'Drank coffee:Worked out', 'Ate late:Drank coffee:Drank tea',
       'Ate late:Drank coffee:Drank tea:Worked out',
       'Drank tea:Stressful day', 'Drank tea:Stressful day:Worked out',
       'Drank coffee:Stressful day:Worked out',
       'Drank coffee:Stressful day',
       'Ate late:Drank coffee:Drank tea:Stressful day', 'Worked out',
       'Ate late:Drank coffee:Worked out'], dtype=object)

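These are the unique values of a DataFrame column.
As you can see, some of them are combinations of other values; for example, 'Drank coffee:Drank tea' is a combination of 'Drank coffee' and 'Drank tea'. I want a list containing only the distinct individual elements.
What is the fastest way to build that list? Is there a built-in function in a Python library for this kind of thing?
Expected output: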

array([nan, 'Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
       'Worked out'], dtype=object)
w8biq8rn  #1

Assuming a is the input array, you can use str.extractall:

out = pd.Series(a).str.extractall('([^:]+)')[0].unique()

Starting from the original Series s (drop duplicate rows first so the regex runs on fewer values):

out = s.drop_duplicates().str.extractall('([^:]+)')[0].unique()

Output:

array(['Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
       'Worked out'], dtype=object)
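
To make the extractall step less opaque, here is a minimal sketch on a shortened stand-in for the array (the four sample values are an assumption chosen for illustration, not the full data from the question). It shows the intermediate frame that extractall returns before unique() is applied.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the question's array (illustrative values only).
a = np.array([np.nan, 'Stressful day', 'Drank coffee:Drank tea',
              'Ate late:Drank coffee'], dtype=object)

# extractall returns one row per regex match, indexed by
# (original position, match number); column 0 holds the captured group.
matches = pd.Series(a).str.extractall('([^:]+)')
print(matches)

# The NaN row yields no match, so it is absent from the result.
out = matches[0].unique()
print(list(out))
```

Because rows without a match simply disappear, this approach drops the NaN, which is why a separate step is needed if the NaN has to be kept.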

Other options (probably less efficient):

out = set(x for s in a if isinstance(s, str) for x in s.split(':'))

out = pd.Series(a).str.split(':').explode().unique()

To keep the NaNs:

s = pd.Series(a)
out = np.concatenate([s[s.isna()].unique(),
                      s.str.extractall('([^:]+)')[0].unique()])

Output:

array([nan, 'Stressful day', 'Drank coffee', 'Drank tea', 'Ate late',
       'Worked out'], dtype=object)

Or:

out = set(x for s in a for x in (s.split(':') if isinstance(s, str) else [s]))

Output:

{'Drank coffee', 'Drank tea', nan, 'Stressful day', 'Worked out', 'Ate late'}
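
A set does not preserve order, whereas the expected output in the question lists values in order of first appearance. If that order matters, one possible variation (a sketch, not part of the original answer) is to deduplicate with dict.fromkeys. The helper name split_item and the shortened input list are assumptions made for illustration.

```python
import math

# Shortened stand-in for the question's data (illustrative values only).
a = [float('nan'), 'Stressful day', 'Drank coffee:Drank tea',
     'Drank tea', 'Ate late:Drank coffee']

def split_item(item):
    """Hypothetical helper: split strings on ':', wrap anything else in a list."""
    return item.split(':') if isinstance(item, str) else [item]

# dict keys keep insertion order, so duplicates are dropped while
# the order of first appearance is preserved.
ordered = list(dict.fromkeys(x for item in a for x in split_item(item)))
print(ordered)
```

One caveat: nan is not equal to itself, so this only collapses NaNs that are the same object. Several distinct nan objects in the input would each be kept.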
iecba09b  #2

Here is a plain Python plus numpy solution.
It's simpler to start from a list rather than an object-dtype array (the array layer adds nothing to this code):

In [2]: alist =[np.nan, 'Stressful day', 'Drank coffee:Drank tea', 'Drank tea',
   ...:        'Ate late:Drank coffee', 'Drank coffee:Drank tea:Worked out',
   ...:        'Drank tea:Worked out', 'Drank coffee:Drank tea:Stressful day',
   ...:        'Drank coffee', 'Drank coffee:Drank tea:Stressful day:Worked out',
   ...:        'Drank coffee:Worked out', 'Ate late:Drank coffee:Drank tea',
   ...:        'Ate late:Drank coffee:Drank tea:Worked out',
   ...:        'Drank tea:Stressful day', 'Drank tea:Stressful day:Worked out',
   ...:        'Drank coffee:Stressful day:Worked out',
   ...:        'Drank coffee:Stressful day',
   ...:        'Ate late:Drank coffee:Drank tea:Stressful day', 'Worked out',
   ...:        'Ate late:Drank coffee:Worked out']

Handling nan is a nuisance, because it is a float, not a string:

In [3]: blist = [s.split(':') for s in alist if not np.isnan(s)]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

In [4]: blist = [s.split(':') for s in alist]
---------------------------------------------------------------------------
AttributeError: 'float' object has no attribute 'split'

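A float can't be split, and a string can't be tested as a float value (np.isnan raises TypeError on it). So let's write a small utility function that catches the error.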

In [10]: def foo(astr):
    ...:     try:
    ...:         return astr.split(':')
    ...:     except AttributeError:
    ...:         return [astr]   # makes extend easier
    ...:         

In [11]: blist = [foo(s) for s in alist]

In [12]: blist
Out[12]: 
[[nan],
 ['Stressful day'],
 ['Drank coffee', 'Drank tea'],
 ['Drank tea'],
 ['Ate late', 'Drank coffee'],
 ['Drank coffee', 'Drank tea', 'Worked out'],
 ['Drank tea', 'Worked out'],
 ['Drank coffee', 'Drank tea', 'Stressful day'],
 ['Drank coffee'],
 ['Drank coffee', 'Drank tea', 'Stressful day', 'Worked out'],
 ['Drank coffee', 'Worked out'],
 ...
 ['Worked out'],
 ['Ate late', 'Drank coffee', 'Worked out']]
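
As an alternative to catching AttributeError, the type can be checked up front with isinstance. The following is a minimal sketch under that assumption; split_or_wrap is a hypothetical name and the list is a shortened stand-in for alist. It also reproduces the np.isnan failure on a string.

```python
import numpy as np

# Shortened stand-in for alist (illustrative values only).
alist = [np.nan, 'Stressful day', 'Drank coffee:Drank tea',
         'Ate late:Drank coffee']

# np.isnan only accepts numeric input, so a str argument raises TypeError.
try:
    np.isnan('Stressful day')
except TypeError as exc:
    print('np.isnan failed with', type(exc).__name__)

def split_or_wrap(item):
    """Hypothetical helper: split strings, wrap non-strings (e.g. nan) in a list."""
    if isinstance(item, str):
        return item.split(':')
    return [item]

blist = [split_or_wrap(s) for s in alist]
print(blist)
```

Both versions give the same nested list; the isinstance check just makes the string/non-string distinction explicit instead of relying on the exception.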

Then flatten the list with extend. I could have folded this into the creation of blist:

In [13]: clist = []
    ...: for l in blist:
    ...:     clist.extend(l)
    ...:     

In [14]: clist
Out[14]: 
[nan,
 'Stressful day',
 'Drank coffee',
 'Drank tea',
 'Drank tea',
 ...
 'Worked out',
 'Ate late',
 'Drank coffee',
 'Worked out']
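
The flattening step can also be written without an explicit loop. This is a sketch using a shortened stand-in for blist (illustrative values only); it shows two equivalent forms from the standard library and core language.

```python
from itertools import chain

# Shortened stand-in for blist (illustrative values only).
blist = [['Stressful day'], ['Drank coffee', 'Drank tea'],
         ['Drank tea', 'Worked out']]

# chain.from_iterable concatenates the sublists lazily.
clist = list(chain.from_iterable(blist))

# Equivalent nested list comprehension.
clist_alt = [x for sub in blist for x in sub]

print(clist)
```

Either form gives the same flat list as the extend loop.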

Then it's easy to apply np.unique:

In [15]: u = np.unique(clist)

In [16]: u
Out[16]: 
array(['Ate late', 'Drank coffee', 'Drank tea', 'Stressful day',
       'Worked out', 'nan'], dtype='<U32')
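
Note that in the output above nan has become the string 'nan': np.unique first converts the mixed list to a single string dtype. If a real nan is wanted, one possible workaround (a sketch with a shortened stand-in for clist, not part of the original answer) is to run np.unique on the strings only and track the nan separately.

```python
import numpy as np

# Shortened stand-in for clist (illustrative values only).
clist = [np.nan, 'Stressful day', 'Drank coffee', 'Drank tea', 'Drank tea']

# Mixed float/str input is coerced to a string dtype, so nan becomes 'nan'.
u = np.unique(clist)
print(u.dtype.kind, u)

# Apply np.unique to the strings only, and record separately
# whether a real nan was present in the input.
strings = [x for x in clist if isinstance(x, str)]
u_str = np.unique(strings)
has_nan = any(isinstance(x, float) and np.isnan(x) for x in clist)
print(u_str, has_nan)
```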

Actually, numpy isn't needed at all here; a Python set does the job just as well:

In [17]: S = set(clist)
In [18]: S
Out[18]: {'Ate late', 'Drank coffee', 'Drank tea', 'Stressful day', 'Worked out', nan}
