numpy: How to sort a multidimensional string array in descending order in Python without changing the original row order?

14ifxucb posted on 2023-02-04 in Python

I have an array that looks like this:

x = array([['PP Mango', 0.25, 0.75, 'PP'],
       ['PP Nectarine', 0.25, 0.75, 'PP'],
       ['Lemon', 0.25, 0.75, 'Loose'],
       ['PP Peach', 0.25, 0.75, 'PP'],
       ['Orange Navel', 0.25, 0.75, 'Loose'],
       ['PP Cherries', 0.25, 0.75, 'PP']], dtype=object)

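I am trying to sort this multidimensional array by its 4th column, x[:,3], in descending order without changing the original row order within each group. x[:,3] is a string (it will always be either 'PP' or 'Loose').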

Code I tried:
x[x[:,3].argsort()][::-1]       # but this shuffles the original row order within each 4th-column group, which should not happen
Expected output:
x = array([['PP Mango', 0.25, 0.75, 'PP'],
       ['PP Nectarine', 0.25, 0.75, 'PP'],
       ['PP Peach', 0.25, 0.75, 'PP'],
       ['PP Cherries', 0.25, 0.75, 'PP'],
       ['Lemon', 0.25, 0.75, 'Loose'],
       ['Orange Navel', 0.25, 0.75, 'Loose']], dtype=object)

1dkrff03 #1

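Use a list sort with a single sort key that checks whether the 4th element is 'Loose'.
Python's sort is stable, which guarantees that elements comparing equal keep their original relative order.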

lst = [['PP Mango', 0.25, 0.75, 'PP'],
       ['PP Nectarine', 0.25, 0.75, 'PP'],
       ['Lemon', 0.25, 0.75, 'Loose'],
       ['PP Peach', 0.25, 0.75, 'PP'],
       ['Orange Navel', 0.25, 0.75, 'Loose'],
       ['PP Cherries', 0.25, 0.75, 'PP']]

lst = sorted(lst,key=lambda x:x[3]=='Loose')
print(lst)

Prints:

[['PP Mango', 0.25, 0.75, 'PP'],
 ['PP Nectarine', 0.25, 0.75, 'PP'],
 ['PP Peach', 0.25, 0.75, 'PP'],
 ['PP Cherries', 0.25, 0.75, 'PP'],
 ['Lemon', 0.25, 0.75, 'Loose'],
 ['Orange Navel', 0.25, 0.75, 'Loose']]

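This is a non-NumPy solution, but it works. Then convert back to a NumPy array: array(lst, dtype=object). Without dtype=object, NumPy converts the floats in the mixed list to strings.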
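A minimal sketch of the full round trip, using a shortened copy of the list. The names `lst_sorted`, `as_strings`, and `as_objects` are only for illustration. It shows what happens to the float columns with and without `dtype=object` when converting back:

```python
import numpy as np

lst = [['PP Mango', 0.25, 0.75, 'PP'],
       ['Lemon', 0.25, 0.75, 'Loose'],
       ['PP Peach', 0.25, 0.75, 'PP']]

# Stable sort: the key is False for 'PP' and True for 'Loose',
# so 'PP' rows come first and ties keep their original order.
lst_sorted = sorted(lst, key=lambda row: row[3] == 'Loose')

as_strings = np.array(lst_sorted)                # mixed list becomes a string array
as_objects = np.array(lst_sorted, dtype=object)  # floats stay Python floats

print(as_strings.dtype.kind)    # 'U' (unicode strings)
print(type(as_objects[0, 1]))   # <class 'float'>
```

If the result needs to match the question's array, `dtype=object` is the variant to use.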


7hiiyaii #2

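Going from NumPy to Python and back to NumPy, as the previous answer does, seems rather silly/inefficient. NumPy has a stable sort too; you just need to ask for it: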

x[x[:,3].argsort(kind='stable')]

Note that I have removed your reversal for now. The result is:

[['Lemon' 0.25 0.75 'Loose']
 ['Orange Navel' 0.25 0.75 'Loose']
 ['PP Mango' 0.25 0.75 'PP']
 ['PP Nectarine' 0.25 0.75 'PP']
 ['PP Peach' 0.25 0.75 'PP']
 ['PP Cherries' 0.25 0.75 'PP']]

Now how do we get 'PP' before 'Loose'? One way is to compare the column with 'Loose', as Jean-François did, since that turns 'Loose' into True and 'PP' into False, and False is smaller than True:

x[(x[:,3] == 'Loose').argsort(kind='stable')]

Or check for inequality with 'PP':

x[(x[:,3] != 'PP').argsort(kind='stable')]

Both return the desired result:

[['PP Mango' 0.25 0.75 'PP']
 ['PP Nectarine' 0.25 0.75 'PP']
 ['PP Peach' 0.25 0.75 'PP']
 ['PP Cherries' 0.25 0.75 'PP']
 ['Lemon' 0.25 0.75 'Loose']
 ['Orange Navel' 0.25 0.75 'Loose']]
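To make the boolean-key trick more concrete, here is a small sketch that prints the intermediate values for the question's array. The names `is_loose` and `order` are introduced only for illustration:

```python
import numpy as np

x = np.array([['PP Mango', 0.25, 0.75, 'PP'],
              ['PP Nectarine', 0.25, 0.75, 'PP'],
              ['Lemon', 0.25, 0.75, 'Loose'],
              ['PP Peach', 0.25, 0.75, 'PP'],
              ['Orange Navel', 0.25, 0.75, 'Loose'],
              ['PP Cherries', 0.25, 0.75, 'PP']], dtype=object)

# Boolean key: False for 'PP' rows, True for 'Loose' rows.
is_loose = x[:, 3] == 'Loose'
print(is_loose.tolist())   # [False, False, True, False, True, False]

# Stable argsort: all False positions first, then all True positions,
# each group in its original order.
order = is_loose.argsort(kind='stable')
print(order.tolist())      # [0, 1, 3, 5, 2, 4]

print(x[order][:, 0].tolist())
```

Indexing `x` with `order` then pulls the rows out in that order, which is why the 'PP' rows keep their relative positions.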

Yet another way is to take the 'PP' rows, take the non-'PP' rows, and vstack them:

PP = x[:, 3] == 'PP'
np.vstack((x[PP], x[~PP]))

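What if you have more than two distinct values, or unknown values, so that the equality/inequality trick is not enough? Unlike Python, NumPy doesn't have a nice reverse=True, but we can do what Python does: reverse both before and after the actual sort. Here is one way to do that in NumPy: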

x[~x[::-1,3].argsort(kind='stable')[::-1]]
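As a sketch of how the reverse, sort, reverse idea plays out with more than two values, here is a small example on hypothetical data (the two-column array and the 'Bag' category are made up for illustration, so the key column is index 1 rather than 3). `~idx` turns a position `i` in the reversed column into the negative index `-1 - i`, which addresses the same row in the original array:

```python
import numpy as np

# Hypothetical data: a label column and a category column with three values.
x = np.array([['a', 'Loose'],
              ['b', 'PP'],
              ['c', 'Bag'],
              ['d', 'PP'],
              ['e', 'Loose'],
              ['f', 'Bag']], dtype=object)

# Stable ascending argsort of the reversed column, then reversed again:
# descending by category, with ties back in their original order.
idx = x[::-1, 1].argsort(kind='stable')[::-1]

# idx holds positions in the reversed column; ~idx == -1 - idx maps them
# to negative indices into the original array.
result = x[~idx]
print(result[:, 0].tolist())   # ['b', 'd', 'a', 'e', 'c', 'f']

# Cross-check against Python's stable sorted(..., reverse=True).
expected = sorted(x.tolist(), key=lambda row: row[1], reverse=True)
print(result.tolist() == expected)   # True
```

The descending order here is plain string comparison ('PP' > 'Loose' > 'Bag'), the same ordering `argsort` uses on the column.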



xam8gpfp #3

Benchmark of the posted solutions, on both the example array and a large array (100,000 rows picked at random from the example).

6 rows:
Kelly_revrev    4.2 μs ± 0.1 μs       5,424 bytes 
Puff            4.7 μs ± 0.1 μs         961 bytes 
Jean_François   4.9 μs ± 0.2 μs         961 bytes 
Kelly_Loose     6.0 μs ± 0.1 μs       5,424 bytes 
Kelly_PP        6.0 μs ± 0.2 μs       5,424 bytes 
Kelly_vstack   11.3 μs ± 0.2 μs       2,568 bytes 

100000 rows:
Kelly_PP       11.9 ms ± 0.1 ms   4,002,328 bytes 
Kelly_vstack   12.2 ms ± 0.3 ms   6,500,512 bytes 
Kelly_Loose    12.5 ms ± 0.4 ms   4,002,328 bytes 
Kelly_revrev   37.7 ms ± 1.1 ms   4,002,328 bytes 
Jean_François  71.5 ms ± 1.2 ms  13,700,112 bytes 
Puff           71.8 ms ± 0.7 ms  13,700,112 bytes 

Python: 3.7.4 (default, Jul  9 2019, 16:48:28) 
[GCC 8.3.1 20190223 (Red Hat 8.3.1-2)]
NumPy:  1.15.4

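I don't know why my NumPy solutions use more memory than the others on your small example but less on the large array. My guess is that the NumPy solutions carry a small constant overhead.
Full code: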

import numpy as np
from timeit import default_timer as time
from statistics import mean, stdev
import tracemalloc as tm
import sys

def Kelly_Loose(x):
    return x[(x[:,3] == 'Loose').argsort(kind='stable')]

def Kelly_PP(x):
    return x[(x[:,3] != 'PP').argsort(kind='stable')]

def Kelly_revrev(x):
    return x[~x[::-1,3].argsort(kind='stable')[::-1]]

def Kelly_vstack(x):
    PP = x[:, 3] == 'PP'
    return np.vstack((x[PP], x[~PP]))

def Jean_François(x):
    return np.array(sorted(x, key=lambda x: x[3] == 'Loose'))

def Puff(x):
    return np.array(sorted(x, key= lambda element: element[3], reverse=True))

funcs = Kelly_Loose, Kelly_PP, Kelly_revrev, Kelly_vstack, Jean_François, Puff

x = np.array([['PP Mango', 0.25, 0.75, 'PP'],
       ['PP Nectarine', 0.25, 0.75, 'PP'],
       ['Lemon', 0.25, 0.75, 'Loose'],
       ['PP Peach', 0.25, 0.75, 'PP'],
       ['Orange Navel', 0.25, 0.75, 'Loose'],
       ['PP Cherries', 0.25, 0.75, 'PP']], dtype=object)

X = x[np.random.randint(x.shape[0], size=10**5), :]

def test(x, reps, unit, scale):
  times = {f: [] for f in funcs}
  def stats(f):
    ts = [t * scale for t in sorted(times[f])[:5]]
    return f'{mean(ts):5.1f} {unit} ± {stdev(ts):3.1f} {unit}'

  for r in range(10):
    expect = None
    for f in funcs:
      t = time()
      for _ in range(reps):
        result = f(x)
      times[f].append((time() - t) / reps)
      if expect is None: expect = result
      assert (result == expect).all()

  print(len(x), 'rows:')
  for f in sorted(funcs, key=stats):
    del result
    tm.start()
    result = f(x)
    peak = tm.get_traced_memory()[1]
    tm.stop()
    print(f.__name__.ljust(13), stats(f), f'{peak:11,} bytes ')
  print()

test(x, 1000, 'μs', 1e6)
test(X, 1, 'ms', 1e3)

print('Python:', sys.version)
print('NumPy: ', np.__version__)

jv4diomz #4

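The full results of the requested tests don't fit in a comment, so I am posting them as an answer.
To compare all approaches fairly, the sorting timings below come from slightly modified code in which the pure-Python approaches don't have to convert to and from NumPy. It looks like NumPy may not be the right tool for small arrays.
What still surprises me is that Kelly_vstack doesn't beat the other Kelly versions at 100,000 rows and does really badly at 6 rows.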

6 rows:
Puff            1.0 μs ± 0.0 μs         304 bytes 
Jean_François   1.1 μs ± 0.0 μs         304 bytes 
Kelly_revrev    3.8 μs ± 0.0 μs       5,624 bytes 
Kelly_Loose     7.2 μs ± 0.0 μs       5,582 bytes 
Kelly_PP        7.2 μs ± 0.1 μs       5,582 bytes 
Kelly_vstack   15.9 μs ± 0.1 μs       2,964 bytes 

100000 rows:
Kelly_PP        6.1 ms ± 0.0 ms   4,002,456 bytes 
Kelly_Loose     6.2 ms ± 0.0 ms   4,002,456 bytes 
Kelly_vstack    8.6 ms ± 0.0 ms   6,500,528 bytes 
Puff           14.9 ms ± 0.0 ms   1,865,936 bytes 
Jean_François  17.2 ms ± 0.1 ms   1,867,808 bytes 
Kelly_revrev   27.1 ms ± 0.0 ms   4,002,456 bytes 

Python: 3.9.13 (main, May 20 2022, 21:21:14) 
[GCC 5.4.1 20160904]
numpy 1.23.4

With the conversion to NumPy and back included, the timings are as follows:

6 rows:
Kelly_revrev    4.0 μs ± 0.0 μs       5,624 bytes 
Puff            4.7 μs ± 0.0 μs       1,120 bytes 
Jean_François   4.8 μs ± 0.0 μs       1,120 bytes 
Kelly_PP        7.0 μs ± 0.0 μs       5,582 bytes 
Kelly_Loose     7.1 μs ± 0.0 μs       5,582 bytes 
Kelly_vstack   15.7 μs ± 0.1 μs       2,964 bytes 

100000 rows:
Kelly_Loose     5.7 ms ± 0.0 ms   4,002,456 bytes 
Kelly_PP        5.8 ms ± 0.0 ms   4,002,456 bytes 
Kelly_vstack    8.2 ms ± 0.0 ms   6,500,528 bytes 
Kelly_revrev   26.2 ms ± 0.0 ms   4,002,456 bytes 
Puff           67.3 ms ± 0.9 ms  18,399,968 bytes 
Jean_François  69.1 ms ± 0.1 ms  18,399,968 bytes

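This shows that for small arrays, the solutions that convert back and forth perform about as well as the NumPy solutions.
To give a really complete picture, below are the timings for pure-Python in-place sorting with .sort() instead of sorted(). Compared with NumPy, the pure-Python versions fall less far behind on the very large array, and on the small array they have excellent timings (about 4 times faster than the best NumPy version):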

6 rows:
Puff            0.9 μs ± 0.0 μs         208 bytes 
Jean_François   1.0 μs ± 0.0 μs         208 bytes 
Kelly_revrev    3.9 μs ± 0.0 μs       5,624 bytes 
Kelly_PP        7.0 μs ± 0.0 μs       5,582 bytes 
Kelly_Loose     7.2 μs ± 0.0 μs       5,582 bytes 
Kelly_vstack   15.2 μs ± 0.1 μs       2,964 bytes 

100000 rows:
Kelly_PP        6.2 ms ± 0.0 ms   4,002,456 bytes 
Kelly_Loose     6.3 ms ± 0.0 ms   4,002,456 bytes 
Kelly_vstack    8.4 ms ± 0.1 ms   6,500,528 bytes 
Puff            9.2 ms ± 0.0 ms     800,208 bytes 
Jean_François  11.4 ms ± 0.0 ms     800,208 bytes 
Kelly_revrev   27.6 ms ± 0.1 ms   4,002,456 bytes

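Below is the full code used to obtain the results above. It can easily be switched (via the triple-quoted blocks) to show timings for sorting with and without the Python/NumPy conversion: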

import numpy as np
from timeit import default_timer as time
from statistics import mean, stdev
import tracemalloc as tm

def Kelly_Loose(x):
    return x[(x[:,3] == 'Loose').argsort(kind='stable')]

def Kelly_PP(x):
    return x[(x[:,3] != 'PP').argsort(kind='stable')]

def Kelly_revrev(x):
    return x[~x[::-1,3].argsort(kind='stable')[::-1]]

def Kelly_vstack(x):
    PP = x[:, 3] == 'PP'
    return np.vstack((x[PP], x[~PP]))
"""
def Jean_François(x):
    return np.array(sorted(x, key=lambda x: x[3] == 'Loose'))

def Puff(x):
    return np.array(sorted(x, key=lambda element: element[3], reverse=True))
"""
def Jean_François(x):
    return sorted(x, key=lambda x: x[3] == 'Loose')

def Puff(x):
    return sorted(x, key=lambda element: element[3], reverse=True)
"""
def Jean_François(lst):
    lst.sort(key=lambda row: row[3]=='Loose')
    return lst

def Puff(lst):
    lst.sort(key=lambda row: row[3], reverse=True)
    return lst
#"""

funcs = Kelly_Loose, Kelly_PP, Kelly_revrev, Kelly_vstack, Jean_François, Puff

x = np.array(
      [['PP Mango', 0.25, 0.75, 'PP'],
       ['PP Nectarine', 0.25, 0.75, 'PP'],
       ['Lemon', 0.25, 0.75, 'Loose'],
       ['PP Peach', 0.25, 0.75, 'PP'],
       ['Orange Navel', 0.25, 0.75, 'Loose'],
       ['PP Cherries', 0.25, 0.75, 'PP']], dtype=object)
X = x[np.random.randint(x.shape[0], size=10**5), :]
#print(len(X)) 

lst = [['PP Mango', 0.25, 0.75, 'PP'],
       ['PP Nectarine', 0.25, 0.75, 'PP'],
       ['Lemon', 0.25, 0.75, 'Loose'],
       ['PP Peach', 0.25, 0.75, 'PP'],
       ['Orange Navel', 0.25, 0.75, 'Loose'],
       ['PP Cherries', 0.25, 0.75, 'PP']]
LST = (10**5//6)*lst+lst[0:4]
#print(len(LST))

def test(x, reps, unit, scale):
  bckpX = x
  times = {f: [] for f in funcs}
  def stats(f):
    ts = [t * scale for t in sorted(times[f])[:5]]
    return f'{mean(ts):5.1f} {unit} ± {stdev(ts):3.1f} {unit}'

  for r in range(10):
    expect = None
    for f in funcs:
      # """
      if f in [Jean_François, Puff]:
          if reps==1:
              x = LST
          else:
              x = lst
      else:
          x = bckpX
    #"""
      t = time()
      for _ in range(reps):
        result = f(x)
      times[f].append((time() - t) / reps)
      #if expect is None: expect = result
      #assert (result == expect).all()

  print(len(x), 'rows:')
  #print(funcs)
  for f in sorted(funcs, key=stats):
    del result
    # """
    if f in [Jean_François, Puff]:
        if reps==1:
            x = LST
        else:
            x = lst
    else:
        x = bckpX
    #"""
    tm.start()
    result = f(x)
    peak = tm.get_traced_memory()[1]
    tm.stop()
    print(f.__name__.ljust(13), stats(f), f'{peak:11,} bytes ')
  print()

test(x, 1000, 'μs', 1e6)
test(X,    1, 'ms', 1e3)
import sys
print('Python:', sys.version)
print('numpy', np.version.version)

af7jpaap #5

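You just need to use a sort key, which specifies what the sorting algorithm should compare when ordering the elements of the array. Here you want the key to be the last element of each row, so: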

np.array(sorted(x, key= lambda element: element[3], reverse=True))

should give you the output you want. Note the use of reverse=True to get the data in descending order.
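A short sketch of why reverse=True works here while sorting and then reversing (as in the question) does not: Python's sort stays stable with reverse=True, so rows with equal keys keep their original order. The shortened `rows` list and the variable names are only for illustration:

```python
rows = [['PP Mango', 'PP'],
        ['Lemon', 'Loose'],
        ['PP Peach', 'PP'],
        ['Orange Navel', 'Loose']]

# Descending by the last element; equal keys keep their original order.
stable_desc = sorted(rows, key=lambda row: row[-1], reverse=True)
print([row[0] for row in stable_desc])
# ['PP Mango', 'PP Peach', 'Lemon', 'Orange Navel']

# Sorting ascending and reversing afterwards also reverses the ties.
sorted_then_reversed = sorted(rows, key=lambda row: row[-1])[::-1]
print([row[0] for row in sorted_then_reversed])
# ['PP Peach', 'PP Mango', 'Orange Navel', 'Lemon']
```

This relies on 'PP' comparing greater than 'Loose' as strings, which holds for the two values in the question.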
