Python NumPy weighted average

Posted by lxkprmvk on 2023-05-07 in Python

First things first: this is not a duplicate of "NumPy: calculate averages with NaNs removed", and I will explain why.
Suppose I have an array

from numpy import array, average, nan

a = array([1,2,3,4])

and I want to compute its average using weights

weights = [4,3,2,1]
output = average(a, weights=weights)
print(output)
     2.0

Fine, that is easy enough. But now I have something like this:

a = array([1,2,nan,4])

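Computing the average the usual way of course yields nan. Can I avoid that? In principle I want to ignore the NaNs, so I would like to end up with something like this: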

a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print(output)
     1.75
Answer #1 (rhfm7lfc)

Alternatively, you can use a MaskedArray like this:

>>> import numpy as np

>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
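
As a small variation on the masked-array approach, the mask can be built in one step with `np.ma.masked_invalid`. This is only a sketch of the same idea; note that `masked_invalid` masks infinities as well as NaNs, which may or may not be what you want.

```python
import numpy as np

a = np.array([1, 2, np.nan, 4])
weights = np.array([4, 3, 2, 1])

# masked_invalid masks every NaN (and inf) entry in a single call
masked = np.ma.masked_invalid(a)

# np.ma.average skips the masked entry together with its weight
result = np.ma.average(masked, weights=weights)
print(result)
```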
Answer #2 (esbemjvw)

First find the indices where the value is not NaN, then pass the filtered versions of a and weights to numpy.average:

>>> import numpy as np
>>> a = np.array([1, 2, np.nan,4])
>>> weights = np.array([4, 3, 2, 1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75

As @mtrw suggested in the comments, using a boolean mask instead of an index array is cleaner here:

>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
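
If this is needed in several places, the boolean-mask version could be wrapped in a small helper. The function name `nan_weighted_average` is made up for illustration and is not part of NumPy; the sketch assumes 1-D inputs of equal length and NaNs only in the values.

```python
import numpy as np

def nan_weighted_average(values, weights):
    """Weighted average of `values`, skipping entries that are NaN (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    keep = ~np.isnan(values)  # True where the value is usable
    return np.average(values[keep], weights=weights[keep])

result = nan_weighted_average([1, 2, np.nan, 4], [4, 3, 2, 1])
print(result)
```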
Answer #3 (au9on6nz)

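I would offer another solution that scales better to more dimensions (for example, when averaging along different axes). The code below works with a 2D array that may contain NaNs and averages over axis=0: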

import numpy as np

# generate a random 2D array; cast to float so that it can hold NaN,
# and plant one NaN (illustrative only) so there is something to ignore
a = np.random.randint(5, size=(3, 2)).astype(float)
a[1, 0] = np.nan

# make a weights matrix with zero weight wherever a is NaN
w_vec = np.arange(1, a.shape[0] + 1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= ~np.isnan(a)

# take the average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec

# mean_vec holds the NaN-ignoring weighted averages of a taken along axis=0
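
To make the result easy to check by hand, here is a deterministic sketch of the same zero-the-weights idea wrapped in a function. The name `nan_weighted_mean` and the sample data are assumptions for illustration; the weights are expected to be broadcastable to the shape of the array.

```python
import numpy as np

def nan_weighted_mean(a, weights, axis=0):
    """NaN-ignoring weighted mean along `axis` (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    valid = ~np.isnan(a)
    # broadcast the weights to a's shape and zero them where a is NaN
    w = np.broadcast_to(np.asarray(weights, dtype=float), a.shape) * valid
    # replace NaN by 0 so it contributes nothing to the weighted sum
    num = np.sum(np.where(valid, a, 0.0) * w, axis=axis)
    den = np.sum(w, axis=axis)
    return num / den

a = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
row_weights = np.array([1, 2, 3]).reshape(-1, 1)  # one weight per row

mean_vec = nan_weighted_mean(a, row_weights, axis=0)
# column 0: (1*1 + 3*3) / (1 + 3) = 2.5
# column 1: (2*1 + 4*2) / (1 + 2) = 10/3
print(mean_vec)
```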
Answer #4 (vfhzx4xs)

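Expanding on @Ashwini's and @Nicolas's answers, here is a version that also handles the edge case where all data values are np.nan, and that is designed to work with a pandas DataFrame without type-related problems: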

from typing import List, Union

import numpy as np
import pandas as pd


def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
                       weights: List[Union[float, int]]) -> np.ndarray:
    """ Calculates the weighted average of `measures`' values, ex-nans.

    When nans are present in  `measures`' values,
    the weights are recalculated based only on the weights for non-nan measures.

    Note:
        The calculation used is NOT the same as just ignoring nans.
        For example, if we had data and weights:
            data = [2, 3, np.nan]
            weights = [0.5, 0.2, 0.3]
            calc_wa_ignore_nan approach:
                (2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
            The ignoring nans approach:
                (2*0.5) + (3*0.2) == 1.6

    Args:
        df: Multiple rows of numeric data values with `measures` as column headers.
        measures: The str names of the columns to select from `df`.
        weights: The numeric weights associated with `measures`.

    Example:
        >>> df = pd.DataFrame({"meas1": [1, 1],
                               "meas2": [2, 2],
                               "meas3": [3, 3],
                               "meas4": [np.nan, 0],
                               "meas5": [5, 5]})
        >>> measures = ["meas2", "meas3", "meas4"]
        >>> weights = [0.5, 0.2, 0.3]
        >>> calc_wa_ignore_nan(df, measures, weights)
        array([2.28571429, 1.6])

    """
    assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
    # Need to coerce type to np.float64 instead of python's float
    # to avoid "ufunc 'isnan' not supported for the input types ..." error
    data = np.array(df[measures].values, dtype=np.float64)

    # Make a 2d array with the same weights for each row
    # cast for safety and better errors
    weights = np.array([weights, ] * data.shape[0], dtype=np.float64)

    mask = np.isnan(data)
    masked_data = np.ma.masked_array(data, mask=mask)
    masked_weights = np.ma.masked_array(weights, mask=mask)

    # np.nanmean doesn't support weights
    weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
    # Replace masked elements with np.nan
    # otherwise those elements will be interpreted as 0 when read into a pd.DataFrame
    weighted_avgs = weighted_avgs.filled(np.nan)

    return weighted_avgs
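
The all-NaN edge case mentioned above can be seen in isolation with plain NumPy. This is a minimal sketch of the masking step only, with made-up sample data, not a replacement for the function: a row in which every value is NaN comes back masked and is then filled with NaN.

```python
import numpy as np

data = np.array([[2.0, 3.0, np.nan],
                 [np.nan, np.nan, np.nan]])  # second row has no usable value
weights = np.array([0.5, 0.2, 0.3])

# repeat the weights for every row so they match the shape of data
weights_2d = np.tile(weights, (data.shape[0], 1))

masked_data = np.ma.masked_array(data, mask=np.isnan(data))
row_avgs = np.ma.average(masked_data, weights=weights_2d, axis=1)

# a fully masked row yields a masked result; turn it into NaN explicitly
result = np.ma.filled(row_avgs, np.nan)
print(result)
```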
Answer #5 (vtwuwzda)

All the solutions above are good, but none of them handles the case where the weights contain NaN. For that, use pandas:

import numpy as np

def weighted_average_ignoring_nan(df, col_value, col_weight):
    den = 0
    num = 0
    for index, row in df.iterrows():
        # use a row only if both its weight and its value are present
        if not np.isnan(row[col_weight]) and not np.isnan(row[col_value]):
            den = den + row[col_weight]
            num = num + row[col_weight] * row[col_value]
    return num / den
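
For larger frames the row loop can be slow. A vectorized sketch of the same rule might look like the following; the function name `weighted_average_dropna`, the column names and the sample data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def weighted_average_dropna(df, col_value, col_weight):
    """Weighted average over rows where value and weight are both present (hypothetical helper)."""
    # dropna removes every row in which either of the two columns is NaN
    valid = df[[col_value, col_weight]].dropna()
    return np.average(valid[col_value], weights=valid[col_weight])

df = pd.DataFrame({"value": [1, 2, np.nan, 4],
                   "weight": [4, np.nan, 2, 1]})
result = weighted_average_dropna(df, "value", "weight")
# only rows 0 and 3 survive: (1*4 + 4*1) / (4 + 1) = 1.6
print(result)
```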
Answer #6 (oewdyzsn)

Since you are looking for an average, another idea is to simply replace all NaN values with 0:

>>> import numpy as np
>>> a = np.array([[ 3.,  2.,  5.], [np.nan,  4., np.nan], [np.nan, np.nan, np.nan]])
>>> w = np.array([[ 1.,  2.,  3.], [np.nan, np.nan, np.nan], [np.nan, np.nan, np.nan]])
>>> a[np.isnan(a)] = 0
>>> w[np.isnan(w)] = 0
>>> np.average(a, weights=w)
3.6666666666666665

This also works with the axis argument of the average function, but take care that your weights do not sum to 0.
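
The zero-sum caveat matters as soon as an axis is involved: with the arrays above, the second and third rows have a weight sum of 0, and `np.average` raises `ZeroDivisionError` in that case. One possible way around it, sketched here under the assumption that NaN is an acceptable result for such rows, is to do the division by hand. The function name `zero_filled_average` is made up for illustration.

```python
import numpy as np

def zero_filled_average(a, w, axis=None):
    """Weighted average after zeroing NaN pairs; NaN where the weights sum to 0 (hypothetical helper)."""
    a = np.array(a, dtype=float)  # copies, so the caller's arrays are untouched
    w = np.array(w, dtype=float)
    # zero both the value and the weight wherever either one is NaN
    invalid = np.isnan(a) | np.isnan(w)
    a[invalid] = 0
    w[invalid] = 0
    num = (a * w).sum(axis=axis)
    den = w.sum(axis=axis)
    # 0/0 would emit a warning; suppress it and report NaN for those slots
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(den == 0, np.nan, num / den)

a = np.array([[3., 2., 5.], [np.nan, 4., np.nan], [np.nan, np.nan, np.nan]])
w = np.array([[1., 2., 3.], [np.nan, np.nan, np.nan], [np.nan, np.nan, np.nan]])
row_avgs = zero_filled_average(a, w, axis=1)
print(row_avgs)
```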
