numpy: why does np.std on weighted samples in Python differ from the theoretical value?

Posted by qzlgjiam 11 months ago in Python

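I am sampling from three distributions with set weights and cross-checking that the sampling works as expected. The check is that the sum of (weight of each distribution * its mean) matches the sample mean, which it does, and that the same holds for the standard deviation. However, I can't get the standard deviations to match. I'm not sure why, since I think the sampling is working as expected, and I would like help figuring out why std_generated and expected_std differ.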

import numpy as np
from scipy.stats import expon
np.random.seed(42)  # Set a specific seed value

# Assuming disbn_1, disbn_2, and disbn_3 are your frozen distribution objects
# and w_1, w_2, w_3 are the corresponding probabilities

# Set up your distribution objects
par_1 = 2.0  # Adjust this parameter according to your specific distribution
disbn_1 = expon(scale=par_1)
disbn_2 = expon(scale=1.0)  # Adjust parameters as needed
disbn_3 = expon(scale=0.5)  # Adjust parameters as needed

# Set up probabilities
w_1, w_2, w_3 = 0.4, 0.3, 0.3
weights = np.array([w_1, w_2, w_3])

# Number of samples you want to generate
num_samples = 5000000  # Increase the number of samples

# Sample from the distributions with specified probabilities
selected_distributions = [disbn_1, disbn_2, disbn_3]
random_indices = np.random.choice(len(weights), size=num_samples, p=weights)

# Use np.choose to select samples based on random indices
selected_samples = np.choose(random_indices, [dist.rvs(size=num_samples) for dist in selected_distributions])

# Calculate summary statistics
mean_generated = np.mean(selected_samples)
std_generated = np.std(selected_samples, ddof=0)

# Calculate expected mean and std based on the specified distribution and probabilities
expected_mean = np.sum(weights * np.array([dist.mean() for dist in selected_distributions]))
expected_std = np.sqrt(np.sum(weights * np.array([dist.var() for dist in selected_distributions])))

# Display results
print(f"Mean of generated samples: {mean_generated:.6f}")
print(f"Expected mean based on distribution: {expected_mean:.6f}")
print()
print(f"Standard deviation of generated samples: {std_generated:.6f}")
print(f"Expected standard deviation based on distribution: {expected_std:.6f}")


Answer #1 (yuvru6vn)

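Your way of computing the standard deviation does not account for the different means of the distributions.
Here is an example.
Suppose you have two normal distributions, N(0,1) and N(10,1), and you build a sample by drawing from each of them with equal probability.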


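If you apply your formula to this sample, you get sqrt(0.5 * var(dist1) + 0.5 * var(dist2)), which equals 1. However, the true standard deviation is slightly more than 5: the two clusters sit 10 apart, so most of the spread comes from the gap between their means. Clearly a term is missing.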
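To make the gap concrete, here is a minimal simulation sketch of the two-normal example. The sample size, the seed, and the variable names are assumptions chosen for illustration; they are not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n = 1_000_000                    # assumed sample size

# Pick component 0 or 1 with equal probability, then draw from
# N(0, 1) or N(10, 1) depending on which component was picked.
component = rng.integers(0, 2, size=n)
samples = rng.normal(loc=np.where(component == 0, 0.0, 10.0), scale=1.0)

# Weighted average of the component variances only (the question's formula).
naive_std = np.sqrt(0.5 * 1.0 + 0.5 * 1.0)

# Adding the spread of the component means around the overall mean of 5:
# within-component variance 1 plus between-component variance 25.
theoretical_std = np.sqrt(1.0 + 25.0)

empirical_std = samples.std()

print(f"naive formula:   {naive_std:.4f}")
print(f"with mean term:  {theoretical_std:.4f}")
print(f"empirical:       {empirical_std:.4f}")
```

The empirical value should land close to sqrt(26), not 1, which is the discrepancy described above.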
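So I went looking for a way to get the standard deviation of the combined sample from the standard deviations and means of its component distributions.
I found the following MathOverflow post: How do I combine standard deviations of two groups?
That answer covers 2 groups, but with a little work it can be extended to any number of groups.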

import numpy as np

def combined_std(means, stds, samples_per_dist):
    """Standard deviation of a sample pooled from several groups."""
    means = np.array(means)
    stds = np.array(stds)
    samples_per_dist = np.array(samples_per_dist)
    assert means.shape == stds.shape == samples_per_dist.shape
    # Share of the pooled sample contributed by each group
    proportions = samples_per_dist / samples_per_dist.sum()
    combined_mean = np.average(means, weights=proportions)
    # Spread inside each group
    intra_distribution_var = ((samples_per_dist - 1) * (stds ** 2)).sum() / (samples_per_dist.sum() - 1)
    # Spread of the group means around the combined mean
    inter_distribution_var = (samples_per_dist * ((means - combined_mean) ** 2)).sum() / (samples_per_dist.sum() - 1)
    combined_var = intra_distribution_var + inter_distribution_var
    return np.sqrt(combined_var)

The function is called as follows:

means = [dist.mean() for dist in selected_distributions]
stds = [dist.std() for dist in selected_distributions]
samples_per_dist = num_samples * weights
expected_std = combined_std(means, stds, samples_per_dist)


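The inter_distribution_var term corrects for the differences between the means of the distributions.
Applying it to your distributions, I get an expected standard deviation of 1.545154, which is quite close to your experimental result.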
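For the population (infinite-sample) case, the same idea can be written directly in terms of the mixture weights, with no sample counts: the mixture variance is sum(w_i * var_i) + sum(w_i * (mean_i - mixture_mean)^2), which is the law of total variance. The sketch below is an illustration rather than part of the original answer; mixture_std is a hypothetical helper name, and the inputs are the exponential scales from the question (an exponential with scale s has mean s and standard deviation s).

```python
import numpy as np

def mixture_std(weights, means, stds):
    """Population std of a mixture, via the law of total variance.

    Hypothetical helper for illustration; weights are normalised to sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    weights = weights / weights.sum()

    mixture_mean = np.sum(weights * means)
    # Average variance inside the components
    within_var = np.sum(weights * stds ** 2)
    # Variance of the component means around the mixture mean
    between_var = np.sum(weights * (means - mixture_mean) ** 2)
    return np.sqrt(within_var + between_var)

# Exponential components from the question: scale s gives mean s and std s.
scales = [2.0, 1.0, 0.5]
expected_std = mixture_std([0.4, 0.3, 0.3], means=scales, stds=scales)
print(f"{expected_std:.6f}")
```

For these inputs the within-component part is 1.975 and the between-component part is 0.4125, so the result is sqrt(2.3875). With 5,000,000 samples the n - 1 corrections in combined_std are negligible, so both approaches should agree to many decimal places.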
