numpy: Deleting observations from a data frame according to a Bernoulli random variable that is 0 or 1

Posted by hsvhsicv on 2023-01-26 in Other


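I have a data frame with 1000 rows. I want to delete 500 observations based on a specific column Y, in such a way that the larger the value of Y, the more likely the row is to be deleted. One approach is to sort the data frame by that column in ascending order. Then, for i = 1, ..., 1000, draw a Bernoulli random variable with success probability p_i that depends on i, and delete every observation whose Bernoulli variable equals 1.
First I sort by this column:
df_sorted = df.sort_values("mycolumn")
Next, I tried something like this: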

p_i = np.linspace(0, 1, num=df_sorted.shape[0])
bernoulli = np.random.binomial(1, p_i)
delete_index = bernoulli == 1

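This gives delete_index, a boolean vector of True/False values in which True is more likely at higher indices, but I get more than 500 True values.
How can I get exactly 500 True values in this vector? And how can I use it to delete the corresponding rows of the data frame?
For example, if delete_index is False at i = 1, the first row of the data frame is not deleted; if it is True, that row is deleted.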

numpy

Source: https://stackoverflow.com/questions/75237664/deleting-observations-from-a-data-frame-according-to-a-bernoulli-random-variabl

1 Answer


l7wslrjt1#

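I'm not sure why you need the number of True values to be exactly 500. With random binomial draws the count will be close to 500, but most of the time it will not be exactly 500. Still, here is a possible solution; I don't know how useful it will be for your purpose.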

import numpy as np

p_i = np.linspace(0, 1, num=1000)

# Redraw the Bernoulli vector until the value 1 appears exactly 500 times
count = 0
while count != 500:
    bernoulli = np.random.binomial(1, p_i)
    count = np.sum(bernoulli)

# Use the NumPy array as a positional boolean mask on the sorted frame.
# A pd.Series mask would be aligned by index label, not by sorted position.
df_sorted = df_sorted[bernoulli == 0]
# df_sorted now holds the 500 rows that were not deleted
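
As a rough check on why the retry loop above rarely succeeds on the first draw, the sketch below computes the mean and standard deviation of the number of ones for independent Bernoulli draws with these p_i, and estimates how often a single draw gives exactly 500. The seed and the number of simulated draws (5000) are arbitrary choices for illustration, not part of the original answer.

```python
import numpy as np

n = 1000
p_i = np.linspace(0, 1, num=n)  # same probabilities as in the answer

# For independent Bernoulli draws, the number of ones has
# mean sum(p_i) and variance sum(p_i * (1 - p_i)).
expected_count = p_i.sum()
std_count = np.sqrt((p_i * (1 - p_i)).sum())

# Empirical check: fraction of simulated draws with exactly 500 ones.
rng = np.random.default_rng(0)  # seed chosen only for reproducibility
draws = rng.binomial(1, p_i, size=(5000, n))  # p_i broadcasts across rows
counts = draws.sum(axis=1)
hit_rate = np.mean(counts == 500)

print(expected_count, std_count, hit_rate)
```

The mean is 500 and the standard deviation is roughly 13, so counts a dozen or so away from 500 are common, and the loop will usually need tens of redraws before it lands on exactly 500.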

I hope this helps. I can't comment yet because of my reputation, which is why I'm posting this as an answer rather than a comment.
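
If exactly 500 deletions are needed without redrawing, one alternative is weighted sampling without replacement. Note that this is a different scheme from independent Bernoulli draws: it picks 500 distinct row positions, with higher-ranked rows more likely to be picked. The sketch below is only an illustration under assumptions: the data frame is a made-up stand-in, the weights growing linearly with rank are one possible choice, and names such as delete_pos and df_kept are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed is arbitrary

# Hypothetical stand-in for the real data; "mycolumn" follows the question.
df = pd.DataFrame({"mycolumn": rng.normal(size=1000)})
df_sorted = df.sort_values("mycolumn")

n = len(df_sorted)
n_delete = 500

# Assumption: the deletion weight grows linearly with the rank in sorted order.
weights = np.arange(1, n + 1, dtype=float)
weights /= weights.sum()  # probabilities must sum to 1

# Draw exactly n_delete distinct positions, favoring high ranks.
delete_pos = rng.choice(n, size=n_delete, replace=False, p=weights)

# Build a positional boolean mask and keep the rows that were not drawn.
delete_mask = np.zeros(n, dtype=bool)
delete_mask[delete_pos] = True
df_kept = df_sorted[~delete_mask]
```

Because delete_mask is a plain NumPy array, it is applied by position in the sorted frame, so df_kept always has exactly 500 rows. pandas also offers DataFrame.sample with a weights argument, which could serve the same purpose.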

Answered on 2023-01-26
