Python: Producing data for a cumulative distribution plot in BigQuery

Posted by wtlkbnrh on 2023-02-15 in Python


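I want to produce a cumulative plot that shows, for a given value, the percentage of the data that is less than or equal to that value. In python / matplotlib / pandas I can do this with the quantile function that pandas provides (I guess numpy would work too):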

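import numpy as np
import matplotlib.pyplot as plt

def plot_quantile(series, start=0., end=0.99):
    # Quantile levels to evaluate; stopping at 0.99 cuts off the extreme tail.
    y = np.linspace(start, end, 500)
    # Value of the series at each quantile level.
    x = series.quantile(y)
    plt.plot(x, y * 100)
    plt.xlabel("Duration in minutes")
    plt.ylabel("% of drives less than x minutes")
    plt.grid()

# df is a pandas DataFrame with a numeric `duration` column.
plot_quantile(df.duration)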

In this example I plotted the distribution of taxi ride durations from the NYC taxi dataset.

I would like to produce similar data with a SQL query in BigQuery, and I am pretty close with the following query:

select
        approx_quantiles(duration, 100) as duration_quantile
    from base_table

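This gives me 101 data points, starting with the minimum and ending with the maximum. Now I have two problems:

1. I don't know how these values map to quantiles (for example, which value is the P50?). I also need to produce these numbers for the plot, as you can see in the Python code.
2. I don't seem to have a way to cut it off near the top. The maximum is most likely a very large outlier, which makes my plot hard to read.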
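A minimal sketch of one way to handle both points on the Python side, assuming the documented behaviour that `APPROX_QUANTILES(x, n)` returns `n + 1` boundaries ordered from the approximate minimum to the approximate maximum, so that element `i` corresponds roughly to the `i / n` quantile (with `n = 100`, index 50 is the approximate P50). The helper name `label_quantiles` and its `max_pct` parameter are hypothetical, and the input list is a stand-in for a real query result.

```python
def label_quantiles(boundaries, max_pct=100.0):
    """Pair each APPROX_QUANTILES boundary with its percentile.

    boundaries: the n + 1 values returned by APPROX_QUANTILES(x, n),
    ordered from the approximate minimum to the approximate maximum.
    max_pct: hypothetical cut-off; points above it are dropped.
    """
    n = len(boundaries) - 1  # number of buckets, e.g. 100
    if n < 1:
        raise ValueError("need at least two boundaries")
    labelled = []
    for i, value in enumerate(boundaries):
        pct = 100.0 * i / n  # element i is roughly the (i / n) quantile
        if pct <= max_pct:
            labelled.append((pct, value))
    return labelled


# Stand-in for a query result: 101 boundaries with an outlier as the maximum.
boundaries = list(range(100)) + [10_000]
points = label_quantiles(boundaries, max_pct=99.0)
print(points[50])   # the P50 entry
print(points[-1])   # last point kept after truncation
```

Dropping everything above the 99th percentile removes the maximum, which plays the same role as `end=0.99` in `plot_quantile`.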


Source: https://stackoverflow.com/questions/75360256/producing-data-for-a-cumulative-distribution-plot-in-bigquery

1 answer

gupuwyp21#

The following might not be exactly what you want, but I hope it helps.

SELECT min,
       SUM(cnt) OVER (ORDER BY min) cumulative_cnt,
       ROUND(SUM(cnt) OVER (ORDER BY min) / SUM(cnt) OVER (), 4) cumulative_pct,
  FROM (
    -- trip count per duration in minutes
    SELECT ROUND(trip_seconds/60) min, COUNT(1) cnt,
      FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
     GROUP BY 1
  );

I used a different dataset from `bigquery-public-data` rather than the NYC taxi dataset.
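To make the logic of the query easier to follow, here is a rough pure-Python equivalent of its steps: bucket durations into whole minutes, count trips per bucket, then take a running sum and divide by the total. This is only an illustration, not a replacement for the query; the function name `cumulative_distribution` and the sample durations are made up, and NULL handling is ignored.

```python
import math
from collections import Counter
from itertools import accumulate


def cumulative_distribution(trip_seconds):
    """Return (minute, cumulative_cnt, cumulative_pct) rows, like the query."""
    # BigQuery's ROUND rounds halves away from zero. For non-negative
    # durations floor(x + 0.5) matches that, whereas Python's round()
    # rounds halves to the nearest even number.
    counts = Counter(math.floor(s / 60 + 0.5) for s in trip_seconds)
    minutes = sorted(counts)                     # ORDER BY min
    running = list(accumulate(counts[m] for m in minutes))  # SUM(cnt) OVER (ORDER BY min)
    if not running:
        return []
    total = running[-1]                          # SUM(cnt) OVER ()
    return [(m, c, round(c / total, 4)) for m, c in zip(minutes, running)]


# Made-up durations in seconds, including one very long trip.
rows = cumulative_distribution([60, 90, 120, 120, 600, 7200])
for row in rows:
    print(row)
```

Each output row answers the original question directly: `cumulative_pct` is the share of trips that took at most `minute` minutes.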

The query above yields the cumulative distribution, which the answer plotted as a chart in Looker Studio (chart not reproduced here).

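"I don't seem to have a way to cut it off near the top. The maximum is most likely a very large outlier, which makes my plot hard to read."
Yes, as you said, it is not easy to read.
Looker Studio provides a custom filter, and if you apply it to the chart you get a smoother curve.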

If you would rather filter in the query than in Looker Studio, you can add a QUALIFY clause at the bottom of the query above.

SELECT min,
       SUM(cnt) OVER (ORDER BY min) cumulative_cnt,
       ROUND(SUM(cnt) OVER (ORDER BY min) / SUM(cnt) OVER (), 4) cumulative_pct,
  FROM (
    SELECT ROUND(trip_seconds/60) min, COUNT(1) cnt,
      FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
     GROUP BY 1
  ) QUALIFY cumulative_pct < 0.99;
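If the rows are fetched without the QUALIFY clause, the same cut-off can be applied on the client before plotting. The sketch below assumes rows shaped like the query output, `(min, cumulative_cnt, cumulative_pct)`; the helper name `truncate_for_plot` and the sample rows are hypothetical.

```python
def truncate_for_plot(rows, max_pct=0.99):
    """Keep rows with cumulative_pct < max_pct and split them into axes."""
    xs, ys = [], []
    for minute, _cumulative_cnt, cumulative_pct in rows:
        if cumulative_pct < max_pct:      # same condition as the QUALIFY clause
            xs.append(minute)             # x axis: duration in minutes
            ys.append(cumulative_pct * 100)  # y axis: percent of trips
    return xs, ys


# Made-up rows in the shape the query returns; the last one is the outlier tail.
rows = [(1, 1, 0.1667), (2, 4, 0.6667), (10, 5, 0.8333), (120, 6, 1.0)]
xs, ys = truncate_for_plot(rows)
print(xs)
```

The resulting `xs` and `ys` can be passed to `plt.plot(xs, ys)` with the same axis labels as in the question's `plot_quantile`.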

Answered 2023-02-15
