在spark databricks和python udf中使用广播变量

myss37ts 于 2021-05-27 发布在 Spark

关注(0)|答案(0)|浏览(377)

我使用https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
创建一个函数，该函数接收Dataframe的行并计算某些内容。

schema = StructType([
          StructField('sensorid', IntegerType(), True), StructField('sensortimestamp', LongType(), True), StructField('calculatedvalue', DoubleType(), True)
        ])
        @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
        def rawToSomthing(rawData):
          df = pd.DataFrame(columns=['sensorid','sensortimestamp','calculatedvalue'])
          sensor = rawData.sensorid[0]
          rawChunkSize=60*24
          rawData = rawData.sort_values(by=['sensortimestamp'], ascending=True)
          rawData = rawData.append(pullRef())
          return rawData

         def pullRef():
           return refDataframe
         DF_sqlRawData.groupby('sensorid').apply(rawToSomthing)

我想能够从“refdataframe”中提取每个spark作业中的数据。有没有办法用广播变量来实现这一点。如果是这样的话，我怎么会把它放到@pandas\u自定义项中呢

apache-spark pyspark user-defined-functions databricks Broadcast

来源：https://stackoverflow.com/questions/62883346/using-a-broadcast-variable-in-spark-databricks-with-python-udf

暂无答案！

目前还没有任何答案，快来回答吧！

我来回答

在spark databricks和python udf中使用广播变量

暂无答案！

相关问题

热门标签

最新问答