I am using the sparklyr package, together with the lubridate package, to calculate the duration between two dates in days. In R this produces a duration data type, which can then be converted to a numeric data type, as in the example below.
# Load packages
library(sparklyr)
library(dplyr)
library(lubridate)
# Create dataframe with start and end date
df <- tibble(start = ymd("20210101"),
             end = ymd("20210105"))
df
---
# A tibble: 1 x 2
start end
<date> <date>
1 2021-01-01 2021-01-05
---
# Calculate duration and convert to numeric using R dataframe
df %>%
  mutate(dur = end - start,
         dur_num = as.numeric(dur))
---
# A tibble: 1 x 4
start end dur dur_num
<date> <date> <drtn> <dbl>
1 2021-01-01 2021-01-05 4 days 4
---
Doing the same on a Spark DataFrame via sparklyr produces an error, because the duration is automatically converted to a string data type. The code and the resulting error are shown below. Please ignore the shift in the dates between local R and Spark; it is caused by a timezone difference.
# Connect to local Spark cluster
sc <- spark_connect(master = "local", version = "3.0")
# Copy dataframe to Spark
df_spark <- copy_to(sc, df)
# Calculate duration using Spark dataframe
df_spark %>%
  mutate(dur = end - start)
---
# Source: spark<?> [?? x 3]
start end dur
<date> <date> <chr>
1 2020-12-31 2021-01-04 4 days
---
# Calculate duration and convert to numeric using Spark dataframe
df_spark %>%
  mutate(dur = end - start,
         dur_num = as.numeric(dur))
---
Error: org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(q01.`dur` AS DOUBLE)' due to data type
mismatch: cannot cast interval to double; line 1 pos 30;
'Project [start#58, end#59, dur#280, cast(dur#280 as double) AS dur_num#281]
+- SubqueryAlias q01
+- Project [start#58, end#59, subtractdates(end#59, start#58) AS dur#280]
+- SubqueryAlias df
+- LogicalRDD [start#58, end#59], false
---
Is it possible to use the lubridate::duration data type in Spark via sparklyr? If not, is there a way to bypass the conversion to a string, so that the result is the number of days as a double? Any help is appreciated.
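One possible workaround (a sketch, not a verified answer): Spark cannot cast an interval to a double, but Spark SQL's built-in `datediff` function returns the day difference directly as an integer, and sparklyr passes functions it does not recognize straight through to Spark SQL. Under that assumption, the duration column can be computed without ever producing an interval:

```r
# Sketch: use Spark SQL's datediff() instead of subtracting dates.
# datediff(end, start) returns the day count as an integer, which
# as.numeric() then casts to double on the Spark side.
df_spark %>%
  mutate(dur_num = as.numeric(datediff(end, start)))
```

This avoids the interval type entirely, so the `cannot cast interval to double` error should not arise; the trade-off is that the intermediate `dur` column with its `"4 days"` representation is no longer produced.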