I am trying to run a Spark SQL statement that does a simple GROUP BY over an aggregation. It complains that the `month` column cannot be resolved given the input columns I defined in the schema, yet the tutorial I am following runs the same code without error.
Code:
StructField[] fields = new StructField[]{
    new StructField("level", DataTypes.StringType, false, Metadata.empty()),
    new StructField("datetime", DataTypes.StringType, false, Metadata.empty())
};
StructType schema = new StructType(fields);
Dataset<Row> dateSet = spark.createDataFrame(inMemory, schema);
dateSet.createOrReplaceTempView("logging_level");
Dataset<Row> results = spark.sql("select level, date_format(datetime, 'MMMM') as month, count(1) as total from logging_level group by level, month");
Stack trace:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`month`' given input columns: [level, datetime]; line 1 pos 107
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)....
1 Answer
You cannot reuse an alias defined in the `select` clause inside the `group by` clause; you need to repeat the expression in the `group by`. Note that I would also replace `count(1)` with `count(*)`: it is more efficient and gives the same result. Separately, many databases support grouping by ordinal position, and I believe Spark is one of them, so you can group by column position instead of repeating the expression.