I'm new to both Spark and SQL. I'm trying to execute a SQL query using Spark Scala. Here is the SQL query:
SELECT
a.*
FROM
( SELECT
a1.id_bu,
a1.nama,
a1.id_Bentuk_bu,
a1.id_bentuk_usaha,
a1.id_Jenis_bu,
a1.id_Jenis_bu_kbli,
a1.alamat,
a1.kodepos,
a1.telepon,
a1.fax,
a1.email,
a1.website,
a1.id_kabupaten,
a1.id_propinsi,
a1.npwp,
a1.no_spt as modal_dasar,
a1.log,
a2.bu_nomor
FROM
bu a1,
bu_nomor a2
where
a1.id_bu = a2.id_bu
AND a1.id_propinsi = a2.id_propinsi
) as a,
( SELECT
b.id_bu,
b.id_sub_klasifikasi_kbli,
b.kualifikasi_kbli,
b.id_asosiasi_bu,
b.propinsi,
b.tgl_permohonan,
c.tgl_habis
FROM
( SELECT
b1.id_bu,
b1.id_sub_klasifikasi_kbli,
b1.kualifikasi_kbli,
b1.id_asosiasi_bu,
b1.propinsi,
b1.tgl_permohonan
FROM
bu_registrasi_history_kbli b1
WHERE
b1.id_status = '4'
AND b1.tgl_proses < '2018-03-01' ) as b,
( SELECT
c1.id_bu,
c1.id_klasifikasi,
c1.id_asosiasi_bu,
c1.tgl_habis
FROM
bu_sbu_kbli c1
WHERE
c1.tgl_habis >= '2018-03-01' ) as c
WHERE
b.id_bu = c.id_bu
AND SUBSTR( b.id_sub_klasifikasi_kbli, 1, 3) = c.id_klasifikasi
AND b.id_asosiasi_bu = c.id_asosiasi_bu
UNION all
SELECT
d.id_bu,
d.id_sub_klasifikasi_kbli,
d.kualifikasi_kbli,
d.id_asosiasi_bu,
d.propinsi,
d.tgl_permohonan,
e.tgl_habis
FROM
( SELECT
d1.id_bu,
d1.id_sub_klasifikasi_kbli,
d1.kualifikasi_kbli,
d1.id_asosiasi_bu,
d1.propinsi,
d1.tgl_permohonan
FROM
bu_registrasi_history_kbli_hapus d1
WHERE
d1.id_status='4'
AND d1.tgl_proses<'2018-03-01' ) as d,
( SELECT
e1.id_bu,
e1.id_klasifikasi,
e1.id_asosiasi_bu,
e1.tgl_habis
FROM
bu_sbu_kbli_hapus e1
WHERE
e1.tgl_habis >= '2018-03-01' ) as e
WHERE
d.id_bu = e.id_bu
AND SUBSTR( d.id_sub_klasifikasi_kbli, 1, 3) = e.id_klasifikasi
AND d.id_asosiasi_bu = e.id_asosiasi_bu
GROUP BY
id_bu,
id_sub_klasifikasi_kbli
ORDER BY
tgl_habis,
tgl_permohonan DESC) x1
WHERE
a.id_bu = x1.id_bu
GROUP BY
x1.id_bu
I get the following error:
org.apache.spark.sql.AnalysisException: Reference 'id_bu' is ambiguous, could be: d.id_bu, e.id_bu.; line 81 pos 12
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
I don't know what this error means. Is it about two columns having the same name? If I try to use d.id_bu and d.id_sub_klasifikasi_kbli in the last GROUP BY, as the error suggests, I get:
'd.`kualifikasi_kbli`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [id_bu#21], [id_bu#1, nama#2, id_Bentuk_bu#3, id_bentuk_usaha#4, id_Jenis_bu#5, id_Jenis_bu_kbli#6, alamat#7, kodepos#8, telepon#9, fax#10, email#11, website#12, id_kabupaten#13, id_propinsi#14, npwp#15, modal_dasar#0, log#17, bu_nomor#19]
Any idea how I can solve this? Thank you!
3 Answers
bnlyeluc1#
This is actually a basic SQL error; there is nothing Scala- or Spark-specific about it.
In that clause you have to specify the alias of the subquery you want to group by:
group by d.id_bu
or
group by e.id_bu
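The ambiguity the answer describes can be reproduced in any SQL engine. Below is a minimal sketch using Python's built-in sqlite3 with hypothetical toy tables named after the `d` and `e` subqueries (the table layouts and data are invented for illustration):

```python
import sqlite3

# Toy stand-ins for the "d" and "e" subqueries (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE d (id_bu TEXT, tgl_permohonan TEXT);
    CREATE TABLE e (id_bu TEXT, tgl_habis TEXT);
    INSERT INTO d VALUES ('BU1', '2017-05-01'), ('BU1', '2017-06-01');
    INSERT INTO e VALUES ('BU1', '2018-04-01');
""")

# Unqualified id_bu: both d and e expose a column with that name,
# so the engine cannot tell which one is meant.
msg = ""
try:
    conn.execute("SELECT id_bu FROM d JOIN e ON d.id_bu = e.id_bu")
except sqlite3.OperationalError as err:
    msg = str(err)
print(msg)  # ambiguous column name: id_bu

# Qualifying the column with its alias removes the ambiguity.
rows = conn.execute("""
    SELECT d.id_bu
    FROM d JOIN e ON d.id_bu = e.id_bu
    GROUP BY d.id_bu
""").fetchall()
print(rows)
```

The error text differs by engine (SQLite says "ambiguous column name", Spark says "Reference ... is ambiguous"), but the cause and the fix are the same: prefix the column with the alias of the subquery you mean.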
oxf4rvwz2#
You must specify the table in the GROUP BY clause.
chhqkbe13#
After checking your query, reformatting it for better readability, and going through everything in the SELECT subqueries and the UNION, I found that the error refers (as noted above) to the GROUP BY clause in which you join the "d" and "e" tables. Since both of them have id_bu, you just need to qualify one of them:
d.id_bu
or
e.id_bu
But because this is part of a UNION ALL select, it could just as well be your "b" or "c" alias.
Next comes the non-aggregate error. To use GROUP BY, you must specify the columns you are grouping by. In this case you are grouping by two columns but returning seven.
So you need to decide how to apply MIN, MAX, AVG, or some other aggregate to the non-grouped columns. If the remaining five are allowed to vary, do you wrap each of them in MIN() or MAX() so they can still be returned, or do you add them all to your GROUP BY?
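The fix described above can be sketched concretely. This is a minimal example, again using sqlite3 with an invented toy table (`reg`, with shortened column names) standing in for the registration subquery: group by two columns and wrap every other selected column in an aggregate so the engine knows which value to keep per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reg (id_bu TEXT, id_sub TEXT, kualifikasi TEXT, tgl_permohonan TEXT);
    INSERT INTO reg VALUES
        ('BU1', 'S01', 'K1', '2017-01-10'),
        ('BU1', 'S01', 'K2', '2017-06-20'),
        ('BU2', 'S02', 'K1', '2017-03-05');
""")

# Group by two columns; every other selected column is wrapped in an
# aggregate (MAX here) so each group collapses to a single row.
rows = conn.execute("""
    SELECT id_bu, id_sub,
           MAX(kualifikasi)    AS kualifikasi,
           MAX(tgl_permohonan) AS tgl_permohonan
    FROM reg
    GROUP BY id_bu, id_sub
    ORDER BY id_bu
""").fetchall()
print(rows)
# [('BU1', 'S01', 'K2', '2017-06-20'), ('BU2', 'S02', 'K1', '2017-03-05')]
```

Whether MAX, MIN, or FIRST-style aggregates are appropriate depends on which row's value you actually want for each non-grouped column; Spark's error message hints at `first()` for the case where you don't care.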