Reference 'column_name' is ambiguous in Spark SQL

zphenhs4  posted on 2021-05-17  in  Spark

I am new to both Spark and SQL. I am trying to execute a SQL query using Spark with Scala. Here is the SQL query:

SELECT 
        a.*
    FROM
    (   SELECT 
                a1.id_bu,
                a1.nama,
                a1.id_Bentuk_bu,
                a1.id_bentuk_usaha,
                a1.id_Jenis_bu,
                a1.id_Jenis_bu_kbli,
                a1.alamat,
                a1.kodepos,
                a1.telepon,
                a1.fax,
                a1.email,
                a1.website,
                a1.id_kabupaten,
                a1.id_propinsi,
                a1.npwp,
                a1.no_spt as modal_dasar,
                a1.log,
                a2.bu_nomor
            FROM 
                bu a1,
                bu_nomor a2
            where
                    a1.id_bu = a2.id_bu
                AND a1.id_propinsi = a2.id_propinsi 
    ) as a,

    ( SELECT 
            b.id_bu,
            b.id_sub_klasifikasi_kbli,
            b.kualifikasi_kbli,
            b.id_asosiasi_bu,
            b.propinsi,
            b.tgl_permohonan,
            c.tgl_habis
        FROM
            ( SELECT 
                    b1.id_bu,
                    b1.id_sub_klasifikasi_kbli,
                    b1.kualifikasi_kbli,
                    b1.id_asosiasi_bu,
                    b1.propinsi,
                    b1.tgl_permohonan
                FROM 
                    bu_registrasi_history_kbli b1
                WHERE 
                        b1.id_status = '4'
                    AND b1.tgl_proses < '2018-03-01' ) as b,
            ( SELECT 
                    c1.id_bu,
                    c1.id_klasifikasi,
                    c1.id_asosiasi_bu,
                    c1.tgl_habis
                FROM 
                    bu_sbu_kbli c1
                WHERE 
                    c1.tgl_habis >= '2018-03-01' ) as c
        WHERE 
                b.id_bu = c.id_bu
            AND SUBSTR( b.id_sub_klasifikasi_kbli, 1, 3) = c.id_klasifikasi
            AND b.id_asosiasi_bu = c.id_asosiasi_bu
    UNION all 
    SELECT 
            d.id_bu,
            d.id_sub_klasifikasi_kbli,
            d.kualifikasi_kbli,
            d.id_asosiasi_bu,
            d.propinsi,
            d.tgl_permohonan,
            e.tgl_habis
        FROM
            ( SELECT 
                    d1.id_bu,
                    d1.id_sub_klasifikasi_kbli,
                    d1.kualifikasi_kbli,
                    d1.id_asosiasi_bu,
                    d1.propinsi,
                    d1.tgl_permohonan
                FROM 
                    bu_registrasi_history_kbli_hapus d1
                WHERE 
                        d1.id_status='4'
                    AND d1.tgl_proses<'2018-03-01' ) as d,
            ( SELECT 
                    e1.id_bu,
                    e1.id_klasifikasi,
                    e1.id_asosiasi_bu,
                    e1.tgl_habis
                FROM 
                    bu_sbu_kbli_hapus e1
                WHERE
                    e1.tgl_habis >= '2018-03-01' ) as e
        WHERE 
                d.id_bu = e.id_bu
            AND SUBSTR( d.id_sub_klasifikasi_kbli, 1, 3) = e.id_klasifikasi
            AND d.id_asosiasi_bu = e.id_asosiasi_bu
        GROUP BY 
            id_bu,
            id_sub_klasifikasi_kbli
        ORDER BY 
            tgl_habis,
            tgl_permohonan DESC) x1
    WHERE 
        a.id_bu = x1.id_bu
    GROUP BY 
        x1.id_bu

I get the following error:

org.apache.spark.sql.AnalysisException: Reference 'id_bu' is ambiguous, could be: d.id_bu, e.id_bu.; line 81 pos 12
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)

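I don't understand what the error means. Is it about two columns having the same name? If I use d.id_bu and d.id_sub_klasifikasi_kbli in the last GROUP BY, as the error suggests, I get a different error: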

'd.`kualifikasi_kbli`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [id_bu#21], [id_bu#1, nama#2, id_Bentuk_bu#3, id_bentuk_usaha#4, id_Jenis_bu#5, id_Jenis_bu_kbli#6, alamat#7, kodepos#8, telepon#9, fax#10, email#11, website#12, id_kabupaten#13, id_propinsi#14, npwp#15, modal_dasar#0, log#17, bu_nomor#19]

Do you know how I can solve this? Thank you!

bnlyeluc 1#

This is really a basic SQL error; nothing about it is specific to Scala or Spark. The problem is in this clause:

GROUP BY 
            id_bu

Here you must qualify the column with the alias of the subquery you want to group by: GROUP BY d.id_bu or GROUP BY e.id_bu.
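
A minimal sketch of that fix, assuming Spark SQL. The CTEs d and e below are hypothetical inline rows standing in for the question's "d" and "e" subqueries; only the join and the GROUP BY are reproduced. Only the grouped columns are selected here, because selecting other columns without aggregating them leads to the second error shown in the question.

```sql
-- Hypothetical sample rows; the real query reads from
-- bu_registrasi_history_kbli_hapus and bu_sbu_kbli_hapus instead.
WITH d AS (
    SELECT * FROM VALUES
        (1, '123456'),
        (1, '123456'),
        (2, '654321')
    AS t (id_bu, id_sub_klasifikasi_kbli)
),
e AS (
    SELECT * FROM VALUES
        (1, '123'),
        (2, '654')
    AS t (id_bu, id_klasifikasi)
)
SELECT
    d.id_bu,
    d.id_sub_klasifikasi_kbli
FROM
    d,
    e
WHERE
        d.id_bu = e.id_bu
    AND SUBSTR(d.id_sub_klasifikasi_kbli, 1, 3) = e.id_klasifikasi
GROUP BY
    d.id_bu,                      -- qualified, so no longer ambiguous
    d.id_sub_klasifikasi_kbli
```

Because the join condition already requires d.id_bu = e.id_bu, grouping by e.id_bu instead should produce the same groups; d is used here only to match the SELECT list.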

oxf4rvwz 2#

You must qualify the column with its table (or subquery alias) in the GROUP BY clause.

chhqkbe1 3#

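After going through your query, reformatting it for readability, and working out what each SELECT subquery and the UNION does, I found that the error refers (as already mentioned) to the GROUP BY clause, in the part where you join the "d" and "e" subqueries. Since both of them have an id_bu column, you just need to qualify it with one of them:
d.id_bu
or
e.id_bu
But because this is part of a UNION ALL select, the same could also apply to your "b" or "c" aliases.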
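
One point that may help here: in SQL, a GROUP BY belongs to the single SELECT it follows, so in the question's query it applies only to the d/e branch, which matches the aliases named in the error message. If the intent was to group the combined rows of both branches, one option is to wrap the UNION ALL in a subquery and group outside it. A sketch under that assumption, with hypothetical inline rows and a hypothetical alias u:

```sql
SELECT
    u.id_bu,
    u.id_sub_klasifikasi_kbli,
    MAX(u.tgl_habis) AS tgl_habis   -- aggregate chosen only for illustration
FROM (
    -- stands in for the b/c branch of the question
    SELECT * FROM VALUES (1, '123456', '2018-05-01')
        AS b (id_bu, id_sub_klasifikasi_kbli, tgl_habis)
    UNION ALL
    -- stands in for the d/e branch of the question
    SELECT * FROM VALUES (1, '123456', '2019-05-01')
        AS d (id_bu, id_sub_klasifikasi_kbli, tgl_habis)
) AS u
GROUP BY
    u.id_bu,
    u.id_sub_klasifikasi_kbli
```

Outside the subquery only the alias u is visible, so there is a single id_bu to refer to and no ambiguity.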
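Next is the non-aggregate error. When you use GROUP BY, every column you return must either be listed in the GROUP BY or be wrapped in an aggregate function. In this case you are grouping by two columns, but returning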

d.id_bu,
d.id_sub_klasifikasi_kbli,
d.kualifikasi_kbli,
d.id_asosiasi_bu,
d.propinsi,
d.tgl_permohonan,
e.tgl_habis

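So you need to decide how to apply MIN, MAX, AVG, or some other aggregate to the non-grouped columns. If the remaining five columns can vary within a group, either wrap each of them in MIN() or MAX() (or first(), as the error message suggests) so that a single value is returned per group, or add all of them to the GROUP BY.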
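
Both options can be sketched as follows, assuming Spark SQL. The temporary view d_sample and its rows are hypothetical and carry only two of the five non-grouped columns to keep the example short.

```sql
-- Hypothetical sample data: two rows that share the same grouping key.
CREATE OR REPLACE TEMPORARY VIEW d_sample AS
SELECT * FROM VALUES
    (1, '123456', 'K1', '2017-01-10'),
    (1, '123456', 'K2', '2017-06-05')
AS t (id_bu, id_sub_klasifikasi_kbli, kualifikasi_kbli, tgl_permohonan);

-- Option 1: keep the two grouping columns and aggregate everything else.
SELECT
    id_bu,
    id_sub_klasifikasi_kbli,
    MAX(kualifikasi_kbli) AS kualifikasi_kbli,
    MAX(tgl_permohonan)   AS tgl_permohonan
FROM d_sample
GROUP BY
    id_bu,
    id_sub_klasifikasi_kbli;

-- Option 2: add every returned column to the GROUP BY.
SELECT
    id_bu,
    id_sub_klasifikasi_kbli,
    kualifikasi_kbli,
    tgl_permohonan
FROM d_sample
GROUP BY
    id_bu,
    id_sub_klasifikasi_kbli,
    kualifikasi_kbli,
    tgl_permohonan;
```

With these sample rows, option 1 collapses the pair into one row per (id_bu, id_sub_klasifikasi_kbli), while option 2 keeps one row per distinct combination of all four columns. Note that in option 1 each MAX() is evaluated per column, so the returned values may come from different source rows.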
