I'm new to Spark. We are converting all of our Hive statements to run on Spark DataFrames. I know the whole SQL statement can be executed with spark.sql, but we need to express it with Spark transformations instead.

Hive query:
select REGEXP_REPLACE(nid,'"','') as hash,REGEXP_REPLACE(emailnew,'"','') as email,REGEXP_REPLACE(employer1website,'"','') as domain,REGEXP_REPLACE(city,'"','') as city,
REGEXP_REPLACE(state,'"','') as state,REGEXP_REPLACE(email_domain,'"','') as emaildomain
from mytestdb.cookiedata_cleansed
where (REGEXP_REPLACE(nid,'"','')) in (select (REGEXP_REPLACE(md5,'"','')) from mytestdb.ckg_distincthash_01dec)
group by REGEXP_REPLACE(nid,'"',''),REGEXP_REPLACE(emailnew,'"',''),REGEXP_REPLACE(employer1website,'"',''),REGEXP_REPLACE(city,'"',''),
REGEXP_REPLACE(state,'"',''),REGEXP_REPLACE(email_domain,'"','');
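For reference, the spark.sql route mentioned above does work as-is (a minimal sketch, assuming a Hive-enabled SparkSession named spark):

// Run the Hive statement verbatim through the SQL interface.
val sqlResult = spark.sql("""
  select regexp_replace(nid, '"', '') as hash,
         regexp_replace(emailnew, '"', '') as email,
         regexp_replace(employer1website, '"', '') as domain,
         regexp_replace(city, '"', '') as city,
         regexp_replace(state, '"', '') as state,
         regexp_replace(email_domain, '"', '') as emaildomain
  from mytestdb.cookiedata_cleansed
  where regexp_replace(nid, '"', '') in
        (select regexp_replace(md5, '"', '') from mytestdb.ckg_distincthash_01dec)
  group by regexp_replace(nid, '"', ''), regexp_replace(emailnew, '"', ''),
           regexp_replace(employer1website, '"', ''), regexp_replace(city, '"', ''),
           regexp_replace(state, '"', ''), regexp_replace(email_domain, '"', '')
""")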
I got this far; how do I bring the second DataFrame into this, and how do I apply the WHERE condition with the regex?
import org.apache.spark.sql.functions.{col, regexp_replace}

// regexp_replace takes string arguments, not char literals, and
// withColumn already names the output column, so .alias is redundant.
val df3 = df
  .withColumn("hash", regexp_replace(col("nid"), "\"", ""))
  .withColumn("email", regexp_replace(col("emailnew"), "\"", ""))
  .withColumn("domain", regexp_replace(col("employer1website"), "\"", ""))
  .withColumn("city", regexp_replace(col("city"), "\"", ""))