Approach-2:

`Scala:`

```
// imports needed when not running in spark-shell
import org.apache.spark.sql.functions._
import spark.implicits._
val df = spark.read.table("<db>.<tab_name>")
//add input_file_name
val df1 = df.withColumn("filename",input_file_name())
// filter only the 9th bucket's filename and select only the required columns
val ninth_buk = df1.filter('filename.contains("000008_0")).select(df.columns.head,df.columns.tail:_*)
ninth_buk.show()
```

`pyspark:`

```
from pyspark.sql.functions import *
df = spark.read.table("<db>.<tab_name>")
df1 = df.withColumn("filename",input_file_name())
ninth_buk = df1.filter(col("filename").contains("000008_0")).select(*df.columns)
ninth_buk.show()
```

If you have a large amount of data we don't recommend Approach-2, because the whole DataFrame has to be filtered..!!

In Hive:

```
set hive.support.quoted.identifiers=none;

select `(fn)?+.+`
from (select *, input__file__name fn from table_name) e
where e.fn like '%000008_0%';
```

(With quoted identifiers disabled, the regex column specification `(fn)?+.+` returns every column except the helper column `fn`.)
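As a related, hedged sketch (not part of the original answer), the same `input_file_name()` trick can be used to count the rows in each underlying bucket file, which is a quick way to confirm that `000008_0` is the file you want before filtering on it:

```
import org.apache.spark.sql.functions.input_file_name

val df = spark.read.table("<db>.<tab_name>")

// rows per underlying bucket file; the 9th bucket appears as .../000008_0
df.withColumn("filename", input_file_name())
  .groupBy("filename")
  .count()
  .orderBy("filename")
  .show(false)
```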
3 Answers

whhtz7ly1#
where `bucketing_table` is your bucketed table name (see the sketch below).
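As a hedged illustration only (the answer's exact query is not quoted here), Hive can read a single bucket of a bucketed table with the `TABLESAMPLE` clause. The bucket count 32 and the column `clustered_col` below are placeholders for your table's actual `CLUSTERED BY` definition, not values taken from the answer:

```
-- hedged sketch: read only the 9th bucket of a table bucketed into 32 buckets
SELECT *
FROM bucketing_table TABLESAMPLE (BUCKET 9 OUT OF 32 ON clustered_col) t;
```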
rta7y2nd2#
You can achieve this in different ways:

Approach-1: Get the table's stored location from `desc formatted <db>.<tab_name>`, then read the 9th bucket file directly from the HDFS filesystem.

(or)

Approach-2: Use `input_file_name()`, then filter only the 9th bucket's data by its filename.

In both approaches the 9th bucket's file is `000008_0`, because Hive numbers bucket files starting from `000000_0`.

Example:
Approach-1:

`Scala:`

```
val df = spark.sql("desc formatted <db>.<tab_name>")
//get table location in hdfs path
val loc_hdfs = df.filter('col_name === "Location").select("data_type").collect.map(x => x(0)).mkString
//based on your table format change the read format
val ninth_buk = spark.read.orc(s"${loc_hdfs}/000008_0*")
//display the data
ninth_buk.show()
```

`Pyspark:`

```
from pyspark.sql.functions import *

df = spark.sql("desc formatted <db>.<tab_name>")
loc_hdfs = df.filter(col("col_name") == "Location").select("data_type").collect()[0]["data_type"]
ninth_buk = spark.read.orc(loc_hdfs + "/000008_0*")
ninth_buk.show()
```
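A hedged Scala sketch (not from the answer) that can be handy with Approach-1: it lists the files under the table location so you can confirm the 9th bucket's file name (bucket files are numbered from `000000_0`), and shows the reader to use if the table is stored as Parquet instead of ORC. It assumes the `loc_hdfs` value from the Scala example above:

```
import org.apache.hadoop.fs.{FileSystem, Path}

// list the bucket files under the table location; the 9th bucket should be 000008_0
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(loc_hdfs))
  .map(_.getPath.getName)
  .sorted
  .foreach(println)

// if the table is stored as Parquet rather than ORC, switch the reader accordingly
val ninth_buk_parquet = spark.read.parquet(s"${loc_hdfs}/000008_0*")
```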
iovurdzv3#
If it is an ORC table