I am trying to create multiple DataFrames from the two lists below:
val paths = ListBuffer("s3://abc_xyz_tableA.json",
"s3://def_xyz_tableA.json",
"s3://abc_xyz_tableB.json",
"s3://def_xyz_tableB.json",
"s3://abc_xyz_tableC.json",....)
val tableNames = ListBuffer("tableA","tableB","tableC","tableD",....)
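I want to create one DataFrame per table name by grouping together all the S3 paths that end in the same table name, since the files for a given table share the same schema.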
For example, once the tables and their related paths are brought together:
"tableADF" will hold all the data from "s3://abc_xyz_tableA.json" and "s3://def_xyz_tableA.json", since both have "tableA" in the path.
"tableBDF" will hold all the data from "s3://abc_xyz_tableB.json" and "s3://def_xyz_tableB.json", since both have "tableB" in the path.
And so on; there can be many table names and paths.
I have been trying different approaches but have not succeeded yet. Any pointers toward a solution would be a great help. Thanks!
3 Answers
Answer #1 (tf7tbtn2)
Using the input_file_name() function, you can filter on the file name to get a DataFrame for each file or file pattern.

Answer #2 (zhte4eai)
Check the code below; the final result type is
scala.collection.immutable.Map[String,org.apache.spark.sql.DataFrame] = Map(tableBDF -> [...], tableADF -> [...], tableCDF -> [...])
where ... is your list of columns.

Answer #3 (9bfwbjaz)
If the list of file-name suffixes is long, you can use the code below; explanations are given inline in the code.
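
A minimal sketch of the grouping step, in plain Scala so it runs without a cluster. It is not the code from any of the answers. The object GroupPathsByTable and the helper groupByTable are hypothetical names, and the sketch assumes that a path belongs to a table when it ends in the table name followed by ".json", as in the sample paths from the question.

```scala
object GroupPathsByTable {
  // Hypothetical helper (not from the original answers): for every table name,
  // keep the paths that end in "<table>.json" and key the group as "<table>DF".
  // Table names that match no path are dropped from the result.
  def groupByTable(paths: Seq[String], tableNames: Seq[String]): Map[String, Seq[String]] =
    tableNames
      .map(t => s"${t}DF" -> paths.filter(_.endsWith(s"$t.json")))
      .filter { case (_, matched) => matched.nonEmpty }
      .toMap

  def main(args: Array[String]): Unit = {
    // Sample values taken from the question; the real lists are longer.
    val paths = Seq(
      "s3://abc_xyz_tableA.json",
      "s3://def_xyz_tableA.json",
      "s3://abc_xyz_tableB.json",
      "s3://def_xyz_tableB.json",
      "s3://abc_xyz_tableC.json"
    )
    val tableNames = Seq("tableA", "tableB", "tableC", "tableD")

    val grouped = groupByTable(paths, tableNames)

    // Print each group in a stable order.
    grouped.toSeq.sortBy(_._1).foreach { case (name, matched) =>
      println(s"$name -> ${matched.mkString(", ")}")
    }

    // Assumption: with a SparkSession named `spark` in scope, each group could
    // then be loaded in one call, because DataFrameReader.json accepts several paths:
    //   val dfs = grouped.map { case (name, matched) => name -> spark.read.json(matched: _*) }
    // which would give a Map[String, DataFrame] keyed by "tableADF", "tableBDF", and so on.
  }
}
```

Keeping the grouping separate from the read makes the matching rule easy to check on its own. If one table name can be the suffix of another (for example "tableA" and "xtableA"), a stricter rule such as matching on "_" + table name + ".json" would be needed.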