spark-xml on Jupyter notebook

7rtdyuoh posted on 2021-07-14 in Spark


I am trying to run spark-xml in my Jupyter notebook so that I can read XML files with Spark.

from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'

I found that this is the way to use it. But when I try to import com.databricks.spark.xml._ , I get the error:
No module named 'com'

apache-spark pyspark jupyter-notebook

Source: https://stackoverflow.com/questions/66672807/spark-xml-on-jupyter-notebook

1 Answer


dohp0rv51#

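I see that you cannot load the XML file as-is with PySpark and the Databricks library. This problem comes up often. Try running this command from a terminal, or from the notebook as a shell command: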

pyspark --packages com.databricks:spark-xml_2.11:0.4.1
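The same package can also be requested from inside the notebook. The error in the question appears because com.databricks.spark.xml._ is a Scala import; from PySpark the package is addressed by its data source name instead. Below is a minimal sketch, assuming no SparkSession has been started in the kernel yet and that the Scala suffix (2.11 here) matches the Scala version of your Spark build. The helper names build_submit_args and read_xml are illustrative and not part of spark-xml.

```python
import os


def build_submit_args(scala_version="2.11", package_version="0.4.1"):
    """Build the PYSPARK_SUBMIT_ARGS value that pulls in spark-xml."""
    coordinate = f"com.databricks:spark-xml_{scala_version}:{package_version}"
    return f"--packages {coordinate} pyspark-shell"


# Must be set before the first SparkSession/SparkContext is created,
# otherwise the already-running JVM will not pick up the package.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args()


def read_xml(spark, path, row_tag):
    """Load XML through the DataFrameReader; no Python import of the package is needed."""
    return (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", row_tag)  # element name that marks one row
        .load(path)
    )
```

With a session available as spark, a call would look like read_xml(spark, "/path/to/data/*.xml", "record"), where "record" stands in for whatever element wraps one row in your files.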

If that does not work, you can try this workaround: read the files as text and then parse them yourself.


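import xml.etree.ElementTree as ET

# Define your parser function. flatMap calls it once per Row; with
# wholetext=True each Row holds one whole file in its `value` column.

def parse_xml(row):
    """
    Read the XML string from the row, parse it and extract the elements,
    then return a list of lists (one inner list per record).
    """
    root = ET.fromstring(row.value)
    # Assumption: each direct child of the root is one record and its
    # children are the fields; adapt this to your document layout.
    results = [[field.text for field in record] for record in root]
    return results

# read each file as a single text record, at the RDD level

file_rdd = spark.read.text("/path/to/data/*.xml", wholetext=True).rdd

# parse the XML tree, extract the records and transform to a new RDD

records_rdd = file_rdd.flatMap(parse_xml)

# convert the RDD to a DataFrame; my_schema must be defined beforehand
# (a StructType or a list of column names matching the inner lists)

output_df = records_rdd.toDF(my_schema)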
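Before running the parser on Spark, it can be checked on a plain string. The sketch below assumes a hypothetical layout of <book> elements with <title> and <author> children; the element names, the TextRow stand-in, and parse_books are illustrative only and should be replaced with whatever your files contain.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Stand-in for a pyspark Row with a single `value` column,
# which is what spark.read.text(..., wholetext=True).rdd yields.
TextRow = namedtuple("TextRow", ["value"])


def parse_books(row):
    """Return one [title, author] list per <book> element in the document."""
    root = ET.fromstring(row.value)
    return [
        [book.findtext("title"), book.findtext("author")]
        for book in root.iter("book")
    ]


sample = TextRow(
    "<catalog>"
    "<book><title>A</title><author>X</author></book>"
    "<book><title>B</title><author>Y</author></book>"
    "</catalog>"
)
records = parse_books(sample)
print(records)
```

In the notebook the same function would be passed to flatMap, for example file_rdd.flatMap(parse_books).toDF(["title", "author"]), since toDF also accepts a list of column names.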

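If .toDF does not work, import spark.implicits._ (this applies to Scala; in PySpark, .toDF becomes available on RDDs once a SparkSession has been created).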

Answered 2021-07-14
