Spark Avro error in PyCharm [TypeError: 'RecordSchema' object is not iterable]

tjrkku2a posted on 2021-07-12 in Spark

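I am trying to run a simple Spark program that reads an Avro file in a PyCharm environment. I keep running into the error below and have not been able to resolve it.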

from environment_variables import *
import avro.schema
from pyspark.sql import SparkSession

Schema = avro.schema.parse(open(SCHEMA_PATH, "rb").read())
print(Schema)
spark = SparkSession.builder.appName("indu").getOrCreate()
df = spark.read.format("avro").load(list(Schema))
print(df)

The printed schema looks like this:

{"type": "record", "name": "DefaultEventRecord", "namespace": "io.divolte.record", "fields": [{"type": "boolean", "name": "detectedDuplicate"}, {"type": "boolean", "name": "detectedCorruption"}, {"type": "boolean", "name": "firstInSession"}, {"type": "long", "name": "clientTimestamp"}, {"type": "long", "name": "timestamp"}, {"type": "string", "name": "remoteHost"}, {"type": ["null", "string"], "name": "referer", "default": null}, {"type": ["null", "string"], "name": "location", "default": null}, {"type": ["null", "int"], "name": "devicePixelRatio", "default": null}, {"type": ["null", "int"], "name": "viewportPixelWidth", "default": null}, {"type": ["null", "int"], "name": "viewportPixelHeight", "default": null}, {"type": ["null", "int"], "name": "screenPixelWidth", "default": null}, {"type": ["null", "int"], "name": "screenPixelHeight", "default": null}, {"type": ["null", "string"], "name": "partyId", "default": null}, {"type": ["null", "string"], "name": "sessionId", "default": null}, {"type": ["null", "string"], "name": "pageViewId", "default": null}, {"type": ["null", "string"], "name": "eventId", "default": null}, {"type": "string", "name": "eventType", "default": "unknown"}, {"type": ["null", "string"], "name": "userAgentString", "default": null}, {"type": ["null", "string"], "name": "userAgentName", "default": null}, {"type": ["null", "string"], "name": "userAgentFamily", "default": null}, {"type": ["null", "string"], "name": "userAgentVendor", "default": null}, {"type": ["null", "string"], "name": "userAgentType", "default": null}, {"type": ["null", "string"], "name": "userAgentVersion", "default": null}, {"type": ["null", "string"], "name": "userAgentDeviceCategory", "default": null}, {"type": ["null", "string"], "name": "userAgentOsFamily", "default": null}, {"type": ["null", "string"], "name": "userAgentOsVersion", "default": null}, {"type": ["null", "string"], "name": "userAgentOsVendor", "default": null}, {"type": ["null", "int"], "name": "cityIdField", 
"default": null}, {"type": ["null", "string"], "name": "cityNameField", "default": null}, {"type": ["null", "string"], "name": "countryCodeField", "default": null}, {"type": ["null", "int"], "name": "countryIdField", "default": null}, {"type": ["null", "string"], "name": "countryNameField", "default": null}]}

The error I get is:

21/03/02 16:06:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "X:\Git_repo\Project_Red\spark_streaming\spark_scripting.py", line 15, in <module>
    df = spark.read.format("avro").load(list(jsonFormatSchema))
TypeError: 'RecordSchema' object is not iterable

Any help would be appreciated.

kse8i1jr #1

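Three things need to be corrected in the code:
1. You do not have to load the schema file separately, because every Avro data file already carries its schema in the file header.
2. The load() method in spark.read.format("avro").load(list(Schema)) needs the path to the Avro data file, not the schema.
3. print(df) will not produce anything meaningful: it prints only the DataFrame's column names and types, not its rows. Use df.show() if you want to look at the data in the Avro file.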
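The TypeError itself comes from Python rather than from Spark: list() can only consume an iterable, and the parsed schema object is not one. Below is a minimal sketch of that failure. It uses a hypothetical stand-in class (SchemaStandIn) instead of avro's real RecordSchema, and a hypothetical helper (is_valid_load_argument) that only mirrors the kind of argument load() expects, namely a path string or a list of path strings.

```python
class SchemaStandIn:
    """Hypothetical stand-in for a parsed Avro schema; it defines no __iter__."""

    def __init__(self, name):
        self.name = name


def is_valid_load_argument(value):
    """Illustrative check (not part of pyspark): a path string or a list of them."""
    if isinstance(value, str):
        return True
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


schema = SchemaStandIn("DefaultEventRecord")

try:
    list(schema)  # same call shape as list(Schema) in the question
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print(is_valid_load_argument("events.avro"))  # "events.avro" is a made-up path
print(is_valid_load_argument(schema))
```

The exception is raised while the argument is being built, so Spark's load() is never reached. That is why the traceback points at the line in the script and not at any pyspark code.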
That said, you probably already have an idea of what has to change in the code:

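from pyspark.sql import SparkSession

# DATA_PATH is a placeholder: point it at the Avro data file, not the schema file
DATA_PATH = "path/to/data.avro"

spark = SparkSession.builder.appName("indu").getOrCreate()
df = spark.read.format("avro").load(DATA_PATH)  # schema comes from the file header
df.printSchema()         # column names and types
df.show(truncate=False)  # rows, without truncating long values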
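To check that a data file really carries its own schema, the header can be decoded with the standard library alone. The sketch below assumes the object container file layout described in the Avro specification: the magic bytes Obj\x01, followed by a metadata map that holds the avro.schema key. The names read_embedded_schema, _read_long and _read_bytes are illustrative and are not part of avro or pyspark.

```python
import json

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def _read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length integer)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def _read_bytes(stream):
    """Decode one Avro bytes/string value: a long length, then the payload."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_embedded_schema(stream):
    """Return the schema stored in the header of an Avro data file, as a dict."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count ends the metadata map
            break
        if count < 0:  # negative count: a byte size follows, which we skip
            _read_long(stream)
            count = -count
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            metadata[key] = _read_bytes(stream)
    return json.loads(metadata["avro.schema"])
```

Open the data file in binary mode and pass the file object, for example read_embedded_schema(open(DATA_PATH, "rb")). Assuming the file was written with the schema shown in the question, this should return the same record definition that was printed from the separate schema file.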
