How do I ignore double quotes when reading a CSV file in PySpark?

uubf1zoe  posted 12 months ago in Spark

I have a CSV file that looks like this:

ID, Tags
1,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",""Remarks"": "" EUC Personal Desktop"""
2,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",""Remarks"": "" EUC Personal Desktop"""
3,"""Remarks"": "" EUC Personal Desktop"",""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"""

Sample Tags values look like this:

"cluster": "csp-csp-cluster","kubernetes_node": ""

"Project": "S4","Type": "Dev"

"ApplicationID": "BSN0002278","ClusterId": "0803-071102-1n1kbom","ClusterName": "Riskman Quaitification Cluster","Creator": "[email protected]","DatabricksEnvironment": "workerenv-6764199050889752","Environment": "DEV","Remark": "EDP","RequestNo": "RITM0215195","Solution": "Riskman","Vendor": "Databricks","databricks-instance-name": "e82e5f2df9ae479bbb76766c48bdf3cc","x_Environment": "DEV"

"ApplicationID": "APP0005199","Environment": "Dev","Remark": "Buy@energy","RequestNo": "RITM0270765"

"ApplicationID": "","Environment": "","Remark": "","RequestNo": ""

"APM_ID": "empty","ApplicationID": "Empty","Environment": "PROD","Remark": "Logs","RequestNo": "empty"


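I want to read it into a Spark DataFrame so that each field's value is exactly what is written in the CSV, but without the quotes, and I also need to create a new column named ApplicationID.
Expected output: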

ID, Tags,ApplicationID
1,ApplicationID : 9AAG033396,Environment : PROD,Remarks:  EUC Personal Desktop,9AAG033396
2,ApplicationID : 9AAG033396,Environment : PROD,Remarks:  EUC Personal Desktop,9AAG033396
3,Remarks:  EUC Personal Desktop,ApplicationID : 9AAG033396,Environment : PROD,9AAG033396


The output I am getting:

ID, Tags,ApplicationID
1,"""ApplicationID "": ""9AAG033396"",9AAG033396
2,"""ApplicationID "": ""9AAG033396"",9AAG033396
3,"""Remarks"": "" EUC Personal Desktop"",null


This is how I read it in PySpark:

from pyspark.sql.functions import substring, when

df = spark.read\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .option("delimiter", ",")\
  .option("escapeQuotes", "true")\
  .option("multiLine","true")\
  .option('quote','"')\
  .csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/onetimeazamortizecostnew/onetimeazamortizecostnew/20230901-20231031/onetimeazamortizecostnew_17fcdeca-81ca-43e7-b181-36bc379e9644.csv")
df2 = df.withColumn("ApplicationID", when(df.Tags.contains("ApplicationID"),substring('Tags', 23,10))
                                    .otherwise(''))

I would like to know whether there is a way to achieve this with the escape option.

g6baxovj (answer #1)

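escape is exactly the option you are looking for. Set it to " so that a literal double quote inside a quoted field is read from its doubled form "".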

df = spark.read\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .option("escape",'"')\
  .csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/onetimeazamortizecostnew/onetimeazamortizecostnew/20230901-20231031/onetimeazamortizecostnew_17fcdeca-81ca-43e7-b181-36bc379e9644.csv")
df.show(truncate=False)
+---+----------------------------------------------------------------------------------------+
|ID | Tags                                                                                   |
+---+----------------------------------------------------------------------------------------+
|1  |"ApplicationID ": "9AAG033396","Environment ": "PROD","Remarks": " EUC Personal Desktop"|
|2  |"ApplicationID ": "9AAG033396","Environment ": "PROD","Remarks": " EUC Personal Desktop"|
|3  |"Remarks": " EUC Personal Desktop","ApplicationID ": "9AAG033396","Environment ": "PROD"|
+---+----------------------------------------------------------------------------------------+
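
To see the doubled-quote convention in isolation, here is a minimal sketch using Python's standard csv module rather than Spark. It is only an illustration: the record is copied from the question, and the csv module's doublequote=True setting plays the role that escape set to " plays in the Spark reader.

```python
import csv
import io

# One record copied from the question's CSV; inner quotes are doubled ("").
raw = (
    'ID, Tags\n'
    '1,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",'
    '""Remarks"": "" EUC Personal Desktop"""\n'
)

# doublequote=True (the default) treats "" inside a quoted field as one literal ".
rows = list(csv.reader(io.StringIO(raw), quotechar='"', doublequote=True))
header, record = rows[0], rows[1]

print(header)     # the second column name keeps its leading space: ' Tags'
print(record[1])  # the Tags value with single, unescaped quotes
```

Note that the header comes back as ' Tags' with a leading space, which is why the column is referred to as col(" Tags") later on.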

If you wrap the string in curly braces, it can be parsed as JSON, and you can then extract ApplicationID easily.

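from pyspark.sql.functions import col, concat, from_json, lit
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([StructField(c, StringType()) for c in ["ApplicationID ", "Environment ", "Remarks"]])
df.withColumn("Tags", from_json(concat(lit("{"), col(" Tags"), lit("}")), schema))\
  .withColumn("ApplicationID", col("Tags.ApplicationID "))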
+---+-----------------------------------------+-------------+
|ID |Tags                                     |ApplicationID|
+---+-----------------------------------------+-------------+
|1  |{9AAG033396, PROD,  EUC Personal Desktop}|9AAG033396   |
|2  |{9AAG033396, PROD,  EUC Personal Desktop}|9AAG033396   |
|3  |{9AAG033396, PROD,  EUC Personal Desktop}|9AAG033396   |
+---+-----------------------------------------+-------------+
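
The brace-wrapping idea can be checked outside Spark as well. The sketch below uses the standard json module on the row 3 value from the first output table. It is not the Spark code path, just the same transformation in plain Python, and it shows why a key lookup works where a fixed-position substring does not.

```python
import json

# Unescaped Tags value for row 3, as shown in the first output table above.
tags = '"Remarks": " EUC Personal Desktop","ApplicationID ": "9AAG033396","Environment ": "PROD"'

# Wrapping the comma-separated "key": "value" pairs in braces yields a JSON object.
parsed = json.loads("{" + tags + "}")

# Lookup is by key, so the position of ApplicationID in the string does not matter.
application_id = parsed["ApplicationID "]  # note the trailing space in the key
print(application_id)
```

In row 3 the ApplicationID pair is not the first one, which is the case where substring('Tags', 23, 10) from the question returns the wrong characters.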


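Note: there seem to be some odd spaces in your CSV file (for example "ApplicationID "). I am not sure whether that is really the case, but I wrote the code samples to match it.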
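
If the stray spaces turn out to be inconsistent across rows, one option is to normalise the keys after parsing. The following is a plain-Python sketch under that assumption. The helper names parse_tags and application_id are made up for illustration, and the sample strings are shortened from the values in the question. In Spark the same logic would have to go through a UDF or an equivalent column expression.

```python
import json
from typing import Optional


def parse_tags(tags: str) -> dict:
    """Parse a Tags string into a dict, stripping whitespace around keys."""
    parsed = json.loads("{" + tags + "}")
    return {key.strip(): value for key, value in parsed.items()}


def application_id(tags: str) -> Optional[str]:
    """Return the ApplicationID value, or None when the key is absent."""
    return parse_tags(tags).get("ApplicationID")


# Hypothetical inputs, shortened from the sample Tags values in the question.
with_space = '"ApplicationID ": "9AAG033396","Environment ": "PROD"'
without_space = '"ApplicationID": "APP0005199","Environment": "Dev"'
no_app_id = '"Project": "S4","Type": "Dev"'

print(application_id(with_space))
print(application_id(without_space))
print(application_id(no_app_id))
```

One caveat: if a row contained both "ApplicationID" and "ApplicationID " as separate keys, stripping would merge them and the later value would win.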
