I have a CSV file that looks like this:
ID, Tags
1,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",""Remarks"": "" EUC Personal Desktop"""
2,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",""Remarks"": "" EUC Personal Desktop"""
3,"""Remarks"": "" EUC Personal Desktop"",""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"""
Sample Tags values look like this:
"cluster": "csp-csp-cluster","kubernetes_node": ""
"Project": "S4","Type": "Dev"
"ApplicationID": "BSN0002278","ClusterId": "0803-071102-1n1kbom","ClusterName": "Riskman Quaitification Cluster","Creator": "[email protected]","DatabricksEnvironment": "workerenv-6764199050889752","Environment": "DEV","Remark": "EDP","RequestNo": "RITM0215195","Solution": "Riskman","Vendor": "Databricks","databricks-instance-name": "e82e5f2df9ae479bbb76766c48bdf3cc","x_Environment": "DEV"
"ApplicationID": "APP0005199","Environment": "Dev","Remark": "Buy@energy","RequestNo": "RITM0270765"
"ApplicationID": "","Environment": "","Remark": "","RequestNo": ""
"APM_ID": "empty","ApplicationID": "Empty","Environment": "PROD","Remark": "Logs","RequestNo": "empty"
I want to read it into Spark as a DataFrame in which each field's value is exactly what is written in the CSV, but without the quotes, and I need to create a new column named ApplicationID.
Expected output:
ID, Tags,ApplicationID
1,ApplicationID : 9AAG033396,Environment : PROD,Remarks: EUC Personal Desktop,9AAG033396
2,ApplicationID : 9AAG033396,Environment : PROD,Remarks: EUC Personal Desktop,9AAG033396
3,Remarks: EUC Personal Desktop,ApplicationID : 9AAG033396,Environment : PROD,9AAG033396
The output I am getting:
ID, Tags,ApplicationID
1,"""ApplicationID "": ""9AAG033396"",9AAG033396
2,"""ApplicationID "": ""9AAG033396"",9AAG033396
3,"""Remarks"": "" EUC Personal Desktop"",null
In PySpark, I am reading it like this:
df = spark.read\
.option("header", "true")\
.option("inferSchema", "true")\
.option("delimiter", ",")\
.option("escapeQuotes", "true")\
.option("multiLine","true")\
.option('quote','"')\
.csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/onetimeazamortizecostnew/onetimeazamortizecostnew/20230901-20231031/onetimeazamortizecostnew_17fcdeca-81ca-43e7-b181-36bc379e9644.csv")
df2 = df.withColumn("ApplicationID", when(df.Tags.contains("ApplicationID"),substring('Tags', 23,10))
.otherwise(''))
I would like to know whether there is a way to achieve this using the escape option.
1 Answer
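The escape option is exactly what you are looking for. Set it to " so that a double quote inside a quoted field is read from its doubled form "".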
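A minimal sketch of such a read, assuming an existing SparkSession is passed in as spark. The names READ_OPTIONS and read_tags_csv are hypothetical and only used for illustration; the path is whatever you already pass to .csv(...).

```python
# Reader options for a CSV whose quoted fields escape quotes by doubling them.
# READ_OPTIONS and read_tags_csv are illustrative names, not part of Spark.
READ_OPTIONS = {
    "header": "true",
    "quote": '"',      # fields are wrapped in double quotes
    "escape": '"',     # a literal quote inside a field is written as ""
    "multiLine": "true",
}


def read_tags_csv(spark, path):
    """Read the CSV at `path` using an existing SparkSession `spark`."""
    return spark.read.options(**READ_OPTIONS).csv(path)
```

With these options the whole Tags field should come back as one column value instead of being cut at the first comma, for example df = read_tags_csv(spark, path).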
If you wrap the string in curly braces, it can be parsed as JSON, and you can then easily extract ApplicationID.
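One way to sketch the curly-brace idea is a small Python helper wrapped in a UDF. This is an illustration rather than the only approach: extract_application_id and add_application_id are hypothetical names, the DataFrame is assumed to have a string column called Tags, and keys are stripped of surrounding spaces so that both "ApplicationID " and "ApplicationID" match.

```python
import json


def extract_application_id(tags):
    """Return the ApplicationID value from a Tags string, or None if absent."""
    if tags is None:
        return None
    try:
        # Wrapping the key/value pairs in braces yields a JSON object.
        parsed = json.loads("{" + tags + "}")
    except json.JSONDecodeError:
        return None
    # Keys in the sample file carry stray spaces ("ApplicationID "), so strip them.
    cleaned = {key.strip(): value for key, value in parsed.items()}
    value = cleaned.get("ApplicationID")
    return value.strip() if isinstance(value, str) else None


def add_application_id(df):
    """Add an ApplicationID column to a DataFrame that has a string Tags column."""
    # Imported here so the helper above can be tried without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    extract = F.udf(extract_application_id, StringType())
    return df.withColumn("ApplicationID", extract(F.col("Tags")))
```

Because the helper looks the key up by name, it does not depend on where ApplicationID appears in the string, which is what breaks the fixed-position substring in the question for row 3.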
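Note: your CSV file appears to contain some odd whitespace (for example "ApplicationID "). I am not sure whether that is really the case, but I wrote the code sample to match it.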