Parsing a JSON string with escaped quotes in Spark Scala

vlf7wbxs, posted on 2021-07-13 in Spark

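This question already has answers here:

Parsing a string in Scala [closed] (2 answers)
Closed two months ago.
I am trying to parse the string given below as JSON using Scala, but so far I have been unable to, because of the escaped quotes \" that occur in the fields.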

{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r   <RegistrationInfo>\r     <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r     <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r   </RegistrationInfo>\r   <Triggers>\r     <TimeTrigger>\r       <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r       <Enabled>true</Enabled>\r     </TimeTrigger>\r   </Triggers>\r   <Settings>\r     <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r     <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r     <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r     <AllowHardTerminate>true</AllowHardTerminate>\r     <StartWhenAvailable>false</StartWhenAvailable>\r     <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r     <IdleSettings>\r       <Duration>PT10M</Duration>\r       <WaitTimeout>PT1H</WaitTimeout>\r       <StopOnIdleEnd>true</StopOnIdleEnd>\r       <RestartOnIdle>false</RestartOnIdle>\r     </IdleSettings>\r     <AllowStartOnDemand>true</AllowStartOnDemand>\r     <Enabled>true</Enabled>\r     <Hidden>false</Hidden>\r     <RunOnlyIfIdle>false</RunOnlyIfIdle>\r     <WakeToRun>false</WakeToRun>\r     <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r     <Priority>7</Priority>\r   </Settings>\r   <Actions Context=\"Author\">\r     <Exec>\r       
<Command>%systemroot%\\system32\\usoclient.exe</Command>\r       <Arguments>StartUWork</Arguments>\r     </Exec>\r   </Actions>\r   <Principals>\r     <Principal id=\"Author\">\r       <UserId>S-1-5-18</UserId>\r       <RunLevel>LeastPrivilege</RunLevel>\r     </Principal>\r   </Principals>\r </Task>\"}

So far I have tried spark.read.json as well as the net.liftweb library, but to no avail.
Any kind of help is much appreciated.

kxe2p93d #1

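The JSON you are receiving may not be valid JSON. If it is valid, then the problem lies in the TaskContent element, which holds XML tags with attributes; I think the quotes around those attribute values are what breaks the parsing. My idea is to remove the double quotes from the XML attribute values and then parse. Alternatively, you can replace those double quotes with some specific placeholder value and, once TaskContent is available as a DataFrame column, replace the placeholder again to recover the original content.
This may not be a perfect or efficient answer, but given how you receive the JSON, and provided the JSON structure stays the same, you can do the following:
Convert the JSON you have to work with into a string.
Perform some replaceAll operations on the string so that it looks like valid JSON.
Read the JSON into a DataFrame.
// Source data copied from the question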

val json = """{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r   <RegistrationInfo>\r     <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r     <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r   </RegistrationInfo>\r   <Triggers>\r     <TimeTrigger>\r       <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r       <Enabled>true</Enabled>\r     </TimeTrigger>\r   </Triggers>\r   <Settings>\r     <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r     <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r     <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r     <AllowHardTerminate>true</AllowHardTerminate>\r     <StartWhenAvailable>false</StartWhenAvailable>\r     <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r     <IdleSettings>\r       <Duration>PT10M</Duration>\r       <WaitTimeout>PT1H</WaitTimeout>\r       <StopOnIdleEnd>true</StopOnIdleEnd>\r       <RestartOnIdle>false</RestartOnIdle>\r     </IdleSettings>\r     <AllowStartOnDemand>true</AllowStartOnDemand>\r     <Enabled>true</Enabled>\r     <Hidden>false</Hidden>\r     <RunOnlyIfIdle>false</RunOnlyIfIdle>\r     <WakeToRun>false</WakeToRun>\r     <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r     <Priority>7</Priority>\r   </Settings>\r   <Actions Context=\"Author\">\r     <Exec>\r       
<Command>%systemroot%\\system32\\usoclient.exe</Command>\r       <Arguments>StartUWork</Arguments>\r     </Exec>\r   </Actions>\r   <Principals>\r     <Principal id=\"Author\">\r       <UserId>S-1-5-18</UserId>\r       <RunLevel>LeastPrivilege</RunLevel>\r     </Principal>\r   </Principals>\r </Task>\"}"""

//Modifying json to make it valid
val modifiedJson = json.replaceAll("\\\\\\\\","@").replaceAll("\\\\r","").replaceAll("\\\\","").replaceAll("   ","").replaceAll("  ","").replaceAll("> <","><").replaceAll("=\"","=").replaceAll("\">",">").replaceAll("@","\\\\\\\\").replaceAll("1.0\"","1.0").replaceAll("UTF-16\"?","UTF-16").replaceAll("1.2\"","1.2")
// Create a Dataset from the JSON string (outside spark-shell this needs the implicit String encoder)
import spark.implicits._
val ds = spark.createDataset(modifiedJson :: Nil)
//reading the dataset as json
val df = spark.read.json(ds)

You could optimize this to make it work more efficiently, but this is how I got it working.
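
The placeholder idea mentioned in this answer can be sketched in plain Scala, without Spark, to show the three steps in isolation. This is only an illustration and not the answerer's exact replaceAll chain: the object name `PlaceholderQuotes`, the marker `@Q@`, and the shortened sample string are all made up here, and the sketch assumes that `=\"` occurs only in front of XML attribute values and that every other `\"` is a structural JSON quote.

```scala
object PlaceholderQuotes {
  // Hypothetical marker; it must never occur in the real data.
  val Placeholder = "@Q@"

  // Regex for an escaped quote, i.e. a backslash followed by a double quote.
  private val EscQuote = "\\\\\""

  // Step 1: park the quotes around XML attribute values (=\"...\") as placeholders.
  def protect(s: String): String =
    s.replaceAll("=" + EscQuote + "(.*?)" + EscQuote, "=" + Placeholder + "$1" + Placeholder)

  // Step 2: treat every remaining \" as a JSON delimiter and turn it into a plain quote.
  def unescape(s: String): String = s.replace("\\\"", "\"")

  // Step 3: after parsing, put the real quotes back into the extracted value.
  def restore(value: String): String = value.replace(Placeholder, "\"")

  def main(args: Array[String]): Unit = {
    // Shortened stand-in for the record in the question (assumption, not real data).
    val raw = """{\"TaskContent\":\"<Task version=\"1.2\" id=\"Author\">\r</Task>\"}"""
    println(unescape(protect(raw)))
    println(restore("<Task version=@Q@1.2@Q@ id=@Q@Author@Q@>"))
  }
}
```

The string produced by `unescape(protect(raw))` is what would be handed to the JSON reader. In Spark, the last step could presumably be applied to the TaskContent column with `regexp_replace` instead of `restore`. Unlike the replaceAll chain above, this variant leaves the `\r` and `\\` sequences untouched, since both are already valid JSON escapes.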

ejk8hzay #2

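You can replace the escaped quotes \" in the JSON with " (except for those inside the XML content) using the regexp_replace function, and then read the result into a DataFrame: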

val jsonString = """
{\"TimeCreated\":\"2021-01-09T04:29:21.413Z\",\"Computer\":\"WIN-10.atfdetonate.local\",\"Channel\":\"Security\",\"Provider\":\"Microsoft-Windows-Security-Auditing\",\"Category\":\"AUDIT_SUCCESS\",\"Version\":\"1\",\"EventID\":\"4698\",\"EventRecordID\":\"12956650\",\"SubjectUserSid\":\"S-1-5-18\",\"SubjectUserName\":\"WIN-10$\",\"SubjectDomainName\":\"ATFDETONATE\",\"SubjectLogonId\":\"0x3e7\",\"TaskName\":\"\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start\",\"TaskContent\":\"<?xml version=\"1.0\" encoding=\"UTF-16\"?>\r <Task version=\"1.2\" xmlns=\"http://schemas.microsoft.com/windows/2004/02/mit/task\">\r   <RegistrationInfo>\r     <URI>\\Microsoft\\Windows\\UpdateOrchestrator\\Universal Orchestrator Start</URI>\r     <SecurityDescriptor>D:P(A;;FA;;;SY)(A;;FRFX;;;LS)(A;;FRFX;;;BA)</SecurityDescriptor>\r   </RegistrationInfo>\r   <Triggers>\r     <TimeTrigger>\r       <StartBoundary>2021-01-09T11:42:00.000Z</StartBoundary>\r       <Enabled>true</Enabled>\r     </TimeTrigger>\r   </Triggers>\r   <Settings>\r     <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>\r     <DisallowStartIfOnBatteries>true</DisallowStartIfOnBatteries>\r     <StopIfGoingOnBatteries>false</StopIfGoingOnBatteries>\r     <AllowHardTerminate>true</AllowHardTerminate>\r     <StartWhenAvailable>false</StartWhenAvailable>\r     <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>\r     <IdleSettings>\r       <Duration>PT10M</Duration>\r       <WaitTimeout>PT1H</WaitTimeout>\r       <StopOnIdleEnd>true</StopOnIdleEnd>\r       <RestartOnIdle>false</RestartOnIdle>\r     </IdleSettings>\r     <AllowStartOnDemand>true</AllowStartOnDemand>\r     <Enabled>true</Enabled>\r     <Hidden>false</Hidden>\r     <RunOnlyIfIdle>false</RunOnlyIfIdle>\r     <WakeToRun>false</WakeToRun>\r     <ExecutionTimeLimit>PT72H</ExecutionTimeLimit>\r     <Priority>7</Priority>\r   </Settings>\r   <Actions Context=\"Author\">\r     <Exec>\r       
<Command>%systemroot%\\system32\\usoclient.exe</Command>\r       <Arguments>StartUWork</Arguments>\r     </Exec>\r   </Actions>\r   <Principals>\r     <Principal id=\"Author\">\r       <UserId>S-1-5-18</UserId>\r       <RunLevel>LeastPrivilege</RunLevel>\r     </Principal>\r   </Principals>\r </Task>\"}
"""

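// spark-shell normally has both imports in scope already; a compiled application needs them
import org.apache.spark.sql.functions.regexp_replace
import spark.implicits._

val df = spark.read.json(
  Seq(jsonString).toDS
    .withColumn("value", regexp_replace($"value", """([:\[,{]\s*)\\"(.*?)\\"(?=\s*[:,\]}])""", "$1\"$2\""))
    .as[String]
)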

df.show  
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
//|     Category| Channel|            Computer|EventID|EventRecordID|            Provider|SubjectDomainName|SubjectLogonId|SubjectUserName|SubjectUserSid|         TaskContent|            TaskName|         TimeCreated|Version|
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
//|AUDIT_SUCCESS|Security|WIN-10.atfdetonat...|   4698|     12956650|Microsoft-Windows...|      ATFDETONATE|         0x3e7|        WIN-10$|      S-1-5-18|<?xml version="1....|\Microsoft\Window...|2021-01-09T04:29:...|      1|
//+-------------+--------+--------------------+-------+-------------+--------------------+-----------------+--------------+---------------+--------------+--------------------+--------------------+--------------------+-------+
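
To see what the pattern does without starting Spark, the same regular expression can be tried with `String.replaceAll`, which uses the same Java regex syntax. This is a minimal sketch: the object name `UnescapeOuterQuotes` and the shortened sample are assumptions made for illustration, not part of the answer.

```scala
object UnescapeOuterQuotes {
  // Same pattern as in the answer: an escaped quote pair is rewritten only when it
  // follows one of : [ , { and is followed (after optional whitespace) by one of : , ] }
  val Pattern = """([:\[,{]\s*)\\"(.*?)\\"(?=\s*[:,\]}])"""

  // Keep the leading delimiter ($1) and the content ($2), but use plain quotes.
  def unescape(s: String): String = s.replaceAll(Pattern, "$1\"$2\"")

  def main(args: Array[String]): Unit = {
    // Shortened stand-in for the record in the question (assumption, not real data).
    val sample = """{\"EventID\":\"4698\",\"TaskContent\":\"<Task version=\"1.2\">\r</Task>\"}"""
    // prints {"EventID":"4698","TaskContent":"<Task version=\"1.2\">\r</Task>"}
    println(unescape(sample))
  }
}
```

The quotes around the XML attribute value stay escaped, which is exactly what a JSON string value requires. Two limitations follow from the pattern itself: an escaped quote inside the XML that happens to be followed by `,`, `:`, `]` or `}` would end the match too early, and `.` does not match a line break, so each record has to sit on a single line.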
