Parsing JSON in R: lexical error: invalid char in json text

Posted by rwqw0loc on 2022-11-26 in Other

I have a data frame in R ("my_file") that looks like this:

NAME                                                                                                                                                                                     Address_Parse
1 name1 [('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]
2 name2 [('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]

In case the structure is unclear, here is the data in reproducible form:

my_file = structure(list(NAME = c("name1", "name2"), Address_Parse = c("[('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]", 
"[('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]"
)), class = "data.frame", row.names = c(NA, -2L))

Goal: for each row, I want to take each "element" (e.g. "StreetNumber", "StreetName", "StreetType", etc.) and turn it into a new column. The result should look like this:

name StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode
1 name1          372      river         St               S      toronto       ON     A1C9R7
2 name2          208      ocean         St               E      Toronto       ON     J8N1G8

The address field looks like JSON to me (I could be wrong), so I looked for different ways to parse JSON. For example, I tried applying the answer given in "R: convert nested JSON in a data frame column to addtional columns in the same data frame":

library(dplyr)
library(tidyr)
library(purrr)
library(jsonlite)

final = my_file %>%
  mutate(
    json_parsed = map(Address_Parse, ~ fromJSON(., flatten=TRUE))
  ) %>%
  unnest(json_parsed)

However, this produces the following error:

Error in `mutate()`:
! Problem while computing `json_parsed = map(Address_Parse, ~fromJSON(., flatten = TRUE))`.
Caused by error:
! lexical error: invalid char in json text.
                                      [('372', 'StreetNumber'), ('rive
                     (right here) ------^
Run `rlang::last_error()` to see where the error occurred.

I then tried another approach:

final <- my_file %>% 
          rowwise() %>%
          do(data.frame(fromJSON(.$Address_Parse , flatten = T))) %>%
          ungroup() %>%
          bind_cols(my_file  %>% select(-Address_Parse ))

But now I get another error:

Error: lexical error: invalid char in json text.
                                      [('372', 'StreetNumber'), ('rive
                     (right here) ------^

Can anyone show me how to fix this?
Thank you!

Answer 1 (06odsfpq)

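The text is not valid JSON as it stands, so you may need to reformat it slightly before it will parse.
I used stream_in rather than fromJSON because it is usually faster and handles a lot of things automatically.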

library(jsonlite)
out <- stream_in(textConnection(chartr("()'", '[]"', my_file$Address_Parse)))
s <- seq(1, ncol(out)/2)
setNames(out[s], unlist(out[1, -s]))

#  StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode PostalCode
#1          372      river         St               S      toronto       ON        A1C        9R7
#2          208      ocean         St               E      Toronto       ON        J8N        1G8
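
To make the reformatting step concrete, here is a minimal sketch of what chartr does to a single value and why the raw text is rejected. The string is shortened to two tuples for readability, and the variable names (raw, json, m, ok_before, ok_after) are illustrative only. It assumes the values never contain apostrophes or parentheses of their own, since chartr swaps every occurrence of those characters.

```r
library(jsonlite)

# One raw value from Address_Parse (shortened to two tuples)
raw <- "[('372', 'StreetNumber'), ('river', 'StreetName')]"

# Single quotes and parentheses are not JSON syntax, so this is FALSE
ok_before <- isTRUE(validate(raw))

# Swap ( ) ' for [ ] " character by character
json <- chartr("()'", '[]"', raw)
ok_after <- isTRUE(validate(json))

# An array of equal-length string arrays simplifies to a character matrix:
# column 1 holds the values, column 2 holds the labels
m <- fromJSON(json)
print(m)
```

After the swap, each row is a JSON array of [value, label] pairs, which is the shape the stream_in call above relies on.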

Answer 2 (qc6wkl3g)

Before calling fromJSON, we may need to modify the text: turn each (value, 'key') pair into "key": value, and insert { and } just inside the [ and ].

library(dplyr)
library(purrr)
library(stringr)
library(jsonlite)
library(tidyr)
my_file %>%
  mutate(Address_Parse = str_replace_all(Address_Parse,
           "\\(([^,]+),\\s*([^)]+)\\)", "\\2:\\1") %>%
         str_replace(fixed("["), "[{") %>%
         str_replace(fixed("]"), "}]") %>%
         str_replace_all(fixed("'"), '"') %>%
         map(fromJSON)) %>%
  unnest(Address_Parse) %>%
  type.convert(as.is = TRUE)
Output:
# A tibble: 2 × 8
  NAME  StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode
  <chr>        <int> <chr>      <chr>      <chr>           <chr>        <chr>    <chr>     
1 name1          372 river      St         S               toronto      ON       A1C       
2 name2          208 ocean      St         E               Toronto      ON       J8N
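
In the output above, the repeated PostalCode label ends up as a single value (A1C, J8N), whereas the question asks for the two halves joined (A1C9R7, J8N1G8). The following is one possible sketch of how to get there, not part of the original answer: it converts each string to JSON with chartr, then pastes together all values that share a label. parse_address is a hypothetical helper name, and the approach assumes the values contain no apostrophes or parentheses.

```r
library(jsonlite)

my_file <- data.frame(
  NAME = c("name1", "name2"),
  Address_Parse = c(
    "[('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]",
    "[('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse one Address_Parse string into a one-row data frame
parse_address <- function(x) {
  # Character matrix: column 1 = value, column 2 = label
  m <- fromJSON(chartr("()'", '[]"', x))
  keys <- unique(m[, 2])
  # Concatenate every value that shares a label (joins the two PostalCode parts)
  vals <- vapply(keys,
                 function(k) paste(m[m[, 2] == k, 1], collapse = ""),
                 character(1))
  as.data.frame(as.list(vals), stringsAsFactors = FALSE)
}

result <- cbind(my_file["NAME"],
                do.call(rbind, lapply(my_file$Address_Parse, parse_address)))
print(result)
```

All columns come back as character here; type.convert(result, as.is = TRUE) can be applied afterwards if a numeric StreetNumber is wanted.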

Alternatively, use reticulate, since the values look like Python tuples:

library(reticulate)
py_run_string(paste0("tmp=", paste(my_file$Address_Parse, 
            collapse = ",")))
out <- cbind(my_file[1], do.call(rbind, lapply(py$tmp, \(x) 
  do.call(cbind, lapply(x, \(y) setNames(data.frame(y[[1]]), 
    y[[2]]))))))
Output:
> out
   NAME StreetNumber StreetName StreetType StreetDirection Municipality Province PostalCode PostalCode
1 name1          372      river         St               S      toronto       ON        A1C        9R7
2 name2          208      ocean         St               E      Toronto       ON        J8N        1G8
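
If a Python installation is not available, the same tuple structure can be read directly with a regular expression. This is only a sketch under the assumption that every tuple has exactly two single-quoted fields with no embedded apostrophes; tuple_pattern and long are illustrative names, and the result is a long table (one row per tuple) rather than the wide layout shown above.

```r
library(stringr)

my_file <- data.frame(
  NAME = c("name1", "name2"),
  Address_Parse = c(
    "[('372', 'StreetNumber'), ('river', 'StreetName'), ('St', 'StreetType'), ('S', 'StreetDirection'), ('toronto', 'Municipality'), ('ON', 'Province'), ('A1C', 'PostalCode'), ('9R7', 'PostalCode')]",
    "[('208', 'StreetNumber'), ('ocean', 'StreetName'), ('St', 'StreetType'), ('E', 'StreetDirection'), ('Toronto', 'Municipality'), ('ON', 'Province'), ('J8N', 'PostalCode'), ('1G8', 'PostalCode')]"
  ),
  stringsAsFactors = FALSE
)

# One (value, label) tuple: two single-quoted fields inside parentheses
tuple_pattern <- "\\('([^']*)',\\s*'([^']*)'\\)"

# One matrix per row of my_file: full match, value, label
matches <- str_match_all(my_file$Address_Parse, tuple_pattern)

# Stack the matches into a long table with one row per tuple
long <- do.call(rbind, Map(function(name, m) {
  data.frame(NAME = name, label = m[, 3], value = m[, 2],
             stringsAsFactors = FALSE)
}, my_file$NAME, matches))
rownames(long) <- NULL
print(long)
```

The long layout keeps both PostalCode rows for each name, so how to combine them is left as an explicit choice for the reader.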
