JSON: is there a way to read nested columns?

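I have a bunch of newline-delimited JSON files that I want to read into R using the arrow package.
One of the parameters in the files is nested. The potential nested values are quite large and messy, so I would rather select only the nested parameters I actually need.
Here is an example of the data I'm working with: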

# Bring in libraries
suppressMessages(library(arrow))
suppressMessages(library(data.table))

# Make data
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
    { "hello": 3.5, "world": false, "yo":{"param1":"duck1","param2":"duck2"} }
    { "hello": 3.25, "world": null, "yo":{"param1":"duck3","param2":"duck4"} }
    { "hello": 0.0, "world": true, "yo":{"param1":"duck5","param2":"duck6"} }
  ', tf, useBytes = TRUE)
df <- read_json_arrow(tf)

Here is the result of what I just read in:

read_json_arrow(tf, col_select = "yo")

I can also read in just the "yo" column. The result is as follows:

But I run into trouble when I try to read only the "yo.param1" data element:

Any ideas on how to read this nested column without reading in the whole column?

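When you use the read_* functions with as_data_frame = FALSE, you read the data in as an Arrow Table, which is stored in memory (with the default, as_data_frame = TRUE, the result is converted to an R data frame). Arrow is designed around zero-copy operations, so if you can manipulate the Arrow object directly instead of pulling it into R, you avoid creating intermediate copies and blowing up your R session when working with larger objects.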
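To make the distinction concrete, here is a minimal sketch comparing the two return types. The one-line sample file and the variable names `as_df` and `as_tbl` are made up for illustration.

```r
library(arrow)

# Tiny newline-delimited JSON file, for illustration only
tf <- tempfile()
writeLines('{ "hello": 3.5, "yo": {"param1": "duck1"} }', tf)

# Default (as_data_frame = TRUE): the result is converted to an R data frame
as_df <- read_json_arrow(tf)

# as_data_frame = FALSE: the result stays an Arrow Table
as_tbl <- read_json_arrow(tf, as_data_frame = FALSE)

is.data.frame(as_df)       # TRUE
inherits(as_tbl, "Table")  # TRUE

# The schema shows "yo" as a struct column without converting anything to R
as_tbl$schema

unlink(tf)
```
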
I have a potential solution that keeps the data in Arrow objects and only pulls it into R at the last moment, though it's not the most elegant.

# Bring in libraries
suppressMessages(library(arrow))

# Make data
tf <- tempfile()
on.exit(unlink(tf))
writeLines('
    { "hello": 3.5, "world": false, "yo":{"param1":"duck1","param2":"duck2"} }
    { "hello": 3.25, "world": null, "yo":{"param1":"duck3","param2":"duck4"} }
    { "hello": 0.0, "world": true, "yo":{"param1":"duck5","param2":"duck6"} }
  ', tf, useBytes = TRUE)

# read in the JSON table as an Arrow Table
my_tbl <- read_json_arrow(tf, col_select = c("hello", "world"), as_data_frame = FALSE)
complex_cols <- read_json_arrow(tf, col_select = "yo", as_data_frame = FALSE)

# subselect the "yo" column - this is an Arrow ChunkedArray object 
# containing a Struct at position 0
yo_col <- complex_cols[["yo"]]
yo_col
#> ChunkedArray
#> <struct<param1: string, param2: string>>
#> [
#>   -- is_valid: all not null
#>   -- child 0 type: string
#>     [
#>       "duck1",
#>       "duck3",
#>       "duck5"
#>     ]
#>   -- child 1 type: string
#>     [
#>       "duck2",
#>       "duck4",
#>       "duck6"
#>     ]
#> ]

# extract the Struct by passing in the chunk number
sa <- yo_col$chunk(0)
sa
#> StructArray
#> <struct<param1: string, param2: string>>
#> -- is_valid: all not null
#> -- child 0 type: string
#>   [
#>     "duck1",
#>     "duck3",
#>     "duck5"
#>   ]
#> -- child 1 type: string
#>   [
#>     "duck2",
#>     "duck4",
#>     "duck6"
#>   ]

# extract the "param1" column from the Struct
param1_col <- sa[["param1"]]
param1_col
#> Array
#> <string>
#> [
#>   "duck1",
#>   "duck3",
#>   "duck5"
#> ]

# Add the param1 column to the original Table
my_tbl[["param1"]] <- param1_col
my_tbl
#> Table
#> 3 rows x 3 columns
#> $hello <double>
#> $world <bool>
#> $param1 <string>

# now pull the table into R
dplyr::collect(my_tbl)
#> # A tibble: 3 × 3
#>   hello world param1
#>   <dbl> <lgl> <chr> 
#> 1  3.5  FALSE duck1 
#> 2  3.25 NA    duck3 
#> 3  0    TRUE  duck5

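I've been looking into how to do this directly in the tidyverse (we model a lot of the arrow package's design on the tidyverse), but many of the solutions I've seen involve running purrr::map() inside dplyr::select(), which is a workflow not currently implemented in arrow, and I don't know if it's even possible. If you'd like to make a feature request, feel free to open a ticket on the repo.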
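One last point: in the example above this probably won't make much difference to the memory footprint, but if you have a lot of nested items to extract and reassemble into a table, you may see more of a benefit.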
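For the case of several nested items, the same steps can be repeated in a loop. This is only a sketch under the same assumptions as the example above: the struct column arrives as a single chunk (so `$chunk(0)` covers every row), and `wanted` and `result` are illustrative names, not part of arrow.

```r
library(arrow)

# Same sample data as in the question
tf <- tempfile()
writeLines(c(
  '{ "hello": 3.5, "world": false, "yo":{"param1":"duck1","param2":"duck2"} }',
  '{ "hello": 3.25, "world": null, "yo":{"param1":"duck3","param2":"duck4"} }',
  '{ "hello": 0.0, "world": true, "yo":{"param1":"duck5","param2":"duck6"} }'
), tf)

# Read the flat columns and the struct column as separate Arrow Tables
my_tbl <- read_json_arrow(tf, col_select = c("hello", "world"), as_data_frame = FALSE)
yo_tbl <- read_json_arrow(tf, col_select = "yo", as_data_frame = FALSE)

# Assumption: a small file is read as one chunk, so chunk 0 holds all rows
sa <- yo_tbl[["yo"]]$chunk(0)

# Names of the nested fields to keep (illustrative)
wanted <- c("param1", "param2")
for (nm in wanted) {
  # Attach each nested field to the Table as its own top-level column
  my_tbl[[nm]] <- sa[[nm]]
}

# Only now convert to an R data frame
result <- as.data.frame(my_tbl)
names(result)

unlink(tf)
```

For larger files the column may be split across several chunks, in which case each chunk would need the same treatment before the pieces are recombined.
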
