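Suppose I have the following 20 addresses. I want to split the list into 4 groups of 5 addresses each and feed each group to the geocoder in sequence.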
library(tidyverse)
df <- tibble::tribble(
~num_street, ~city, ~state, ~zip_code,
"976 FAIRVIEW DR", "SPRINGFIELD", "OR", 97477L,
"19843 HWY 213", "OREGON CITY", "OR", 97045L,
"402 CARL ST", "DRAIN", "OR", 97435L,
"304 WATER ST", "WESTON", "OR", 97886L,
"5054 TECHNOLOGY LOOP", "CORVALLIS", "OR", 97333L,
"3401 YACHT AVE", "LINCOLN CITY", "OR", 97367L,
"135 ROOSEVELT AVE", "BEND", "OR", 97702L,
"3631 FENWAY ST", "FOREST GROVE", "OR", 97116L,
"92250 HILLTOP LN", "COQUILLE", "OR", 97423L,
"6920 92ND AVE", "TIGARD", "OR", 97223L,
"591 LAUREL ST", "JUNCTION CITY", "OR", 97448L,
"32035 LYNX HOLLOW RD", "CRESWELL", "OR", 97426L,
"6280 ASTER ST", "SPRINGFIELD", "OR", 97478L,
"17533 VANGUARD LN", "BEAVERTON", "OR", 97007L,
"59937 CHEYENNE RD", "BEND", "OR", 97702L,
"2232 42ND AVE", "SALEM", "OR", 97317L,
"3100 TURNER RD", "SALEM", "OR", 97302L,
"3495 CHAMBERS ST", "EUGENE", "OR", 97405L,
"585 WINTER ST", "SALEM", "OR", 97301L,
"23985 VAUGHN RD", "VENETA", "OR", 97487L
)
The code I use for geocoding is:
library(censusxy)
system.time({
  dropme_dta <-
    cxy_geocode(df,
                street = 'num_street',
                city = 'city',
                state = 'state',
                zip = 'zip_code',
                return = 'geographies',
                class = 'dataframe',
                output = 'full',
                parallel = 8,
                vintage = 4,
                timeout = 30)
})
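I would especially like an approach that avoids loops and stays within the tidyverse. That said, I think there may be a way to do this with purrr::reduce(), but for the life of me I have not been able to figure it out.
Any pointers would be much appreciated!
P.S. I know I could pass all 20 addresses to the geocoder at once, but in reality I have about 4 million addresses, and I want to track progress by printing the batch number.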
Edit: Based on the feedback in the comments, I agree that a loop is the best way forward. Here is what I have so far:
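library(tidygeocoder)
# Zero-based batch id: rows 1-5 -> 0, rows 6-10 -> 1, and so on (4 batches of 5)
df <- df %>%
  group_by(group_id = (row_number() - 1) %/% 5)
for (x in 0:max(df$group_id)) {
  cat(paste("\rgeocoding batch", x, "of", max(df$group_id), "\n"))
  Sys.sleep(1)
  # Geocode only the current batch; the result is not stored anywhere yet
  df %>%
    filter(group_id == x) %>%
    geocode(street = num_street, city = city, state = state, postalcode = zip_code,
            method = "census", full_results = TRUE,
            api_options = list(census_return_type = 'geographies'))
}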
But I don't know how to build up df iteratively. If I assign the result of geocode() to an object, it gets overwritten on every iteration.
1 Answer
Based on your latest edit, you can store the result of each batch in a list and then combine everything into a single tibble at the end.
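A minimal sketch of that pattern follows. To keep it runnable offline it makes a few assumptions: the 20 addresses are replaced by a small made-up tibble, and `fake_geocode()` is a hypothetical placeholder for the real geocoder call (for example `tidygeocoder::geocode()`), so only the list-and-combine logic is meant literally.

```r
library(dplyr)

# Made-up stand-in for the 20-address tibble from the question.
df <- tibble(
  num_street = paste(1:20, "MAIN ST"),
  city       = "SALEM",
  state      = "OR",
  zip_code   = 97301L
) %>%
  mutate(group_id = (row_number() - 1) %/% 5)  # 4 batches of 5 rows

# Hypothetical placeholder for the real geocoder call;
# it only appends empty coordinate columns.
fake_geocode <- function(batch) {
  mutate(batch, lat = NA_real_, long = NA_real_)
}

batch_ids <- sort(unique(df$group_id))
l <- vector("list", length(batch_ids))  # one slot per batch

for (i in seq_along(batch_ids)) {
  cat("geocoding batch", i, "of", length(batch_ids), "\n")
  # Store this batch's result instead of overwriting a single object.
  l[[i]] <- df %>%
    filter(group_id == batch_ids[i]) %>%
    fake_geocode()
}

# Stack the per-batch results into one data frame.
result <- do.call(rbind, l)
```

`dplyr::bind_rows(l)` should work in place of `do.call(rbind, l)` and is more forgiving if the batches come back with slightly different columns.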
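The list l holds the result of each step, and a call to do.call joins those results into a single tibble. Given how large your dataset is, you may run into memory problems before the loop finishes. In that case, you can write intermediate results to files (every n batches: save the results to a file, empty the list, continue). All the partial results can then be merged at the end.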
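The save-and-empty idea could look roughly like the sketch below. The names `fake_geocode()`, `out_dir` and `flush_every` are all assumptions made for illustration, as is the choice of RDS files in a temporary directory; the real geocoder call and a permanent output path would replace them.

```r
library(dplyr)

# Made-up stand-in data: 20 rows in 4 batches of 5.
df <- tibble(id = 1:20) %>%
  mutate(group_id = (row_number() - 1) %/% 5)

# Hypothetical placeholder for the real geocoder call.
fake_geocode <- function(batch) {
  mutate(batch, lat = NA_real_, long = NA_real_)
}

out_dir <- tempfile("geocode_parts_")  # assumed output location
dir.create(out_dir)
flush_every <- 2                       # assumed: write to disk every 2 batches

batch_ids <- sort(unique(df$group_id))
l <- list()

for (i in seq_along(batch_ids)) {
  l[[length(l) + 1]] <- df %>%
    filter(group_id == batch_ids[i]) %>%
    fake_geocode()

  # Every `flush_every` batches, and after the last batch,
  # save what has accumulated and empty the list to free memory.
  if (length(l) == flush_every || i == length(batch_ids)) {
    saveRDS(bind_rows(l), file.path(out_dir, sprintf("part_%05d.rds", i)))
    l <- list()
  }
}

# Merge all partial results at the end.
part_files <- list.files(out_dir, pattern = "\\.rds$", full.names = TRUE)
result <- bind_rows(lapply(part_files, readRDS))
```

Zero-padding the file names keeps `list.files()` returning the parts in batch order, and files already on disk survive a crash partway through a long run.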
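Alternatively, you could build a dummy df with the same number of rows and columns as the expected result and replace its values after each iteration. This approach will probably be slower.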
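A rough sketch of the preallocation variant, under the assumption that the geocoder returns one row per input row in the same order. `fake_geocode()` is again a hypothetical placeholder, here returning made-up coordinates so the filled-in values are visible.

```r
library(dplyr)

# Made-up stand-in data: 20 rows in 4 batches of 5.
df <- tibble(id = 1:20) %>%
  mutate(group_id = (row_number() - 1) %/% 5)

# Hypothetical placeholder for the real geocoder call,
# returning made-up coordinates derived from the row id.
fake_geocode <- function(batch) {
  mutate(batch, lat = as.numeric(id), long = -as.numeric(id))
}

# Dummy result with the final shape, filled with NA.
result <- mutate(df, lat = NA_real_, long = NA_real_)

for (g in sort(unique(df$group_id))) {
  rows <- which(df$group_id == g)       # row positions of this batch
  geocoded <- fake_geocode(df[rows, ])
  # Overwrite the placeholder values for just these rows.
  result[rows, c("lat", "long")] <- geocoded[, c("lat", "long")]
}
```

With this layout the full-size result stays in memory for the whole run, so it does not help with the memory concern raised above.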