Split a large list of addresses and feed batches into a geocoder

Question

Say I have the following 20 addresses, and I want to split the list into 4 groups of 5 addresses each, and feed each group sequentially into a geocoder.

library(tidyverse)

# The third column is named `state` so that it matches the `state` argument used in the geocoding calls
df <- tibble::tribble(
              ~num_street,           ~city, ~state, ~zip_code,
        "976 FAIRVIEW DR",   "SPRINGFIELD",   "OR",    97477L,
          "19843 HWY 213",   "OREGON CITY",   "OR",    97045L,
            "402 CARL ST",         "DRAIN",   "OR",    97435L,
           "304 WATER ST",        "WESTON",   "OR",    97886L,
   "5054 TECHNOLOGY LOOP",     "CORVALLIS",   "OR",    97333L,
         "3401 YACHT AVE",  "LINCOLN CITY",   "OR",    97367L,
      "135 ROOSEVELT AVE",          "BEND",   "OR",    97702L,
         "3631 FENWAY ST",  "FOREST GROVE",   "OR",    97116L,
       "92250 HILLTOP LN",      "COQUILLE",   "OR",    97423L,
          "6920 92ND AVE",        "TIGARD",   "OR",    97223L,
          "591 LAUREL ST", "JUNCTION CITY",   "OR",    97448L,
   "32035 LYNX HOLLOW RD",      "CRESWELL",   "OR",    97426L,
          "6280 ASTER ST",   "SPRINGFIELD",   "OR",    97478L,
      "17533 VANGUARD LN",     "BEAVERTON",   "OR",    97007L,
      "59937 CHEYENNE RD",          "BEND",   "OR",    97702L,
          "2232 42ND AVE",         "SALEM",   "OR",    97317L,
         "3100 TURNER RD",         "SALEM",   "OR",    97302L,
       "3495 CHAMBERS ST",        "EUGENE",   "OR",    97405L,
          "585 WINTER ST",         "SALEM",   "OR",    97301L,
        "23985 VAUGHN RD",        "VENETA",   "OR",    97487L
  )

The code I'm using to geocode is:

library(censusxy)

system.time({
  dropme_dta <- 
    cxy_geocode(df, 
                street = 'num_street', 
                city = 'city', 
                state = 'state', 
                zip = 'zip_code', 
                return = 'geographies', 
                class = 'dataframe', 
                output = 'full', 
                parallel = 8, 
                vintage = 4,
                timeout = 30)
})

I am particularly interested in approaches that do not use loops and stay in the tidyverse. I think there may be a way using purrr::reduce(), but for the life of me I haven't been able to figure it out.

I'd be most grateful for any pointers!

P.S. I know that I can just pass all 20 addresses to the geocoder, but in practice I have about 4 million addresses, and I want to keep track of progress by printing the number of the batch currently being processed.

EDIT: based on feedback in comments, I agree that a loop is the best way forward. This is what I have so far:

library(tidygeocoder)

df <- df %>% 
  group_by(group_id = row_number() %/% 5)

for (x in 0:max(df$group_id)) {
  cat(paste("\rgeocoding batch", x, "of", max(df$group_id), "\n"))
  Sys.sleep(1)
  df %>% 
    geocode(street = num_street, city = city, state = state, postalcode = zip_code, 
           method = "census", full_results = TRUE, api_options = list(census_return_type = 'geographies'))
}

But I don't know how to iteratively build up the df. If I assign the result of geocode() to something, it will be overwritten on each iteration.

Answer 1

Score: 1

Based on your latest edit, you can save the result of each batch in a list and then combine everything into a single tibble at the end. Something like this:

library(dplyr)         # provides %>%
library(tidygeocoder)  # provides geocode()

# Row indices of the batch handled in the current iteration
ind  <- 1:4 # batches of 4 rows
intL <- length(ind)
nr   <- nrow(df)

# List that collects one result per batch
l <- list()
lind <- 1 

# Loop until the batch that contains the last row has been processed
continue <- TRUE
while(continue){
  
  # Last batch: drop indices past the end of df and stop afterwards
  if(nr %in% ind){
    ind <- ind[ind <= nr]
    continue <- FALSE
  }
  
  l[[lind]] <- df[ind,] %>% 
    geocode(street = num_street, city = city, state = state, postalcode = zip_code, 
            method = "census", full_results = TRUE, api_options = list(census_return_type = 'geographies'))
  
  # Move on to the next list slot and the next block of rows
  lind <- lind + 1 
  ind <-  ind  + intL
  
}

# Join all results
do.call(rbind, l)

The list l holds the result of each batch, and do.call(rbind, l) stacks those results into a single tibble.
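The same list-then-combine idea can also be written without manual index bookkeeping, which is closer to the tidyverse style asked for in the question. The following is only a sketch under stated assumptions: `geocode_batch()` is a hypothetical stand-in that adds empty `lat` and `long` columns so the example runs offline, and `addresses` is made-up data. In real use its body would be the tidygeocoder `geocode()` call shown above. Note that `row_number() %/% 5` from the question produces uneven groups (rows 1 to 4 land in group 0 and row 20 alone in group 4), so the sketch subtracts 1 before dividing.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the real geocoding call (assumption, not a real API).
# Replace the body with the geocode() call from the loop above.
geocode_batch <- function(batch) {
  mutate(batch, lat = NA_real_, long = NA_real_)
}

# Made-up data for illustration only
addresses <- tibble::tibble(id = 1:20, num_street = paste(1:20, "MAIN ST"))

batch_size <- 5
# (row - 1) %/% size + 1 gives equal-sized batches numbered from 1
batch_id <- (seq_len(nrow(addresses)) - 1) %/% batch_size + 1
batches  <- split(addresses, batch_id)

# imap() passes each batch together with its name, which is the batch number
results <- imap(batches, function(batch, name) {
  message("geocoding batch ", name, " of ", length(batches))
  geocode_batch(batch)
})

# Stack the per-batch results into one tibble
geocoded <- bind_rows(results)
```

`purrr::reduce()` is not needed here: each batch is independent, so mapping over the batches and binding the rows once at the end is enough.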

Given the size of your dataset, you could run into memory issues before the loop finishes. In that case you can save intermediate results to files: every n batches, write the accumulated results to a file, empty the list, and continue. All partial results can be joined at the end.
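A minimal sketch of that checkpointing idea, writing one file per batch, might look like the following. The stub `geocode_batch()`, the made-up `addresses` data, the `geocode_batches` directory, and the file-name pattern are all assumptions made for illustration. Skipping files that already exist also lets an interrupted run resume where it stopped.

```r
library(dplyr)

# Hypothetical stand-in for the real geocoding call (assumption, not a real API)
geocode_batch <- function(batch) {
  mutate(batch, lat = NA_real_, long = NA_real_)
}

# Made-up data for illustration only
addresses <- tibble::tibble(id = 1:20, num_street = paste(1:20, "MAIN ST"))
batches   <- split(addresses, (seq_len(nrow(addresses)) - 1) %/% 5 + 1)

# Directory for the per-batch files (a temporary location in this sketch)
out_dir <- file.path(tempdir(), "geocode_batches")
dir.create(out_dir, showWarnings = FALSE)

for (i in seq_along(batches)) {
  # Zero-padded names keep the files in batch order when listed
  out_file <- file.path(out_dir, sprintf("batch_%05d.rds", i))
  if (file.exists(out_file)) next  # already geocoded in an earlier run
  message("geocoding batch ", i, " of ", length(batches))
  saveRDS(geocode_batch(batches[[i]]), out_file)
}

# Read every partial result back and stack them into one tibble
files    <- list.files(out_dir, pattern = "^batch_[0-9]+\\.rds$", full.names = TRUE)
geocoded <- bind_rows(lapply(files, readRDS))
```

Only one batch is held in memory at a time during the loop; the full result is assembled once at the end.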

Alternatively, you can pre-allocate a dummy df with the same number of rows and columns as the expected result and overwrite the matching rows after each iteration. This approach may be slower.

# Pseudocode: df must already contain every column that geocode() returns
loop {

  df[ind, ] <- geocode(df[ind, ], ...)

}
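The pre-allocation variant could be sketched as follows, using base R only. Everything here is illustrative: `geocode_batch()` is a hypothetical stub that returns dummy coordinates (not real geocoding output), and the result is assumed to have just two extra columns, `lat` and `long`, whereas a real call with full results returns many more.

```r
# Hypothetical stand-in for the real geocoding call; the coordinates are dummy values
geocode_batch <- function(batch) {
  data.frame(lat = 44 + batch$id / 100, long = -123 - batch$id / 100)
}

# Made-up data for illustration only
addresses <- data.frame(id = 1:20, num_street = paste(1:20, "MAIN ST"))

# Pre-allocate the result columns with NA so every row already exists
result <- addresses
result$lat  <- NA_real_
result$long <- NA_real_

batch_size <- 5
starts <- seq(1, nrow(result), by = batch_size)

for (s in starts) {
  # Row indices of this batch, clipped at the last row
  ind <- s:min(s + batch_size - 1, nrow(result))
  message("geocoding rows ", min(ind), " to ", max(ind))
  # Overwrite only the pre-allocated columns for these rows
  result[ind, c("lat", "long")] <- geocode_batch(result[ind, ])
}
```

Rows that were never reached keep their `NA` values, which makes it easy to see how far an interrupted run got.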

huangapple
  • Posted on May 31, 2023, 23:24:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76375083.html