如何从城市中获取国家的名称并将其保存在一个列中,使用 R 语言?

huangapple go评论64阅读模式
英文:

How to get the name of the country from the city and save in a column using r?

问题

我有这个数据集,它的格式如下:

  1. 城市,省/州,国家
  2. 城市
  3. 省/州
  4. 国家

现在,使用以下方法清理了数据集的大部分部分,并只保留了没有清理的部分。

首先,我尝试通过删除一些不必要的文本来清理

df <- df %>%
add_column(location_clean = trimws(str_replace_all(.$location, "Greater|Metropolitan|Metro|Area", "")),
.after = "location")

将位置分为地区、区域和国家。下一步将使用地区获取国家名称

df <- df %>%
separate(location_clean,
c("locale", "area1", "area2"), sep=",",
remove = FALSE, extra="merge", fill="right") %>%
mutate(locale = trimws(locale), area1 = trimws(area1), area2 = trimws(area2))

df <- 结构(list(location = c("United States", "Israel", "Greater Stockholm Metropolitan Area",
"Greater Chicago Area", "United States", "Greater Minneapolis-St. Paul Area",
"Greater Tampa Bay Area", "Greater Sacramento", "United States",
"Atlanta Metropolitan Area", "Kansas City Metropolitan Area",
"Kansas City Metropolitan Area", "Charlotte Metro", "San Diego Metropolitan Area",
"Greater Hamburg Area", "United Arab Emirates", "Greater Stockholm Metropolitan Area",
"Raleigh-Durham-Chapel Hill Area", "Slovakia", "Latvia", "Greater Sydney Area",
"Sweden", "Romania", "Canada", "United States", "Metro Jacksonville",
"Metro Jacksonville", "Greater Tampa Bay Area", "Singapore",
"Greater Bolzano Metropolitan Area", "Greater Scranton Area"),
locale = c("United States", "Israel", "Stockholm", "Chicago",
"United States", "Minneapolis-St. Paul", "Tampa Bay", "Sacramento",
"United States", "Atlanta", "Kansas City", "Kansas City",
"Charlotte", "San Diego", "Hamburg", "United Arab Emirates",
"Stockholm", "Raleigh-Durham-Chapel Hill", "Slovakia", "Latvia",
"Sydney", "Sweden", "Romania", "Canada", "United States",
"Jacksonville", "Jacksonville", "Tampa Bay", "Singapore",
"Bolzano", "Scranton"), area1 = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_), area2 = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), country = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_
)), row.names = c(NA, -31L), class = c("tbl_df", "tbl", "data.frame"))

现在,df 包括原始位置和从位置列分离出的地区。我不知道如何从locale列(或location列)中获取城市和国家名。

我尝试使用maps包中的world.cities,但当我使用mutate函数从locale列中获取城市名称来填充country列时,出现了一些问题。

例如:当我运行以下代码时,我得到不一致的结果。

get_country_name <- function(city){

w <- world.cities %>%
filter(name == city) %>%
arrange(desc(pop)) %>% head(1) %>% pull(country.etc)

return(w)
}

df %>%
mutate(country = ifelse(is.na(country), unlist(sapply(locale, get_country_name)), country))

另一个问题是,可能会有多个国家具有相同的城市名称。我正在采用的一种方法是按人口选择国家,基本上是做出一个假设,即仅有来自大城市的人在数据库中。如果有更好的方法,那将会很好。

期望的结果是带有正确国家名称的数据框。

df <- 结构(list(location = c("United States", "Israel", "Greater Stockholm Metropolitan Area",
"Greater Chicago Area", "United States", "Greater Minneapolis-St. Paul Area",
"Greater Tampa Bay Area", "Greater Sacramento", "United States",
"Atlanta Metropolitan Area", "Kansas City Metropolitan Area",
"Kansas City Metropolitan Area", "Charlotte Metro", "San Diego Metropolitan Area",
"Greater Hamburg Area", "United Arab Emirates"), locale = c("United States",
"Israel", "Sweden", "United States", "United States", "United States", "United States",
"United States", "United States", "United States", "United States",
"United States", "United States", "United States", "Germany",
"United Arab Emirates")), row.names = c(NA, -16L), class = c("tbl_df",
"tbl", "data.frame"))

英文:

I have this dataset which consists of location in the following format:

1. city, province/state, country
2. city
3. province/state
4. country

Now, much of the dataset is cleaned using the following method, and I only took the ones that doesn't have the clean set.

## first, I tried to clean by removing some texts that are not helpful

df &lt;- df%&gt;% 
  add_column(location_clean = trimws(str_replace_all(.$location, &quot;Greater|Metropolitan|Metro|Area&quot;, &quot;&quot;)), 
             .after = &quot;location&quot;)


## separate the location into locale, area, and country. locale will be used to gather the country names in the next step
df &lt;- df%&gt;% 
  separate(location_clean, 
           c(&quot;locale&quot;, &quot;area1&quot;,&quot;area2&quot;), sep=&quot;,&quot;, 
           remove = FALSE, extra=&quot;merge&quot;, fill=&quot;right&quot;) %&gt;%
  mutate(locale = trimws(locale), area1 = trimws(area1), area2 = trimws(area2))



    df &lt;- structure(list(location = c(&quot;United States&quot;, &quot;Israel&quot;, &quot;Greater Stockholm Metropolitan Area&quot;, 
&quot;Greater Chicago Area&quot;, &quot;United States&quot;, &quot;Greater Minneapolis-St. Paul Area&quot;, 
&quot;Greater Tampa Bay Area&quot;, &quot;Greater Sacramento&quot;, &quot;United States&quot;, 
&quot;Atlanta Metropolitan Area&quot;, &quot;Kansas City Metropolitan Area&quot;, 
&quot;Kansas City Metropolitan Area&quot;, &quot;Charlotte Metro&quot;, &quot;San Diego Metropolitan Area&quot;, 
&quot;Greater Hamburg Area&quot;, &quot;United Arab Emirates&quot;, &quot;Greater Stockholm Metropolitan Area&quot;, 
&quot;Raleigh-Durham-Chapel Hill Area&quot;, &quot;Slovakia&quot;, &quot;Latvia&quot;, &quot;Greater Sydney Area&quot;, 
&quot;Sweden&quot;, &quot;Romania&quot;, &quot;Canada&quot;, &quot;United States&quot;, &quot;Metro Jacksonville&quot;, 
&quot;Metro Jacksonville&quot;, &quot;Greater Tampa Bay Area&quot;, &quot;Singapore&quot;, 
&quot;Greater Bolzano Metropolitan Area&quot;, &quot;Greater Scranton Area&quot;), 
    locale = c(&quot;United States&quot;, &quot;Israel&quot;, &quot;Stockholm&quot;, &quot;Chicago&quot;, 
    &quot;United States&quot;, &quot;Minneapolis-St. Paul&quot;, &quot;Tampa Bay&quot;, &quot;Sacramento&quot;, 
    &quot;United States&quot;, &quot;Atlanta&quot;, &quot;Kansas City&quot;, &quot;Kansas City&quot;, 
    &quot;Charlotte&quot;, &quot;San Diego&quot;, &quot;Hamburg&quot;, &quot;United Arab Emirates&quot;, 
    &quot;Stockholm&quot;, &quot;Raleigh-Durham-Chapel Hill&quot;, &quot;Slovakia&quot;, &quot;Latvia&quot;, 
    &quot;Sydney&quot;, &quot;Sweden&quot;, &quot;Romania&quot;, &quot;Canada&quot;, &quot;United States&quot;, 
    &quot;Jacksonville&quot;, &quot;Jacksonville&quot;, &quot;Tampa Bay&quot;, &quot;Singapore&quot;, 
    &quot;Bolzano&quot;, &quot;Scranton&quot;), area1 = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), area2 = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), country = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    )), row.names = c(NA, -31L), class = c(&quot;tbl_df&quot;, &quot;tbl&quot;, &quot;data.frame&quot;
))

The df now consists of the original location and locale separated from the location column. I'm lost at how to get the country name from the city and country name in the locale column (or location column).

I have tried using world.cities from the maps package, but something is not working correctly when I mutate the country column to get the country name from city names in the locale column.

For example: when I run the following code, I get inconsistent results.

get_country_name &lt;- function(city){
  
  w &lt;- world.cities %&gt;% 
    filter(name == city) %&gt;% 
    arrange(desc(pop)) %&gt;% head(1) %&gt;% pull(country.etc)
  
  return(w)
}

df %&gt;% 
  mutate(country = ifelse(is.na(country), unlist(sapply(locale, get_country_name)), country))

Another issue is that there may be multiple countries with the same city name. One approach I'm taking is to select the country by the population, essentially, making an assumption that only people from big cities are in the database. If there's a better way to do this, that'd be great.

Output

The expected result is the dataframe with the correct country name filled.

df &lt;- structure(list(location = c(&quot;United States&quot;, &quot;Israel&quot;, &quot;Greater Stockholm Metropolitan Area&quot;, 
&quot;Greater Chicago Area&quot;, &quot;United States&quot;, &quot;Greater Minneapolis-St. Paul Area&quot;, 
&quot;Greater Tampa Bay Area&quot;, &quot;Greater Sacramento&quot;, &quot;United States&quot;, 
&quot;Atlanta Metropolitan Area&quot;, &quot;Kansas City Metropolitan Area&quot;, 
&quot;Kansas City Metropolitan Area&quot;, &quot;Charlotte Metro&quot;, &quot;San Diego Metropolitan Area&quot;, 
&quot;Greater Hamburg Area&quot;, &quot;United Arab Emirates&quot;), locale = c(&quot;United States&quot;, 
&quot;Israel&quot;, &quot;Stockholm&quot;, &quot;Chicago&quot;, &quot;United States&quot;, &quot;Minneapolis-St. Paul&quot;, 
&quot;Tampa Bay&quot;, &quot;Sacramento&quot;, &quot;United States&quot;, &quot;Atlanta&quot;, &quot;Kansas City&quot;, 
&quot;Kansas City&quot;, &quot;Charlotte&quot;, &quot;San Diego&quot;, &quot;Hamburg&quot;, &quot;United Arab Emirates&quot;
), area1 = c(NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_), area2 = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    country = c(&quot;United States&quot;, &quot;Israel&quot;, &quot;Sweden&quot;, 
    &quot;United States&quot;, &quot;United States&quot;, &quot;United States&quot;, &quot;United States&quot;, 
    &quot;United States&quot;, &quot;United States&quot;, &quot;United States&quot;, &quot;United States&quot;, 
    &quot;United States&quot;, &quot;United States&quot;, &quot;United States&quot;, &quot;Germany&quot;, 
    &quot;United Arab Emirates&quot;)), row.names = c(NA, -16L), class = c(&quot;tbl_df&quot;, 
&quot;tbl&quot;, &quot;data.frame&quot;))

答案1

得分: 1

使用adist函数,使用来自City_and_province_list.csv的数据,该数据包含一份经过命令行修改的唯一国家列表。

此外,通过在城市名称中的短横线后面切割名称,使用sub(&quot;-.*&quot;, &quot;&quot;, x)来进一步简化城市名称。

请注意,你使用的是"Tampa Bay",但应该是"Tampa"。这就是为什么它与"Mbamba Bay, Tanzania"相关联的原因。

仅显示localecountry两列。

c_c <- read.csv("City_and_province_list.mod", header = TRUE, sep = "\t")

c_c <- rbind(c_c, c("United States", "United States of America"))

cbind(df[,1:2], 
  country = sapply(df$locale, function(x) 
    c_c$Country[which.min(adist(sub("&quot;-.*&quot;", "&quot;&quot;", x), c_c$City, 
      ignore.case = TRUE))]))[,2:3]

相同的方法,使用dplyr库:

library(dplyr)

df %>%
  rowwise() %>%
  mutate(country = c_c$Country[which.min(
    adist(sub("&quot;-.*&quot;", "&quot;&quot;", locale), c_c$City, ignore.case = TRUE))]) %>%
  ungroup()

这是你要求的代码部分的中文翻译。

英文:

With adist, using data from City_and_province_list.csv with a unique list of Countries (modified on command line).

Also further simplifying your cities by cutting off names after a dash, see sub(&quot;-.*&quot;, &quot;&quot;, x)

Note that you're using "Tampa Bay" but it should be "Tampa". That's why it's associated with "Mbamba Bay, Tanzania".

Only showing columns locale and country

c_c &lt;- read.csv(&quot;City_and_province_list.mod&quot;, header=T, sep=&quot;\t&quot;)

c_c &lt;- rbind(c_c, c(&quot;United States&quot;, &quot;United States of America&quot;))

cbind(df[,1:2], 
  country = sapply(df$locale, function(x) 
    c_c$Country[which.min(adist(sub(&quot;-.*&quot;, &quot;&quot;, x), c_c$City, 
      ignore.case = T))]))[,2:3]
                       locale                  country
1               United States United States of America
2                      Israel                   Israel
3                   Stockholm                   Sweden
4                     Chicago United States of America
5               United States United States of America
6        Minneapolis-St. Paul United States of America
7                   Tampa Bay                 Tanzania
8                  Sacramento United States of America
9               United States United States of America
10                    Atlanta United States of America
11                Kansas City United States of America
12                Kansas City United States of America
13                  Charlotte United States of America
14                  San Diego United States of America
15                    Hamburg                  Germany
16       United Arab Emirates     United Arab Emirates
17                  Stockholm                   Sweden
18 Raleigh-Durham-Chapel Hill United States of America
19                   Slovakia                 Slovakia
20                     Latvia                   Latvia
21                     Sydney                Australia
22                     Sweden                   Sweden
23                    Romania                  Romania
24                     Canada                   Canada
25              United States United States of America
26               Jacksonville United States of America
27               Jacksonville United States of America
28                  Tampa Bay                 Tanzania
29                  Singapore                Singapore
30                    Bolzano                    Italy
31                   Scranton United States of America

Same approach using dplyr

library(dplyr)

df %&gt;% 
  rowwise() %&gt;% 
  mutate(country = c_c$Country[which.min(
    adist(sub(&quot;-.*&quot;, &quot;&quot;, locale), c_c$City, ignore.case = T))]) %&gt;% 
  ungroup()
# A tibble: 31 &#215; 5
   location                            locale               area1 area2 country 
   &lt;chr&gt;                               &lt;chr&gt;                &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;   
 1 United States                       United States        NA    NA    United …
 2 Israel                              Israel               NA    NA    Israel  
 3 Greater Stockholm Metropolitan Area Stockholm            NA    NA    Sweden  
 4 Greater Chicago Area                Chicago              NA    NA    United …
 5 United States                       United States        NA    NA    United …
 6 Greater Minneapolis-St. Paul Area   Minneapolis-St. Paul NA    NA    United …
 7 Greater Tampa Bay Area              Tampa Bay            NA    NA    Tanzania
 8 Greater Sacramento                  Sacramento           NA    NA    United …
 9 United States                       United States        NA    NA    United …
10 Atlanta Metropolitan Area           Atlanta              NA    NA    United …
# … with 21 more rows

huangapple
  • 本文由 发表于 2023年3月3日 22:15:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/75628201.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定