How to get the name of the country from the city and save in a column using r?
Question
I have a dataset whose location values come in one of the following formats:
1. city, province/state, country
2. city
3. province/state
4. country
Much of the dataset has already been cleaned using the method below; I kept only the rows that are still not clean.
library(dplyr)    # %>% and mutate()
library(tibble)   # add_column()
library(stringr)  # str_replace_all()
library(tidyr)    # separate()

## first, I tried to clean by removing some texts that are not helpful
df <- df %>%
  add_column(location_clean = trimws(str_replace_all(.$location, "Greater|Metropolitan|Metro|Area", "")),
             .after = "location")
## separate the location into locale, area, and country. locale will be used to gather the country names in the next step
df <- df %>%
  separate(location_clean,
           c("locale", "area1", "area2"), sep = ",",
           remove = FALSE, extra = "merge", fill = "right") %>%
  mutate(locale = trimws(locale), area1 = trimws(area1), area2 = trimws(area2))
df <- structure(list(location = c("United States", "Israel", "Greater Stockholm Metropolitan Area",
"Greater Chicago Area", "United States", "Greater Minneapolis-St. Paul Area",
"Greater Tampa Bay Area", "Greater Sacramento", "United States",
"Atlanta Metropolitan Area", "Kansas City Metropolitan Area",
"Kansas City Metropolitan Area", "Charlotte Metro", "San Diego Metropolitan Area",
"Greater Hamburg Area", "United Arab Emirates", "Greater Stockholm Metropolitan Area",
"Raleigh-Durham-Chapel Hill Area", "Slovakia", "Latvia", "Greater Sydney Area",
"Sweden", "Romania", "Canada", "United States", "Metro Jacksonville",
"Metro Jacksonville", "Greater Tampa Bay Area", "Singapore",
"Greater Bolzano Metropolitan Area", "Greater Scranton Area"),
locale = c("United States", "Israel", "Stockholm", "Chicago",
"United States", "Minneapolis-St. Paul", "Tampa Bay", "Sacramento",
"United States", "Atlanta", "Kansas City", "Kansas City",
"Charlotte", "San Diego", "Hamburg", "United Arab Emirates",
"Stockholm", "Raleigh-Durham-Chapel Hill", "Slovakia", "Latvia",
"Sydney", "Sweden", "Romania", "Canada", "United States",
"Jacksonville", "Jacksonville", "Tampa Bay", "Singapore",
"Bolzano", "Scranton"), area1 = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), area2 = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), country = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
)), row.names = c(NA, -31L), class = c("tbl_df", "tbl", "data.frame"
))
The df now consists of the original location and the locale separated from the location column. I can't work out how to get the country name from the city or country name in the locale column (or the location column).
I have tried using world.cities from the maps package, but something is not working correctly when I mutate the country column to get the country name from the city names in the locale column.
For example: when I run the following code, I get inconsistent results.
library(maps)  # provides world.cities

get_country_name <- function(city) {
  w <- world.cities %>%
    filter(name == city) %>%
    arrange(desc(pop)) %>%
    head(1) %>%
    pull(country.etc)
  return(w)
}
df %>%
  mutate(country = ifelse(is.na(country), unlist(sapply(locale, get_country_name)), country))
Another issue is that there may be multiple countries with the same city name. One approach I'm taking is to select the country by population, essentially assuming that only people from big cities are in the database. If there's a better way to do this, I'd like to hear it.
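A possible cause of the inconsistency (an assumption, since the actual output is not shown): when a locale has no exact match in world.cities, get_country_name() returns character(0); sapply() then returns a list, and unlist() drops the empty elements, so the vector no longer lines up with the rows and ifelse() silently recycles it. The sketch below uses a small made-up table in place of world.cities (the cities data frame and its population figures are hypothetical) and returns NA for unmatched names, so the result keeps one value per row.

```r
# Toy stand-in for maps::world.cities; rows and populations are made up
cities <- data.frame(
  name        = c("Hamburg", "Hamburg", "Sydney"),
  country.etc = c("Germany", "USA", "Australia"),
  pop         = c(1700000, 10000, 4400000),
  stringsAsFactors = FALSE
)

get_country_name <- function(city) {
  hits <- cities[cities$name == city, ]
  if (nrow(hits) == 0) return(NA_character_)  # no match: keep a placeholder
  hits$country.etc[which.max(hits$pop)]       # most populous match wins
}

locale <- c("Hamburg", "Tampa Bay", "Sydney")

# vapply() always returns exactly one value per input, so lengths stay aligned
country <- vapply(locale, get_country_name, character(1), USE.NAMES = FALSE)
country
# "Germany" NA "Australia"
```

Because the result always has the same length as locale, it can be used directly inside mutate() without unlist().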
Output
The expected result is the data frame with the correct country name filled in.
df <- structure(list(location = c("United States", "Israel", "Greater Stockholm Metropolitan Area",
"Greater Chicago Area", "United States", "Greater Minneapolis-St. Paul Area",
"Greater Tampa Bay Area", "Greater Sacramento", "United States",
"Atlanta Metropolitan Area", "Kansas City Metropolitan Area",
"Kansas City Metropolitan Area", "Charlotte Metro", "San Diego Metropolitan Area",
"Greater Hamburg Area", "United Arab Emirates"), locale = c("United States",
"Israel", "Stockholm", "Chicago", "United States", "Minneapolis-St. Paul",
"Tampa Bay", "Sacramento", "United States", "Atlanta", "Kansas City",
"Kansas City", "Charlotte", "San Diego", "Hamburg", "United Arab Emirates"
), area1 = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), area2 = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
country = c("United States", "Israel", "Sweden",
"United States", "United States", "United States", "United States",
"United States", "United States", "United States", "United States",
"United States", "United States", "United States", "Germany",
"United Arab Emirates")), row.names = c(NA, -16L), class = c("tbl_df",
"tbl", "data.frame"))
Answer 1
Score: 1
With adist, using data from City_and_province_list.csv reduced to a unique list of countries (modified on the command line).
This also simplifies your city names further by cutting off everything from the first dash onward; see sub("-.*", "", x).
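As a quick check of what that sub() call does, here it is applied to a few locale values taken from the question:

```r
locale <- c("Minneapolis-St. Paul", "Raleigh-Durham-Chapel Hill", "Kansas City")

# "-.*" matches the first dash and everything after it; names without a dash
# are returned unchanged
simplified <- sub("-.*", "", locale)
simplified
# "Minneapolis" "Raleigh" "Kansas City"
```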
Note that you're using "Tampa Bay" but it should be "Tampa". That's why it's associated with "Mbamba Bay, Tanzania".
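The mismatch can be reproduced in isolation. adist() computes a generalized Levenshtein edit distance, and with just these two candidates (picked here for illustration, not the full lookup table) "Mbamba Bay" is closer to "Tampa Bay" than "Tampa" is:

```r
# Edit distance from "Tampa Bay" to each candidate city name
d <- adist("Tampa Bay", c("Tampa", "Mbamba Bay"), ignore.case = TRUE)
d
# 1 x 2 matrix: 4 (delete " Bay") and 3 (one insertion, two substitutions)

closest <- which.min(d)
closest
# 2, i.e. "Mbamba Bay"
```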
Only showing columns locale and country
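# Tab-separated lookup table; the code below expects City and Country columns
c_c <- read.csv("City_and_province_list.mod", header = TRUE, sep = "\t")
# Add "United States" as a lookup key so the bare country name also matches
c_c <- rbind(c_c, c("United States", "United States of America"))
# For each locale, pick the country of the entry with the smallest edit distance
cbind(df[, 1:2],
      country = sapply(df$locale, function(x)
        c_c$Country[which.min(adist(sub("-.*", "", x), c_c$City,
                                    ignore.case = TRUE))]))[, 2:3]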
   locale                     country
 1 United States              United States of America
 2 Israel                     Israel
 3 Stockholm                  Sweden
 4 Chicago                    United States of America
 5 United States              United States of America
 6 Minneapolis-St. Paul       United States of America
 7 Tampa Bay                  Tanzania
 8 Sacramento                 United States of America
 9 United States              United States of America
10 Atlanta                    United States of America
11 Kansas City                United States of America
12 Kansas City                United States of America
13 Charlotte                  United States of America
14 San Diego                  United States of America
15 Hamburg                    Germany
16 United Arab Emirates       United Arab Emirates
17 Stockholm                  Sweden
18 Raleigh-Durham-Chapel Hill United States of America
19 Slovakia                   Slovakia
20 Latvia                     Latvia
21 Sydney                     Australia
22 Sweden                     Sweden
23 Romania                    Romania
24 Canada                     Canada
25 United States              United States of America
26 Jacksonville               United States of America
27 Jacksonville               United States of America
28 Tampa Bay                  Tanzania
29 Singapore                  Singapore
30 Bolzano                    Italy
31 Scranton                   United States of America
Same approach using dplyr
library(dplyr)
df %>%
  rowwise() %>%   # adist() is evaluated one locale at a time
  mutate(country = c_c$Country[which.min(
    adist(sub("-.*", "", locale), c_c$City, ignore.case = TRUE))]) %>%
  ungroup()
# A tibble: 31 × 5
   location                            locale               area1 area2 country
   <chr>                               <chr>                <chr> <chr> <chr>
 1 United States                       United States        NA    NA    United …
 2 Israel                              Israel               NA    NA    Israel
 3 Greater Stockholm Metropolitan Area Stockholm            NA    NA    Sweden
 4 Greater Chicago Area                Chicago              NA    NA    United …
 5 United States                       United States        NA    NA    United …
 6 Greater Minneapolis-St. Paul Area   Minneapolis-St. Paul NA    NA    United …
 7 Greater Tampa Bay Area              Tampa Bay            NA    NA    Tanzania
 8 Greater Sacramento                  Sacramento           NA    NA    United …
 9 United States                       United States        NA    NA    United …
10 Atlanta Metropolitan Area           Atlanta              NA    NA    United …
# … with 21 more rows
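Since which.min() always returns some row, a poor match such as "Tampa Bay" still gets a country. One way to guard against that, sketched here under assumptions, is to return NA when the best distance exceeds a cut-off. The small c_c table below is a hypothetical stand-in for City_and_province_list.mod, and closest_country() and max_dist are illustrative names and values, not part of the approach above.

```r
# Hypothetical lookup table standing in for City_and_province_list.mod
c_c <- data.frame(
  City    = c("Tampa", "Mbamba Bay", "Hamburg", "Stockholm"),
  Country = c("United States of America", "Tanzania", "Germany", "Sweden"),
  stringsAsFactors = FALSE
)

# Return the closest country, or NA when even the best match is too far away.
# max_dist = 2 is an arbitrary illustrative cut-off, not a recommended value.
closest_country <- function(x, max_dist = 2) {
  d <- adist(sub("-.*", "", x), c_c$City, ignore.case = TRUE)
  i <- which.min(d)
  if (d[i] > max_dist) NA_character_ else c_c$Country[i]
}

result <- vapply(c("Hamburg", "Stokholm", "Tampa Bay"), closest_country,
                 character(1), USE.NAMES = FALSE)
result
# "Germany" "Sweden" NA
```

Rows that come back as NA can then be reviewed by hand or cleaned further (for example, "Tampa Bay" to "Tampa") before matching again.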