Extracting a data frame filtered by country names from a larger data frame in R

Question

I have a large (circa 9500 rows) untidy dataset containing country names along with several variables and numerical output. I've made an example of the data frame as such:

country1 <- c("Arab World", "Caribbean small states", "Central Europe and the Baltics", "Australia", "Brazil", "Sweden")
indicator1 <- c("Age at first marriage, female", "Age at first marriage, male", "Birth rate, crude (per 1,000 people)", "Death rate, crude (per 1,000 people)", "Fertility rate, total (births per woman)", "Hospital beds (per 1,000 people)")
year1 <- c(1960, 1961, 1962, 1963, 1964, 1965)

test <- data.frame(country = country1, indicator = indicator1, year = year1)

I need to extract a smaller data frame from this that keeps only rows for individual countries, e.g. "Sweden", and excludes aggregates of countries, e.g. "Central Europe".

Would appreciate any assistance in this matter. I am quite new to R so not really sure where to begin, but I would imagine that I would first need to create a new data frame containing rows of all possible country names and then do a left join with my above test data frame. How would I go about getting that initial df of all countries?

Thanks.

Answer 1

Score: 1

You can either create your own list of valid country names or try extracting one from the {maps} package:

```r
library(maps)
# Load the world map database without drawing it
x <- map("world", plot = FALSE)
# Region names stored in the database
country_list <- x$names
```

I'd recommend manually inspecting to see if this list is up to date enough for your data.
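
One way to do that inspection, as a rough sketch: list the names in your data that have no match in the reference list, then decide by eye whether each is an aggregate or just a spelling variant. The data frame below is a trimmed copy of the question's example, and `country_list` here is a hypothetical hand-typed stand-in so the snippet runs without {maps}; with the real list you would keep `x$names` instead.

```r
# Trimmed copy of the example data from the question
test <- data.frame(
  country = c("Arab World", "Caribbean small states",
              "Central Europe and the Baltics", "Australia", "Brazil", "Sweden"),
  year = 1960:1965
)

# Hypothetical stand-in for the reference list (use x$names from {maps} in practice)
country_list <- c("Australia", "Brazil", "Sweden", "Norway")

# Names that appear in the data but not in the reference list
unmatched <- setdiff(unique(test$country), country_list)
print(unmatched)
```

Anything in `unmatched` that is actually a country (for example, a name spelled differently in the two sources) would need to be recoded before subsetting.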

Then subset based on this list of countries:

test_countries <- test[test$country %in% country_list, ]

which gives:

    country                                indicator year
4 Australia     Death rate, crude (per 1,000 people) 1963
5    Brazil Fertility rate, total (births per woman) 1964
6    Sweden         Hospital beds (per 1,000 people) 1965
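
If you go the hand-made list route instead, it may be less work to list the aggregates than the countries, assuming there are fewer of them in your data. A minimal sketch of that variation follows; the `aggregates` vector is a hypothetical list you would maintain yourself, filled here only with the labels from the example.

```r
# Trimmed copy of the example data from the question
test <- data.frame(
  country = c("Arab World", "Caribbean small states",
              "Central Europe and the Baltics", "Australia", "Brazil", "Sweden"),
  year = 1960:1965
)

# Hypothetical hand-maintained vector of aggregate labels to drop
aggregates <- c("Arab World", "Caribbean small states",
                "Central Europe and the Baltics")

# Keep every row whose country is NOT one of the aggregate labels
countries_only <- test[!test$country %in% aggregates, ]
print(countries_only)
```

The trade-off is that any aggregate missing from `aggregates` slips through silently, so it is worth checking `unique(countries_only$country)` afterwards.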

Answer 2

Score: 1

Alternatively, use map_df, which builds the subset one name at a time and returns it as a single data frame, or simply use filter.

Assume you have a vector nam whose elements are the countries you want to keep; you can use it to subset the df:

library(tidyverse)

nam <- c('Brazil','Australia')

new_df <- map_df(nam, \(x) test %>% filter(country==x))

# or, with a single filter() call (rows then keep their original order in test)

new_df <- test %>% filter(country %in% nam)


    country                                indicator year
1    Brazil Fertility rate, total (births per woman) 1964
2 Australia     Death rate, crude (per 1,000 people) 1963
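
For comparison, here is a rough base R analogue of the two approaches, in case loading the tidyverse is not an option. It is only a sketch using a trimmed copy of the question's data; `by_name` and `by_mask` are names made up for this example. It also shows that the two approaches differ in row order: the per-name version follows the order of nam, while the single subset follows the order of test.

```r
# Trimmed copy of the example data from the question
test <- data.frame(
  country = c("Arab World", "Caribbean small states",
              "Central Europe and the Baltics", "Australia", "Brazil", "Sweden"),
  year = 1960:1965
)

nam <- c("Brazil", "Australia")

# One subset per name, stacked in the order of nam (analogue of the map_df version)
by_name <- do.call(rbind, lapply(nam, function(x) test[test$country == x, ]))

# One logical subset; rows stay in the order they have in test (analogue of filter)
by_mask <- test[test$country %in% nam, ]

print(by_name$country)
print(by_mask$country)
```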

huangapple
  • Posted on 2023-07-14 04:50:48
  • Please keep this link when reposting: https://go.coder-hub.com/76683155.html