July 14, 2023, 04:50:48

Extracting a data frame filtered by country names from a larger data frame in R

Question

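I have a large (about 9,500 rows), untidy dataset containing country names along with several variables and numerical output. Here is a small example of the data frame: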

country1 <- c("Arab World", "Caribbean small states", "Central Europe and the Baltics", "Australia", "Brazil", "Sweden")
indicator1 <- c("Age at first marriage, female", "Age at first marriage, male", "Birth rate, crude (per 1,000 people)", "Death rate, crude (per 1,000 people)", "Fertility rate, total (births per woman)", "Hospital beds (per 1,000 people)")
year1 <- c(1960, 1961, 1962, 1963, 1964, 1965)

test <- data.frame(country=country1, indicator=indicator1, year=year1)

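I need to extract a smaller data frame from this that contains only rows for individual countries, e.g. "Sweden", and excludes aggregates of countries, e.g. "Central Europe".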

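One possible approach is to build a data frame of country names, left-join it to the data, and then filter out the aggregates:

# Build a data frame of the distinct country names found in the data
all_countries <- data.frame(country = unique(test$country))

# Left join: keep every row of test
filtered_data <- merge(test, all_countries, by = "country", all.x = TRUE)

# Drop rows whose country name matches the aggregate "Central Europe"
filtered_data <- filtered_data[!grepl("Central Europe", filtered_data$country), ]

# Print the result
print(filtered_data)

Because all_countries is built from test itself, the join keeps every row, so the grepl() step is what actually removes an aggregate, and only the one matching "Central Europe". Other aggregates such as "Arab World" remain unless all_countries comes from an independent list of real country names.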

Would appreciate any assistance in this matter. I am quite new to R so not really sure where to begin, but I would imagine that I would first need to create a new data frame containing rows of all possible country names and then do a left join with my above test data frame. How would I go about getting that initial df of all countries?

Thanks.

Answer 1

Score: 1

You can either create your own list of valid country names or try extracting one from the {maps} package:

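```r
library(maps)

# Read the "world" map database without drawing it, then take its region names
x <- map("world", plot = FALSE)
country_list <- x$names
```

I'd recommend manually inspecting this list to check that it is up to date enough for your data.

Then subset based on this list of countries:

```r
test_countries <- test[test$country %in% country_list, ]
```

which gives:

```
    country                                indicator year
4 Australia     Death rate, crude (per 1,000 people) 1963
5    Brazil Fertility rate, total (births per woman) 1964
6    Sweden         Hospital beds (per 1,000 people) 1965
```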
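
The manual inspection suggested above could look something like the following base R sketch. It is only an illustration: `valid_countries` is a hypothetical, hand-written stand-in for whichever reference list you end up using (it is not taken from {maps}), and the `test` data frame from the question is rebuilt so the snippet runs on its own.

```r
# Rebuild the example data frame from the question
test <- data.frame(
  country = c("Arab World", "Caribbean small states",
              "Central Europe and the Baltics", "Australia", "Brazil", "Sweden"),
  indicator = c("Age at first marriage, female", "Age at first marriage, male",
                "Birth rate, crude (per 1,000 people)",
                "Death rate, crude (per 1,000 people)",
                "Fertility rate, total (births per woman)",
                "Hospital beds (per 1,000 people)"),
  year = c(1960, 1961, 1962, 1963, 1964, 1965),
  stringsAsFactors = FALSE
)

# Hypothetical reference list (assumption: replace with your real list)
valid_countries <- c("Australia", "Brazil", "Sweden", "Norway")

# Names present in the data that the reference list does not recognise
unmatched <- setdiff(unique(test$country), valid_countries)
print(unmatched)

# Keep only the rows whose country is in the reference list
test_countries <- test[test$country %in% valid_countries, ]
print(test_countries)
```

Reviewing `unmatched` before subsetting shows whether a dropped name is a genuine aggregate or a real country that is simply spelled differently in the reference list.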


Answer 2

Score: 1

Alternatively, use map_df(), which builds the subset as a data frame, or simply use filter().

This assumes you have a vector 'nam' whose elements are the countries to keep, which you can use to subset the data frame:

library(tidyverse)

nam <- c('Brazil','Australia')

new_df <- map_df(nam, \(x) test %>% filter(country==x))

# or

new_df <- test %>% filter(country %in% nam)


    country                                indicator year
1    Brazil Fertility rate, total (births per woman) 1964
2 Australia     Death rate, crude (per 1,000 people) 1963
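
One difference between the two variants is row order: looping over `nam` returns rows in the order of `nam`, while an `%in%` filter keeps the rows in the order they have in `test`. Below is a minimal base R sketch of the same two ideas, swapping in `lapply()` plus `rbind()` for `map_df()` and bracket subsetting for `filter()`. The cut-down `test` data frame is an assumption made so the snippet runs on its own.

```r
# Cut-down stand-in for the question's data frame
test <- data.frame(
  country = c("Arab World", "Australia", "Brazil", "Sweden"),
  year = c(1960, 1963, 1964, 1965),
  stringsAsFactors = FALSE
)
nam <- c("Brazil", "Australia")

# One subset per name, stacked in the order of `nam`
by_name <- do.call(rbind, lapply(nam, function(x) test[test$country == x, ]))

# Single vectorised filter; rows stay in the order they have in `test`
by_filter <- test[test$country %in% nam, ]

print(by_name$country)    # Brazil, then Australia
print(by_filter$country)  # Australia, then Brazil
```

If the order of the result matters, the looped form lets `nam` control it. Otherwise the single `%in%` filter is the simpler choice.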
