Sampling Data, Update number of factor levels
Question
I have a dataset with 1000 observations and 10 columns.
The data can be found here. Every column represents a categorical variable with a different number of factor levels, and each is defined as a factor.
Data is imported from csv like this:
raw_data <- read.csv2(file.path(directory, "name.csv"), colClasses = rep("factor", 10), na.strings = "")
R imports the data correctly, and str(raw_data) shows how many factor levels each variable has. Perfect so far.
I then take a random sample of 100 observations from this data and save them to a new data frame.
comp_data <- raw_data[sample(1:nrow(raw_data), 100), ]
The problem that now arises is that, during sampling, it may happen that not a single observation with a specific factor level is drawn.
Let's say Education has 5 factor levels in the complete dataset (raw_data). Only 76 of the 1000 individuals have "Partial Highschool" as their Education, so it is reasonable to assume that sometimes none of those 76 observations will be drawn. The sampled dataset (comp_data) then contains only 4 levels, but str(comp_data) still shows 5 factor levels. The number of levels has not been updated, even though in reality there are now only 4 levels and not 5.
I need a way to automatically update the number of factor levels after sampling.
This matters later on, when I calculate contingency coefficients and Cramér's V between pairs of variables: those functions return NA when the situation described above occurs.
Thanks for the help.
Kevin
Answer 1
Score: 1
The problem is that R cannot know which levels a factor is supposed to have unless you tell it. To solve this, create a list of levels whose element names correspond to the variable names in the data. You can then apply it to each version of the data you download.
lev.lst <- list(Marital.Status=c("Married", "Single", "Divorced", "Widowed"),
                Gender=c("Female", "Male"),
                Children=as.character(1:10),
                Education=c("Partial High School", "High School", "Partial College", "Bachelors", "Masters", "Graduate Degree"),
                Occupation=c("Skilled Manual", "Clerical", "Professional", "Manual", "Management"),
                Home.Owner=c("Yes", "No"),
                Cars=as.character(1:5),
                Commute.Distance=c("0-1 Miles", "1-2 Miles", "2-5 Miles", "5-10 Miles", "10+ Miles"),
                Region=c("Europe", "Pacific", "North America", "Asia", "Africa"),
                Purchased.Bike=c("No", "Yes"))
Next, use read.csv with the colClasses of the factors set to 'character'.
dat <- read.csv('https://pastebin.com/raw/ut447XdE', sep=';', colClasses='character')
Now, to avoid a mess, create a vector with the names of the factor variables.
facs <- names(lev.lst)
Finally, use the factor function in Map on the facs columns of your data frame, with lev.lst supplying the respective levels= argument.
dat[facs] <- Map(factor, dat[facs], lev.lst)
This gives:
str(dat)
# List of 10
# $ Marital.Status : Factor w/ 4 levels "Married","Single",..: 1 1 1 2 2 1 2 1 1 1 ...
# $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 2 2 2 ...
# $ Children : Factor w/ 10 levels "1","2","3","4",..: 1 3 5 NA NA 2 2 1 2 2 ...
# $ Education : Factor w/ 6 levels "Partial High School",..: 4 3 3 4 4 3 2 4 1 3 ...
# $ Occupation : Factor w/ 5 levels "Skilled Manual",..: 1 2 3 3 2 4 5 1 2 4 ...
# $ Home.Owner : Factor w/ 2 levels "Yes","No": 1 1 2 1 2 1 1 1 1 1 ...
# $ Cars : Factor w/ 5 levels "1","2","3","4",..: NA 1 2 1 NA NA 4 NA 2 1 ...
# $ Commute.Distance: Factor w/ 5 levels "0-1 Miles","1-2 Miles",..: 1 1 3 4 1 2 1 1 4 1 ...
# $ Region : Factor w/ 5 levels "Europe","Pacific",..: 1 1 1 2 1 1 2 1 2 1 ...
# $ Purchased.Bike : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 2 1 2 ...
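One thing to keep in mind with this approach: factor() turns any value that is not among the supplied levels into NA. The NA entries in Children and Cars above may be such cases (for instance a "0" that lev.lst does not list), though that is only a guess, since the raw values are not shown here. A minimal sketch with made-up values for checking this before converting:

```r
# Hypothetical values, not taken from the real data;
# "0" is deliberately absent from the level list.
children <- c("1", "0", "3", "2", "0")
lev <- as.character(1:10)

# Values that are not in `lev` silently become NA.
f <- factor(children, levels = lev)
sum(is.na(f))

# List the values that the level list does not cover.
unlisted <- setdiff(unique(children), lev)
unlisted
```

If unlisted is not empty, the corresponding entry of the level list probably needs extending before the Map call.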
Answer 2
Score: 0
I've found an answer myself that is much shorter and less cumbersome.
The droplevels() function drops all unused levels from a data frame.
Application:
df <- droplevels(df)
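A small sketch of the effect, using a made-up data frame in place of raw_data (the column values and the fixed row selection are assumptions chosen so that the rare level is guaranteed to be missing; with sample() this would only happen some of the time):

```r
# Toy stand-in for raw_data: one rare Education level in the last row.
raw_data <- data.frame(
  Education = factor(c(rep("High School", 60), rep("Bachelors", 39),
                       "Partial High School")),
  Gender    = factor(rep(c("Female", "Male"), 50))
)

# Subset that excludes the rare level (deterministic on purpose).
comp_data <- raw_data[1:99, ]
levels_before <- nlevels(comp_data$Education)  # unused level is still counted
table(comp_data$Education)                     # shows a zero count for it

# Drop unused levels in every factor column of the data frame.
comp_data <- droplevels(comp_data)
levels_after <- nlevels(comp_data$Education)

c(before = levels_before, after = levels_after)
```

Here levels_before is 3 and levels_after is 2, so str(comp_data) reports only the levels that actually occur in the sample.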