Sampling Data, Update number of factor levels

Question


I have a dataset with 1000 observations and 10 columns.
The data can be found here. Every column is a categorical variable with a different number of factor levels and is defined as a factor.
The data is imported from a CSV file like this:

raw_data <- read.csv2(file.path(directory, "name.csv"), colClasses = rep("factor", 10), na.strings = "")

R imports the data correctly, and str(raw_data) shows how many factor levels each variable has. Perfect so far.

I then take a random sample of 100 observations from this data and save them to a new data frame.

comp_data <- raw_data[sample(1:nrow(raw_data), 100), ]

The problem is that, during sampling, it may happen that not a single observation with a particular factor level is drawn.
Let's say Education has 5 factor levels in the complete dataset (raw_data), and only 76 of the 1000 individuals have "Partial Highschool" as their Education. It is reasonable to assume that sometimes none of those 76 observations will be drawn. The sampled dataset (comp_data) then contains only 4 levels, but str(comp_data) still reports 5 factor levels. The number of levels is not updated, even though only 4 of the 5 are actually present.
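
The behaviour described above can be reproduced on a small scale. The following is a minimal sketch using a made-up stand-in for the Education column (the values and the names education and subset_edu are illustrative assumptions, not the real data); it shows that subsetting a factor keeps the full set of levels even when one of them no longer occurs.

```r
# Toy stand-in for the Education column (illustrative values only).
education <- factor(c("Bachelors", "High School", "Partial High School",
                      "Bachelors", "High School"))

# Take a subset that happens to contain no "Partial High School" rows.
subset_edu <- education[c(1, 2, 4, 5)]

# Number of levels stored in the factor: still 3.
stored_levels <- nlevels(subset_edu)

# Number of distinct values actually present: 2.
present_levels <- length(unique(as.character(subset_edu)))

# The unused level appears in the table with a count of 0.
counts <- table(subset_edu)
print(counts)
```

The zero count in the table is what later causes trouble for statistics computed from contingency tables.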

I need a way to automatically update the number of factor levels after sampling.

This matters later, when I calculate contingency coefficients and Cramér's V between pairs of variables. Those functions return NA when the situation described above occurs.
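
To illustrate why an unused level can lead to a missing result, here is a hedged sketch. The question does not say which package computes Cramér's V, so cramers_v below is a hypothetical helper built on the base function chisq.test, and the data are invented. An empty row in the contingency table gives expected counts of zero, so the chi-squared statistic becomes 0/0.

```r
# Hypothetical helper: Cramér's V from the uncorrected chi-squared statistic.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Invented data: "PhD" is a declared level that never occurs.
edu <- factor(c("HS", "HS", "BA", "BA", "HS", "BA"),
              levels = c("HS", "BA", "PhD"))
own <- factor(c("Yes", "No", "Yes", "No", "No", "Yes"))

# With the unused level, the empty row makes the statistic NaN (is.na() is TRUE).
v_unused <- cramers_v(edu, own)

# After dropping the unused level, the statistic is well defined.
v_dropped <- cramers_v(droplevels(edu), own)
```

Whether a given package returns NA, NaN, or an error in this situation depends on its implementation; the sketch only shows the underlying arithmetic problem.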

Thanks for the help.
Kevin

Answer 1

Score: 1


The problem is that R cannot know which levels the factors are supposed to have unless you provide that information. To solve this, create a list of levels whose element names correspond to the variable names in the data. You can then apply it to each version of the data you download.

lev.lst <- list(Marital.Status=c("Married", "Single", "Divorced", "Widowed"),
                Gender=c("Female", "Male"),
                Children=as.character(1:10),
                Education=c("Partial High School", "High School", "Partial College", "Bachelors", "Masters", "Graduate Degree"),
                Occupation=c("Skilled Manual", "Clerical", "Professional", "Manual", "Management"),
                Home.Owner=c("Yes", "No"),
                Cars=as.character(1:5),
                Commute.Distance=c("0-1 Miles", "1-2 Miles", "2-5 Miles", "5-10 Miles", "10+ Miles"),
                Region=c("Europe", "Pacific", "North America", "Asia", "Africa"),
                Purchased.Bike=c("No", "Yes"))

Next, use read.csv with colClasses set to 'character', so that the columns are read as plain strings rather than factors.

dat <- read.csv('https://pastebin.com/raw/ut447XdE', sep=';', colClasses='character')

Now, to avoid a mess, create a vector with the names of the factor variables.

facs <- names(lev.lst)

Finally, use Map to apply the factor function to the facs columns of your data frame, with the matching element of lev.lst supplied as the levels= argument for each column.

dat[facs] <- Map(factor, dat[facs], lev.lst)

This gives:

str(dat)
# List of 10
#  $ Marital.Status  : Factor w/ 4 levels "Married","Single",..: 1 1 1 2 2 1 2 1 1 1 ...
#  $ Gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 2 2 2 ...
#  $ Children        : Factor w/ 10 levels "1","2","3","4",..: 1 3 5 NA NA 2 2 1 2 2 ...
#  $ Education       : Factor w/ 6 levels "Partial High School",..: 4 3 3 4 4 3 2 4 1 3 ...
#  $ Occupation      : Factor w/ 5 levels "Skilled Manual",..: 1 2 3 3 2 4 5 1 2 4 ...
#  $ Home.Owner      : Factor w/ 2 levels "Yes","No": 1 1 2 1 2 1 1 1 1 1 ...
#  $ Cars            : Factor w/ 5 levels "1","2","3","4",..: NA 1 2 1 NA NA 4 NA 2 1 ...
#  $ Commute.Distance: Factor w/ 5 levels "0-1 Miles","1-2 Miles",..: 1 1 3 4 1 2 1 1 4 1 ...
#  $ Region          : Factor w/ 5 levels "Europe","Pacific",..: 1 1 1 2 1 1 2 1 2 1 ...
#  $ Purchased.Bike  : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 2 1 2 ...
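
The Map step can be tried on a small scale. The sketch below uses an invented two-column data frame (toy and toy_levels are illustrative names, not part of the original answer). It also shows a side effect worth knowing about: any value that is not listed in the supplied levels becomes NA, which is presumably where the NA entries in the output above come from.

```r
# Invented character data; "0" is deliberately absent from the declared levels.
toy <- data.frame(Gender = c("Male", "Female", "Male"),
                  Cars = c("0", "2", "1"),
                  stringsAsFactors = FALSE)

# Declared levels, keyed by column name.
toy_levels <- list(Gender = c("Female", "Male"),
                   Cars = as.character(1:5))

# Convert each listed column to a factor with its declared levels.
toy[names(toy_levels)] <- Map(factor, toy[names(toy_levels)], toy_levels)

cars_levels <- levels(toy$Cars)   # all five declared levels are kept
cars_missing <- is.na(toy$Cars)   # the "0" entry has become NA
```

Note that declaring levels explicitly keeps every declared level, including ones that do not occur in the data, so this approach fixes the set of levels rather than shrinking it to what a sample contains.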

Answer 2

Score: 0


I found an answer myself that is much shorter and less cumbersome.
The droplevels() function drops all unused levels from a data frame.
Usage:

df <- droplevels(df)
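
Applied to the sampling workflow from the question, this might look like the following sketch. The data frame here is simulated (its columns, values, sample size, and the seed are assumptions made only so the example runs on its own); with the real data, raw_data would come from the CSV import.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Simulated stand-in for raw_data with one rare Education level.
raw_data <- data.frame(
  Education = factor(c(rep("Bachelors", 60), rep("High School", 39),
                       "Partial High School")),
  Gender = factor(rep(c("Female", "Male"), times = 50))
)

# Draw a random sample of rows; factor columns keep all original levels.
comp_data <- raw_data[sample(nrow(raw_data), 10), ]
before <- sapply(comp_data, nlevels)

# Drop unused levels from every factor column in one call.
comp_data <- droplevels(comp_data)
after <- sapply(comp_data, nlevels)

# After dropping, the level count matches the values actually present.
present <- sapply(comp_data, function(col) length(unique(col)))
print(rbind(before, after, present))
```

droplevels() also works on a single factor, for example comp_data$Education <- droplevels(comp_data$Education), if only some columns should be updated.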

huangapple
  • Posted on 2023-05-21 03:31:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76297016.html