Confusion with the output of the function str
Question
The data set birth.csv, collected at the Baystate Medical Center, Springfield, USA during 1986, has the following format. After I imported the csv file (using read.csv() with a colClasses specification), the output of the function str() didn't match that of the function head(). For example, the first 6 values of the column low were supposed to be 0, but the sample output generated by str() showed they were 1:
'data.frame': 189 obs. of 9 variables:
$ low : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... # shouldn't they be 0 0 0 0... instead?
$ age : num 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : num 182 155 105 108 107 124 118 103 123 113 ...
$ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
$ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
$ ptl : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ ht : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ ui : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
$ ftv : Factor w/ 3 levels "0","1","2": 1 3 2 3 1 1 2 2 2 1 ...
A data.frame: 6 × 9
  low   age   lwt   race  smoke ptl   ht    ui    ftv
  <fct> <dbl> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
1 0     19    182   2     0     0     0     1     0
2 0     33    155   3     0     0     0     0     2
3 0     20    105   1     1     0     0     0     1
4 0     21    108   1     1     0     0     1     2
5 0     18    107   1     1     0     0     1     0
6 0     21    124   3     0     0     0     0     0
Could someone please explain what happened? If I built a logistic model for that imported dataset, would the result be wrong?
Answer 1
Score: 1
Factors (categorical variables, <fct> in the tibble column class labels) in R are stored internally as integers, with 1 being the first level (or category), 2 the second level, etc., along with a lookup table mapping the integer values to their labels/levels.
str() shows a few of the levels and then the integer values. Most other functions print the labels, not the integer values.
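As a small illustration of the integer-plus-lookup-table storage, the sketch below rebuilds the first six smoke values from the question by hand (the vector is typed in for illustration, not read from birth.csv) and pulls the codes and the labels apart:

```r
# First six values of `smoke` from the question, typed in by hand (illustrative only)
smoke <- factor(c("0", "0", "1", "1", "1", "0"))

codes  <- as.integer(smoke)   # internal integer codes, the numbers str() prints
lookup <- levels(smoke)       # lookup table: position 1 -> "0", position 2 -> "1"

# Indexing the lookup table by the codes recovers the labels that head() prints
labels <- lookup[codes]

print(codes)
print(labels)
```

Here codes corresponds to the 1 1 2 2 2 1 that str() reports for smoke, while labels corresponds to the 0 0 1 1 1 0 that head() reports.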
It's extra confusing in your case because your labels are (character-class) integers starting at 0. For a somewhat clearer example, let's look at a factor with letters as the labels:
x = factor(c("a", "b", "a", "c"))
x
# [1] a b a c
# Levels: a b c
str(x)
# Factor w/ 3 levels "a","b","c": 1 2 1 3
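On the modelling part of the question, a hedged sketch: for a binomial glm(), R documents that a factor response treats the first level as failure and all other levels as success, so a "0"/"1" factor should give the same fit as the numeric 0/1 values. The data below are made up purely for illustration (they are not taken from birth.csv), and the variable names are assumptions:

```r
# Made-up toy data, NOT from birth.csv: a 0/1 outcome stored as a factor
low <- factor(c(0, 0, 0, 1, 1, 0, 1, 1))
age <- c(19, 33, 20, 21, 18, 21, 22, 17)

# Recover the original numbers by going through the labels, not the codes
values <- as.integer(as.character(low))

# Fit the same logistic model with the factor response and the numeric response
fit_factor  <- glm(low ~ age, family = binomial)
fit_numeric <- glm(values ~ age, family = binomial)

# TRUE if both fits give the same coefficients
same_fit <- isTRUE(all.equal(unname(coef(fit_factor)), unname(coef(fit_numeric))))
print(same_fit)
```

Note that as.integer(low) on its own would return the codes 1 and 2 rather than 0 and 1, which is why the conversion goes through as.character() first.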