2023-07-10 12:51:05

Confusion with the output of the function str

Question

The data set birth.csv, collected at the Baystate Medical Center, Springfield, USA, during 1986, has the following format

After I imported the csv file (using read.csv() with a colClasses specification), the output of str() didn't match that of head(). For example, the first 6 values of the column low were supposed to be 0, but the sample output generated by str() showed them as 1:

'data.frame':	189 obs. of  9 variables:
 $ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...  # shouldn't they be 0 0 0 0... instead?
 $ age  : num  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : num  182 155 105 108 107 124 118 103 123 113 ...
 $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
 $ ptl  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
 $ ftv  : Factor w/ 3 levels "0","1","2": 1 3 2 3 1 1 2 2 2 1 ...
A data.frame: 6 × 9
low	age	lwt	race smoke	ptl	ht	ui	ftv
<fct><dbl><dbl><fct><fct><fct><fct><fct><fct>
1	0	19	182	2	 0	    0	0	1	0
2	0	33	155	3	 0	    0	0	0	2
3	0	20	105	1	 1	    0	0	0	1
4	0	21	108	1	 1	    0	0	1	2
5	0	18	107	1	 1	    0	0	1	0
6	0	21	124	3	 0	    0	0	0	0

Could someone please explain what happened? If I built a logistic model for that imported dataset, would the result be wrong?

Answer 1

Score: 1

Factors (categorical variables, <fct> in the tibble column class labels) in R are stored internally as integers with 1 being the first level (or category), 2 the second level, etc., along with a lookup table mapping the integer values to their labels/levels.
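As a minimal sketch of that internal representation, the snippet below builds a small factor whose labels are "0" and "1" (the vector `f` is made up for illustration and is not taken from birth.csv) and pulls apart the integer codes and the lookup table:

```r
# Hypothetical factor with labels "0" and "1", mimicking the column `low`
f <- factor(c("0", "0", "1", "0"))

# Internal integer codes: the label "0" is level 1, the label "1" is level 2
as.integer(f)
# [1] 1 1 2 1

# Lookup table that maps codes back to labels
levels(f)
# [1] "0" "1"

# Indexing the lookup table with the codes recovers the labels
levels(f)[as.integer(f)]
# [1] "0" "0" "1" "0"
```

So a code of 1 next to a label of "0" is not a contradiction; they are two views of the same value.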

str() prints a few of the levels and then the integer values. Most other functions print the labels, not the integer values.

It's extra confusing in your case because your labels are (character-class) integers starting at 0. For a somewhat clearer example, let's look at a factor with letters as the labels:

x = factor(c("a", "b", "a", "c"))
x
# [1] a b a c
# Levels: a b c
str(x)
# Factor w/ 3 levels "a","b","c": 1 2 1 3
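On the logistic model part of the question, here is a rough sketch using simulated data (the variables `age`, `low_num` and `low_fct` below are generated for illustration and are not the real birth.csv values). With family = binomial, glm() treats the first level of a factor response as "failure" and every other level as "success", so a factor with levels "0" and "1" should give the same fit as a numeric 0/1 response:

```r
set.seed(1)  # simulated data, for illustration only

n <- 200
age <- round(runif(n, 15, 45))
low_num <- rbinom(n, size = 1, prob = plogis(-2 + 0.05 * age))  # numeric 0/1
low_fct <- factor(low_num, levels = c(0, 1))                    # factor "0"/"1"

# The integer codes are 1/2, but the labels are still "0"/"1"
table(code = as.integer(low_fct), label = low_fct)

# Fit the same model with a factor response and with a numeric response
fit_fct <- glm(low_fct ~ age, family = binomial)
fit_num <- glm(low_num ~ age, family = binomial)

# The estimated coefficients agree
all.equal(coef(fit_fct), coef(fit_num))
# [1] TRUE

# To get the original numbers back, convert via character,
# because as.numeric(low_fct) alone would return the codes 1/2
recovered <- as.numeric(as.character(low_fct))
```

The one place where the codes can bite is a direct as.numeric() on the factor, which is why the last line goes through as.character() first.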
