2023-04-17 12:43:28

Write an object of classes 'data.table' and 'data.frame' to an external file

Question

A program that I am using in R generates an object of classes 'data.table' and 'data.frame'.

Its structure looks like this:

str(maytable)
Classes ‘data.table’ and 'data.frame':  106876 obs. of  17 variables:
 $ col1  : num  1 2 3 4  ...
 $ col2  : chr  "Chr00c00001" "Chr00c00001" "Chr00c00001" "Chr00c00001" ...
 $ col3  : num  1 2 3 4 ...
 $ col4  :List of 106876
  ..$ : chr
  ..$ : chr
  ..$ : chr
  ..$ : chr "Chr1g00005011"
  .. [list output truncated]
 $ col5  :List of 106876
  ..$ : chr "Chr1g00000491"
  ..$ : chr
  ..$ : chr
  ..$ : chr "Chr1g00000501"
  .. [list output truncated]

I would like to write this to a text file in which each col is a column and its values are in the rows, using a function like write.table, to get something like this:

col1    col2      col3       col4       col5
  1  Chr00c00001   1                   Chr1g00000491
  2  Chr00c00001   2
  3  Chr00c00001   3
  4  Chr00c00001   4    Chr1g00005011  Chr1g00000501

I am not familiar with objects of classes data.table and data.frame that apparently contain lists, with further lists inside the elements of those lists. It would be great if someone could advise me on what kind of object I have and how to convert it into a format I can write to a text file.

Answer 1

Score: 1

These are known as "list-columns" or "nested data" (or both). They are useful for many things, but many functions that work well on non-nested data.frame-like objects do not know how to handle list-columns or nested data. There is a good reason for this: simple (non-list) columns are just vectors, so anything that works on a vector works on a column of a frame. A list-column, however, only requires that the list be as long as the frame has rows; each element of that list can hold anything. This includes NULL, arbitrary-length vectors, graphic objects (grobs), other data.frame-like objects, arbitrarily nested lists, etc.
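As a small illustration of that last point, here is a base-R sketch (the frame `df` and its column `stuff` are made-up names, not part of your data) in which one list-column holds a NULL, an integer vector, and a nested list:

```r
# A hypothetical three-row frame with one list-column.
df <- data.frame(id = 1:3)

# Each element can be anything, as long as the list has one element per row.
df$stuff <- list(NULL, 1:5, list(a = "x"))

# lengths() reports the length of each element: 0, 5 and 1 here.
elem_lengths <- lengths(df$stuff)
print(elem_lengths)
str(df)
```

Functions that expect an atomic vector in every column have no obvious way to treat a column like `stuff`, which is why many of them fail or give surprising output.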

In your case, though, it looks like the elements of your list-columns are vectors of length 0 or 1. Typical unnesting drops rows whose element has length 0, so we need to take a little care and replace empty elements with something reasonable, whether NA or an empty string (since your list-columns appear to be string-based).
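To see the row-dropping behavior in isolation, here is a minimal sketch on a made-up tibble (`df`, `id` and `x` are illustrative names). It relies on the `keep_empty` argument of tidyr::unnest, which defaults to FALSE:

```r
library(tidyr)

# Hypothetical data: row 2 has an empty (NULL) element, row 3 has two values.
df <- tibble::tibble(id = 1:3, x = list("a", NULL, c("b", "c")))

# Default: the row with the length-0 element disappears.
dropped <- unnest(df, x)

# keep_empty = TRUE keeps that row and fills it with NA.
kept <- unnest(df, x, keep_empty = TRUE)

print(dropped)
print(kept)
```

Passing `keep_empty = TRUE` is an alternative to replacing the empty elements by hand; the manual replacement just makes the choice of sentinel value explicit.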

I think your data looks similar to this:

library(data.table)
# small stand-in for your data: two list-columns with length-0 or length-1 elements
obj <- data.table(col1=1:4, col2="c001", col3=11:14, col4=list(NULL, NULL, NULL, "5011"), col5=list("491", NULL, NULL, "501"))
obj
#     col1   col2  col3   col4   col5
#    <int> <char> <int> <list> <list>
# 1:     1   c001    11           491
# 2:     2   c001    12              
# 3:     3   c001    13              
# 4:     4   c001    14   5011    501
str(obj)
# Classes 'data.table' and 'data.frame':	4 obs. of  5 variables:
#  $ col1: int  1 2 3 4
#  $ col2: chr  "c001" "c001" "c001" "c001"
#  $ col3: int  11 12 13 14
#  $ col4:List of 4
#   ..$ : NULL
#   ..$ : NULL
#   ..$ : NULL
#   ..$ : chr "5011"
#  $ col5:List of 4
#   ..$ : chr "491"
#   ..$ : NULL
#   ..$ : NULL
#   ..$ : chr "501"

I recognize that this sample explicitly shows NULL whereas yours does not. This changes nothing: if your data has "" instead of NULL, then my "guard against length 0" step will do no harm.

I think it is safer to first confirm that the data will flatten simply. That is, if any of the elements are length-0 (as I mentioned above), we need to make them length-1 with some sentinel value for emptiness. If any element is length 2 or more, though, that row would need to be repeated once per value. If this is known and desired behavior, then all is good; if not, you need to think about how to aggregate/reduce the data, e.g., min, mean, first, last, or sample.
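One way to run that check is sketched below on a cut-down copy of the sample data. The names `islist` and `len_range` are just labels chosen here; the idea is to find the list-columns and look at the smallest and largest element length in each:

```r
library(data.table)

# Cut-down copy of the sample data (only the columns needed for the check).
obj <- data.table(col1 = 1:4,
                  col4 = list(NULL, NULL, NULL, "5011"),
                  col5 = list("491", NULL, NULL, "501"))

# Names of the columns that are lists.
islist <- names(which(sapply(obj, is.list)))

# Row 1 of the result is the minimum element length, row 2 the maximum.
len_range <- sapply(islist, function(nm) range(lengths(obj[[nm]])))
print(len_range)
```

A minimum of 0 means there are empty elements to fill in; a maximum above 1 means some rows would be repeated when the column is unnested.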

Another note: you have two (or more) list-columns. The solution becomes a lot murkier if you have length > 1 in one list-column and a different length in another list-column. If they are both the same length, then we may be good with the assumption that two list-columns with length-n elements in the same rows should expand the same number of rows. However, if they are both length > 1 but different lengths, then ... do we do a cartesian expansion? Truncation? Lots of ways this can go wrong. (I won't "fix" this condition here.)
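A rough way to find such rows before unnesting, using two made-up list-columns held as plain lists (`col4`, `col5`, `n4`, `n5` and `conflict` are illustrative names): flag every row where the two element lengths differ and at least one of them is greater than 1.

```r
# Hypothetical list-columns with four rows each.
col4 <- list("a", c("b", "c"), c("d", "e"), "f")
col5 <- list("x", c("y", "z"), c("p", "q", "r"), c("s", "t", "u"))

n4 <- lengths(col4)
n5 <- lengths(col5)

# Rows 1 and 2 agree (1 vs 1, 2 vs 2); rows 3 and 4 do not (2 vs 3, 1 vs 3).
conflict <- which(n4 != n5 & pmax(n4, n5) > 1L)
print(conflict)
```

This only locates the ambiguous rows; what to do with them (expand, truncate, aggregate) is still a decision about the data, not something the code can settle.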

For now, I'll assume:

length 0 elements should be NA (actually NA_character_);
length 2+ elements should expand the number of rows.

Again, if your data are all length-1 vectors then this will do no harm.

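# names of the list-columns (here "col4" and "col5")
islist <- names(which(sapply(obj, is.list)))
# replace every length-0 element with NA_character_ so its row survives unnesting
obj[, (islist) := lapply(.SD, function(z) replace(z, lengths(z) == 0L, NA_character_)), .SDcols = islist]
obj
#     col1   col2  col3   col4   col5
#    <int> <char> <int> <list> <list>
# 1:     1   c001    11     NA    491
# 2:     2   c001    12     NA     NA
# 3:     3   c001    13     NA     NA
# 4:     4   c001    14   5011    501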

From here, we can use tidyr::unnest:

tidyr::unnest(obj, c(col4, col5))
# # A tibble: 4 × 5
#    col1 col2   col3 col4  col5 
#   <int> <chr> <int> <chr> <chr>
# 1     1 c001     11 <NA>  491  
# 2     2 c001     12 <NA>  <NA> 
# 3     3 c001     13 <NA>  <NA> 
# 4     4 c001     14 5011  501

Notice that this converted it from class data.table to class tbl_df; if you intend to continue using the data.table dialect of working on frames, then you'll need either as.data.table or setDT here.
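Once every column is a plain vector, writing the text file is the easy part. The sketch below builds a flat table by hand that mirrors the unnested result (so `flat` and `out_file` are stand-in names, and the temporary file path is only for illustration) and writes it tab-separated with data.table::fwrite, leaving missing values as empty fields:

```r
library(data.table)

# Stand-in for the unnested result: no list-columns left.
flat <- data.table(col1 = 1:4, col2 = "c001", col3 = 11:14,
                   col4 = c(NA, NA, NA, "5011"),
                   col5 = c("491", NA, NA, "501"))

# Illustrative output path; replace with your own file name.
out_file <- tempfile(fileext = ".tsv")

# Tab-separated, no quoting, NA written as an empty field.
fwrite(flat, out_file, sep = "\t", na = "", quote = FALSE)

# Base-R equivalent:
# write.table(flat, out_file, sep = "\t", quote = FALSE, row.names = FALSE, na = "")

written <- readLines(out_file)
print(written)
```

If you start from the tibble returned by tidyr::unnest, wrapping it in as.data.table first gives an object that fwrite accepts in the same way.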
