Write an object of classes 'data.table' and 'data.frame' to an external file

Question


A program that I am using in R generates an object of classes 'data.table' and 'data.frame' that looks like this:

str(maytable)
Classes ‘data.table’ and 'data.frame':  106876 obs. of  17 variables:
$ col1  : num  1 2 3 4  ...
$ col2  : chr  "Chr00c00001" "Chr00c00001" "Chr00c00001" "Chr00c00001" ...
$ col3  : num  1 2 3 4 ...
$ col4 :List of 106876
..$ : chr
..$ : chr
..$ : chr 
..$ : chr "Chr1g00005011"
.. [list output truncated]
$ col5 :List of 106876
..$ : chr "Chr1g00000491"
..$ : chr
..$ : chr
..$ : chr "Chr1g00000501"
.. [list output truncated]

I would like to write this to a text file in which each col is a column and the data are in the rows, something like the following, using a function such as write.table:

col1    col2      col3       col4       col5
  1  Chr00c00001   1                   Chr1g00000491
  2  Chr00c00001   2
  3  Chr00c00001   3
  4  Chr00c00001   4    Chr1g00005011  Chr1g00000501

I am not familiar with objects of classes data.table and data.frame that apparently contain lists, with further lists inside the list elements. It would be great if someone could advise me on what kind of object I have and how to convert it into a format that I can write to a text file.

Answer 1

Score: 1


This is the notion of "list-columns" and/or "nested data". It is useful for many things, but many functions that work well on non-nested data.frame-like objects do not know how to handle list-columns or nested data. There is a good reason for this: simple (non-list) columns are just vectors, so anything that works on a vector works on a column of a frame. With a list-column, however, as long as the length of the list equals the number of rows in the frame, you can put anything into each element of that list-column. This includes NULL, arbitrary-length vectors, graphic objects (grobs), other data.frame-like objects, arbitrarily nested lists, etc.
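As a rough illustration of what a list-column is, here is a base R sketch that builds a small frame by hand and checks which of its columns are lists. The frame `df` and its values are invented for illustration; they are not your data.

```r
# Hypothetical toy frame; the values are invented for illustration.
df <- data.frame(col1 = 1:4, col2 = "c001")

# Assigning a list with one element per row creates a list-column.
df$col4 <- list(NULL, NULL, NULL, "5011")

# A regular column is an atomic vector, whereas a list-column is a list.
is_list_col <- vapply(df, is.list, logical(1))
print(is_list_col)

# Each element of a list-column has its own length (here 0 or 1).
print(lengths(df$col4))
```

The same `is.list` check works on a data.table, since a data.table is also a list of columns.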

In your case, though, it looks like your list-columns hold vectors of length 0 or 1. The typical unnesting of such data might remove the rows with length-0 elements, so we need to take a little care by replacing empty elements with something reasonable, whether NA or an empty string (since your list-columns appear to be string-based).

I think your data looks similar to this:

library(data.table)

obj <- data.table(col1=1:4, col2="c001", col3=11:14, col4=list(NULL, NULL, NULL, "5011"), col5=list("491", NULL, NULL, "501"))
obj
#     col1   col2  col3   col4   col5
#    <int> <char> <int> <list> <list>
# 1:     1   c001    11           491
# 2:     2   c001    12              
# 3:     3   c001    13              
# 4:     4   c001    14   5011    501
str(obj)
# Classes 'data.table' and 'data.frame':	4 obs. of  5 variables:
#  $ col1: int  1 2 3 4
#  $ col2: chr  "c001" "c001" "c001" "c001"
#  $ col3: int  11 12 13 14
#  $ col4:List of 4
#   ..$ : NULL
#   ..$ : NULL
#   ..$ : NULL
#   ..$ : chr "5011"
#  $ col5:List of 4
#   ..$ : chr "491"
#   ..$ : NULL
#   ..$ : NULL
#   ..$ : chr "501"

I recognize that this sample explicitly shows NULL whereas yours does not. This changes nothing: if your data has "" instead of NULL, then my "guard against length 0" step will do no harm.

I think it is safest to confirm first that what we have will reduce simply. That is, if any of the elements are length-0 (as I mentioned above), we need to make them length-1 with some sentinel value of emptiness. If any element is length 2 or more, though, that row would need to be repeated once per value. If this is known and desired behavior, then all is good; if not, you need to think about how to aggregate/reduce the data, e.g., min, mean, first, last, or sample.
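One way to run that check is sketched below in base R, on a made-up frame (`df` and its values are hypothetical). It reports, for each list-column, the largest element length and the number of empty elements.

```r
# Hypothetical toy frame with two list-columns; values invented for illustration.
df <- data.frame(col1 = 1:4)
df$col4 <- list(NULL, NULL, NULL, "5011")
df$col5 <- list("491", NULL, NULL, "501")

# Names of the columns that are lists.
list_cols <- names(df)[vapply(df, is.list, logical(1))]

# Largest element length per list-column; anything above 1 means rows would be repeated.
max_len <- vapply(df[list_cols], function(z) max(lengths(z)), integer(1))
print(max_len)

# Number of length-0 elements per list-column; these need a placeholder such as NA.
n_empty <- vapply(df[list_cols], function(z) sum(lengths(z) == 0L), integer(1))
print(n_empty)
```

If `max_len` is 1 everywhere, no row will be duplicated and only the empty elements need attention.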

Another note: you have two (or more) list-columns. The solution becomes a lot murkier if you have length > 1 in one list-column and a different length in another list-column. If they are both the same length, then we may be good with the assumption that two list-columns with length-n elements in the same rows should expand the same number of rows. However, if they are both length > 1 but different lengths, then ... do we do a cartesian expansion? Truncation? Lots of ways this can go wrong. (I won't "fix" this condition here.)
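To find out whether this murky case occurs at all, something like the following base R sketch could flag the affected rows. The frame `df`, its columns `a` and `b`, and the values are all hypothetical.

```r
# Hypothetical toy frame; row 3 has elements of length 2 and 3 in the two list-columns.
df <- data.frame(id = 1:3)
df$a <- list("x", c("x", "y"), c("x", "y"))
df$b <- list("p", c("p", "q"), c("p", "q", "r"))

len_a <- lengths(df$a)
len_b <- lengths(df$b)

# A row is ambiguous when both elements are longer than 1 and their lengths differ.
conflict <- len_a > 1L & len_b > 1L & len_a != len_b
print(df$id[conflict])
```

This only detects the condition; what to do with such rows is still a decision about your data.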

For now, I'll assume:

  • length 0 elements should be NA (actually NA_character_);
  • length 2+ elements should expand the number of rows.

Again, if your data are all length-1 vectors then this will do no harm.

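# Assumption: `islist` is not defined in the original; here it is taken to be the names of the list-columns.
islist <- names(which(sapply(obj, is.list)))
# Replace each length-0 element with NA so that no row is dropped when unnesting.
obj[, (islist) := lapply(.SD, function(z) replace(z, !sapply(z, length), NA)), .SDcols = islist]
obj
#     col1   col2  col3   col4   col5
#    <int> <char> <int> <list> <list>
# 1:     1   c001    11     NA    491
# 2:     2   c001    12     NA     NA
# 3:     3   c001    13     NA     NA
# 4:     4   c001    14   5011    501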

From here, we can use tidyr::unnest:

tidyr::unnest(obj, c(col4, col5))
# # A tibble: 4 × 5
#    col1 col2   col3 col4  col5 
#   <int> <chr> <int> <chr> <chr>
# 1     1 c001     11 <NA>  491  
# 2     2 c001     12 <NA>  <NA> 
# 3     3 c001     13 <NA>  <NA> 
# 4     4 c001     14 5011  501  

Notice that this converted it from class data.table to class tbl_df; if you intend to continue using the data.table dialect of working on frames, then you'll need either as.data.table or setDT here.
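Once the list-columns are flattened, the frame is an ordinary one and write.table can handle it, which is what the question asks for. The sketch below is a base R alternative to the unnest step, assuming every list element has length 0 or 1. The helper `flatten`, the toy frame `df`, and the output path are hypothetical.

```r
# Hypothetical toy frame; values invented for illustration.
df <- data.frame(col1 = 1:4, col2 = "c001", col3 = 11:14)
df$col4 <- list(NULL, NULL, NULL, "5011")
df$col5 <- list("491", NULL, NULL, "501")

# Turn a list of length-0/1 elements into a character vector, using NA for empties.
# Assumption: no element is longer than 1; stop early if that does not hold.
flatten <- function(z) {
  stopifnot(all(lengths(z) <= 1L))
  vapply(z, function(e) if (length(e)) as.character(e) else NA_character_, character(1))
}

list_cols <- names(df)[vapply(df, is.list, logical(1))]
df[list_cols] <- lapply(df[list_cols], flatten)

# Hypothetical output path; replace with a real file name.
out_file <- tempfile(fileext = ".tsv")

# Tab-separated, no quotes or row names, and NA written as an empty field.
write.table(df, out_file, sep = "\t", quote = FALSE, row.names = FALSE, na = "")
```

The `na = ""` argument leaves empty cells blank, as in the layout the question shows. If you stay within data.table, `fwrite` is another option; it can write list-columns directly, joining the items within a cell with the separator given in its `sep2` argument.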

huangapple
  • Published on 2023-04-17 12:43:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76031777.html