Read an excel file with separate range of cells.

huangapple go评论64阅读模式
英文:

Read an excel file with separate range of cells

问题

我正在尝试使用最常用的函数(至少对我来说是)从xlsx文件中导入单独的单元格范围。我知道你可以导入数据然后选择,但我猜可能可以用另一种方式来做。

#选项1
dat <- readxl::read_xlsx("C:/Users/javie/Documents/Estadística/Vitamina B12/Exoma_B12.xlsx", col_names = F, sheet = 2, skip = 1, range = cell_cols(c("A:L", "AP:BA")))

#选项2
dat <- readxl::read_xlsx("C:/Users/javie/Documents/Estadística/Vitamina B12/Exoma_B12.xlsx", col_names = F, sheet = 2, skip = 1, range = c("A:L", "AP:BA"))

#选项3
dat <- readxl::read_excel("C:/Users/javie/Documents/Estadística/Vitamina B12/Exoma_B12.xlsx", col_names = F, sheet = 2, skip = 1, range = cell_cols(c("A:L", "AP:BA")))

#选项4
dat <- readxl::read_excel("C:/Users/javie/Documents/Estadística/Vitamina B12/Exoma_B12.xlsx", col_names = F, sheet = 2, skip = 1, range = c("A:L", "AP:BA"))
英文:

I am trying to import a separate range of cells from an xlsx file using the most common functions (at least for me). I know you can import data and the select, but I guess it's possible to do it the other way

#Option 1
dat &lt;- readxl::read_xlsx(&quot;C:/Users/javie/Documents/Estad&#237;stica/Vitamina B12/Exoma_B12.xlsx&quot;, col_names = F, sheet = 2, skip = 1, range = cell_cols(c(&quot;A:L&quot;, &quot;AP:BA&quot;)))

#Option 2
dat &lt;- readxl::read_xlsx(&quot;C:/Users/javie/Documents/Estad&#237;stica/Vitamina B12/Exoma_B12.xlsx&quot;, col_names = F, sheet = 2, skip = 1, range = c(&quot;A:L&quot;, &quot;AP:BA&quot;))


#Option 3
dat &lt;- readxl::read_excel(&quot;C:/Users/javie/Documents/Estad&#237;stica/Vitamina B12/Exoma_B12.xlsx&quot;, col_names = F, sheet = 2, skip = 1, range = cell_cols(c(&quot;A:L&quot;, &quot;AP:BA&quot;)))

#Option 4
dat &lt;- readxl::read_excel(&quot;C:/Users/javie/Documents/Estad&#237;stica/Vitamina B12/Exoma_B12.xlsx&quot;, col_names = F, sheet = 2, skip = 1, range = c(&quot;A:L&quot;, &quot;AP:BA&quot;))

I guess this is easy but as far as I was looking through SO there is no a clear solution

答案1

得分: 1

如果我理解问题正确,您想要读取工作簿的特定列。可以使用openxlsx2版本0.6.1来实现,如下所示:

library(openxlsx2)

# 创建示例文件
tmp <- temp_xlsx()
write_xlsx(list(head(iris), head(mtcars)), tmp)

# 从第二个工作表读取,从第2行开始,不包括列名
# 即使未写入,列L也包括在内
read_xlsx(tmp, sheet = 2, startRow = 2, colNames = FALSE, cols = c("A:C", "J:L"))
#>      A B   C J K  L
#> 2 21.0 6 160 4 4 NA
#> 3 21.0 6 160 4 4 NA
#> 4 22.8 4 108 4 1 NA
#> 5 21.4 6 258 3 1 NA
#> 6 18.7 8 360 3 2 NA
#> 7 18.1 6 225 3 1 NA

希望这对您有所帮助。

英文:

If I understand the question correctly, you want to read specific columns of a workbook. This can be achieved using openxlsx2 release 0.6.1 as follows:

library(openxlsx2)

# create example file
tmp &lt;- temp_xlsx()
write_xlsx(list(head(iris), head(mtcars)), tmp)

# read from second sheet, start at row 2 and avoid column names
# column L is included even though it is not written
read_xlsx(tmp, sheet = 2, startRow = 2, colNames = FALSE, cols = c(&quot;A:C&quot;, &quot;J:L&quot;))
#&gt;      A B   C J K  L
#&gt; 2 21.0 6 160 4 4 NA
#&gt; 3 21.0 6 160 4 4 NA
#&gt; 4 22.8 4 108 4 1 NA
#&gt; 5 21.4 6 258 3 1 NA
#&gt; 6 18.7 8 360 3 2 NA
#&gt; 7 18.1 6 225 3 1 NA

答案2

得分: 0

readxl 差不多能满足这个用例 (link),但不支持非连续的列(即 c('a:b','e:f') )。

例如:

library(readxl)
library(cellranger)

example_file <- readxl_example("datasets.xlsx")
read_excel(example_excel, range=cell_cols('c:d'), sheet=2, skip=1, col_names=TRUE) -> foo

输出:

> foo
# A tibble: 32 × 2
    disp    hp
   <dbl> <dbl>
 1  160    110
 2  160    110
 3  108     93
 4  258    110
 5  360    175
 6  225    105
 7  360    245
 8  147.    62
 9  141.    95
10  168.   123
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows

Openxlsx::read.xlsx 也基本可以工作,但它会弄乱列的顺序:

> read.xlsx(example_excel, sheet=2, cols=c(1,3,4)) 

        mpg  disp  hp
    1  21.0 160.0 110
    2  21.0 160.0 110
    3  22.8 108.0  93
    4  21.4 258.0 110
    5  18.7 360.0 175
    6  18.1 225.0 105
    7  14.3 360.0 245

mpg 是原始数据中的第1列,它始终作为最左边的列返回。

作为第三个选项,相对来说也可以将 read_excel 包装成另一个函数,支持非连续的列。以下是我的尝试:

library(tidyverse); library(cellranger); library(readxl)
expand_sequence <- function(vec) {
  vec <- strsplit(vec, ":")
  
  vec <- unlist(lapply(vec, function(x) {
    if(length(x) > 1){
      sapply(seq(from = as.integer(charToRaw(as.character(x[1]))), 
                 to = as.integer(charToRaw(as.character(x[2])))), 
             function(y) rawToChar(as.raw(y)))
    } else {
      as.character(x)
    }
  }))
  
  return(vec)
}

better_read_excel <- function(file, columns, sheet, skip){
  columnvec <- expand_sequence(columns)
  
  data_list <- lapply(columnvec, function(i) {
    if(i == columnvec[1]) {
      table <- read_excel(file, range=cell_cols(i), sheet=sheet, skip=skip)
    } else {
      table <- read_excel(file, range=cell_cols(i), sheet=sheet, skip=skip, col_names=TRUE)
    }
    return(table)
  })
  
  table <- do.call(cbind, data_list)
  
  return(table)
}

better_read_excel(example_file, columns=c('a:c','h','i'), sheet=2, skip=1)

> better_read_excel(example_file, columns=c('a:c','h','i'), sheet=2, skip=1)
    mpg cyl  disp vs am
1  21.0   6 160.0  0  1
2  21.0   6 160.0  0  1
3  22.8   4 108.0  1  1
4  21.4   6 258.0  1  0
5  18.7   8 360.0  0  0
英文:

readxl almost covers this usecase (link), but does not support non-contiguous columns (i.e. c(&#39;a:b&#39;,&#39;e:f&#39;) )

For example:

library(readxl)
library(cellranger)

example_file &lt;- readxl_example(&quot;datasets.xlsx&quot;)
read_excel(example_excel, range=cell_cols(&#39;c:d&#39;), sheet=2, skip=1, col_names=TRUE) -&gt; foo

Output:

&gt; foo
# A tibble: 32 &#215; 2
    disp    hp
   &lt;dbl&gt; &lt;dbl&gt;
 1  160    110
 2  160    110
 3  108     93
 4  258    110
 5  360    175
 6  225    105
 7  360    245
 8  147.    62
 9  141.    95
10  168.   123
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows

Openxlsx::read.xlsx also kind of works, but it messes up the column ordering:

&gt; read.xlsx(example_excel, sheet=2, cols=c(1,3,4)) 

        mpg  disp  hp
    1  21.0 160.0 110
    2  21.0 160.0 110
    3  22.8 108.0  93
    4  21.4 258.0 110
    5  18.7 360.0 175
    6  18.1 225.0 105
    7  14.3 360.0 245

mpg is column 1 in the raw data, and it is always returned as the left-most column.

As a third option, it is also relatively straightforward to wrap read_excel into another function, that supports non-contiguous columns. Here's my attempt at that:

library(tidyverse); library(cellranger); library(readxl)
expand_sequence &lt;- function(vec) {
  vec &lt;- strsplit(vec, &quot;:&quot;)
  
  vec &lt;- unlist(lapply(vec, function(x) {
    if(length(x) &gt; 1){
      sapply(seq(from = as.integer(charToRaw(as.character(x[1]))), 
                 to = as.integer(charToRaw(as.character(x[2])))), 
             function(y) rawToChar(as.raw(y)))
    } else {
      as.character(x)
    }
  }))
  
  return(vec)
}

    better_read_excel &lt;- function(file, columns, sheet, skip){
  columnvec &lt;- expand_sequence(columns)
  
  data_list &lt;- lapply(columnvec, function(i) {
    if(i == columnvec[1]) {
      table &lt;- read_excel(file, range=cell_cols(i), sheet=sheet, skip=skip)
    } else {
      table &lt;- read_excel(file, range=cell_cols(i), sheet=sheet, skip=skip, col_names=TRUE)
    }
    return(table)
  })
  
  table &lt;- do.call(cbind, data_list)
  
  return(table)
}

better_read_excel(example_file, columns=c(&#39;a:c&#39;,&#39;h&#39;,&#39;i&#39;), sheet=2, skip=1)

&gt; better_read_excel(example_file, columns=c(&#39;a:c&#39;,&#39;h&#39;,&#39;i&#39;), sheet=2, skip=1)
    mpg cyl  disp vs am                                                                                                               
1  21.0   6 160.0  0  1
2  21.0   6 160.0  0  1
3  22.8   4 108.0  1  1
4  21.4   6 258.0  1  0
5  18.7   8 360.0  0  0

huangapple
  • 本文由 发表于 2023年5月17日 16:05:17
  • 转载请务必保留本文链接:https://go.coder-hub.com/76269828.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定