Error in checkForRemoteErrors(val) : 7 nodes produced errors; first error: could not find function "fread"
Question
All of the code included in this question is from the script called "LASSO code (Version for Antony)" in my GitHub repo for this project. You can run it on the file folder called "last 40" to verify my claim that it does run on limited-size datasets. If you really feel like going the extra mile, message me here and I'll share a 10k-scale folder full of zipped datasets via OneDrive or Google Drive (whichever you prefer) so you can also verify that the same script doesn't work on folders of that volume.
This is absolutely going to drive me mad, I swear. I have been using the parLapply call below without issue for a week now, and starting several hours ago it began giving me this error:
> datasets <- parLapply(CL, paths_list, function(i) {fread(i)})
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: could not find function "fread"
Here is the rest of the script I am working with up until this line (after the lines I used to load all of the libraries I utilize):
# these 2 lines together create a simple character list of
# all the file names in the file folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/12th & 13th 10k"
paths_list <- list.files(path = folderpath, full.names = T, recursive = T)
# reformat the names of each of the csv file formatted datasets
DS_names_list <- basename(paths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)
# sort both of the list of file names so that they are in the proper order
my_order = DS_names_list |>
# split apart the numbers, convert them to numeric
strsplit(split = "-", fixed = TRUE) |> unlist() |> as.numeric() |>
# get them in a data frame
matrix(nrow = length(DS_names_list), byrow = TRUE) |> as.data.frame() |>
# get the appropriate ordering to sort the data frame
do.call(order, args = _)
DS_names_list = DS_names_list[my_order]
paths_list = paths_list[my_order]
# this line reads all of the data in each of the csv files
# using the name of each store in the list we just created
CL <- makeCluster(detectCores() - 2L)
clusterExport(CL, c('paths_list'))
library(data.table)
system.time( datasets <- parLapply(CL, paths_list, fread) )
After looking up the documentation for the 3rd time today, I am thinking of trying:
system.time( datasets <- parLapply(CL, paths_list, fun = fread) )
Will that work??
P.S. Here are all of the libraries I load as the first thing I do:
# load all necessary packages
library(plyr)
library(dplyr)
library(tidyverse)
library(readr)
library(stringi)
library(purrr)
library(stats)
library(leaps)
library(lars)
library(elasticnet)
library(data.table)
library(parallel)
Also, I have already tried the following and none worked:
datasets <- parLapply(CL, paths_list, function(i) {fread(i)})
datasets <- parLapply(CL, paths_list, function(i) {fread[i]})
datasets <- parLapply(CL, paths_list, function(i) {fread[[i]]})
datasets <- parLapply(CL, paths_list, \(ds)
{fread(ds)})
system.time( datasets <- lapply(paths_list, fread) )
And when I run that last one, datasets <- lapply(paths_list, fread), I get the same error. That was exactly the original, successful version I ran at the beginning of last week; I only switched to the parallel version because the folder I am importing/loading has 260,000 csv-formatted datasets in it. So, two versions which have already worked dozens of times just stopped working suddenly today!
Answer 1
Score: 1
See if this works consistently. It hasn't failed yet on my Windows desktop with 20k files (I copied and pasted your 40 files a bunch of times). It has run 5 times, and I restarted the R session and RStudio each time.
It's too bad that the problem arises non-deterministically, but that's part of the parallel-computation game. See if this stripped-down example runs consistently.
Notice I'm avoiding library() to eliminate naming collisions caused by packages with identically named functions; every call is namespace-qualified instead. Also, I close the cluster connection at the end.
# Enumerate files
paths_list <-
"~/Documents/delete-me/EER-Research-Project-main/20k" |>
list.files(full.names = T, recursive = T)
# Establish cluster
CL <- parallel::makeCluster(parallel::detectCores() - 2L)
parallel::clusterExport(CL, c('paths_list'))
# Read files
system.time({
datasets <- parallel::parLapply(CL, paths_list, data.table::fread)
})
# Stop cluster
parallel::stopCluster(CL)
#> user system elapsed
#> 7.09 1.22 101.93
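Another common fix (my suggestion, not something from the answer above) is to keep the bare fread() call but attach data.table on every worker first with clusterEvalQ(), since worker processes start with a fresh R session and don't inherit the packages loaded in the main session. A minimal self-contained sketch, using two toy csv files in tempdir() as a stand-in for the question's paths_list:

```r
library(parallel)
library(data.table)

# Toy stand-in for the question's paths_list: two small csv files in tempdir()
paths_list <- replicate(2, tempfile(fileext = ".csv"))
for (p in paths_list) fwrite(data.table(x = 1:3, y = 4:6), p)

CL <- makeCluster(2L)
clusterEvalQ(CL, library(data.table))          # attach data.table on each worker
datasets <- parLapply(CL, paths_list, fread)   # bare fread() now resolves on the workers
stopCluster(CL)                                # release the workers when done

sapply(datasets, nrow)  # each toy file yields 3 rows
```

Either approach (namespace-qualifying every call, or loading the package on the workers) addresses the same root cause: "could not find function" errors from the nodes mean the function isn't visible in the workers' sessions, not in yours.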