Splitting data into three sets with balanced data

Question

EDIT: OK, so now I have my train, validation and test sets, with all rows of a given patient kept in the same set. But when I check the sets with a plot, I see that the class imbalance of the original dataset (for the outcome LesionResponse, 1: 70% and 0: 30%) is not well preserved. Indeed, in the training data I have a nearly 55/45 split, which is not what I want. How can I correct this?

summary(train$LesionResponse)
#   0   1
# 159 487
summary(validation$LesionResponse)
#  0   1
# 33 170
summary(test$LesionResponse)
#  0   1
# 77 126

Hi all,
I have a dataset (an example is shown below) and I must build a predictive model for the outcome "LesionResponse".
As a first step, I split my data into train (60%), validation and test (20% each) sets.
I have a big problem: many rows of my table belong to the same patients, so to avoid bias I must take the PatientIDs into account when splitting.
I am stuck because I don't know how to split my data in three while keeping the rows that belong to the same patient together.

Here is my data (dput output):

structure(list(PatientID = c("P1", "P1", "P1", 
"P2", "P3", "P3", "P4", "P5", 
"P5", "P6"), LesionResponse = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 
    2L, 1L, 2L), .Label = c("0", 
    "1"), class = "factor"), pyrad_tum_original_shape_LeastAxisLength = c(19.7842995242803, 
    15.0703960571122, 21.0652247652897, 11.804125918871, 27.3980336338908, 
    17.0584330264122, 4.90406343942677, 4.78480430022189, 6.2170232078547, 
    5.96309532740722, 5.30141540007441), pyrad_tum_original_shape_Sphericity = c(0.652056853392657, 
    0.773719977240238, 0.723869070051882, 0.715122964970338, 
    0.70796498824535, 0.811937882810929, 0.836458991713367, 0.863337931630415, 
    0.851654860256904, 0.746212862162174), pyrad_tum_log.sigma.5.0.mm.3D_firstorder_Skewness = c(0.367453961973625, 
    0.117673346718817, 0.0992025164349288, -0.174029385779302, 
    -0.863570016875989, -0.8482193060411, -0.425424618080682, 
    -0.492420174157913, 0.0105111292451967, 0.249865833210199), pyrad_tum_log.sigma.5.0.mm.3D_glcm_Contrast = c(0.376932105256115, 
    0.54885738172596, 0.267158344601612, 2.90094719958076, 0.322424096161189, 
    0.221356030145403, 1.90012334870722, 0.971638740404501, 0.31547550396399, 
    0.653999340294952), pyrad_tum_wavelet.LHH_glszm_GrayLevelNonUniformityNormalized = c(0.154973213866752, 
    0.176128379241556, 0.171129002059539, 0.218343919352019, 
    0.345985943932352, 0.164905080489496, 0.104536489151874, 
    0.1280276816609, 0.137912385073012, 0.133420904484894), pyrad_tum_wavelet.LHH_glszm_LargeAreaEmphasis = c(27390.2818110851, 
    11327.7931034483, 51566.7948885976, 7261.68702290076, 340383.536555142, 
    22724.7792207792, 45.974358974359, 142.588235294118, 266.744186046512, 
    1073.45205479452), pyrad_tum_wavelet.LHH_glszm_LargeAreaLowGrayLevelEmphasis = c(677.011907073653, 
    275.281153810458, 582.131636238695, 173.747506476692, 6140.73990175018, 
    558.277670638306, 1.81042257642817, 4.55724031114589, 6.51794350173746, 
    19.144924585586), pyrad_tum_wavelet.LHH_glszm_SizeZoneNonUniformityNormalized = c(0.411899490603372, 
    0.339216399209913, 0.425584323452468, 0.355165782879786, 
    0.294934042125209, 0.339208410636982, 0.351742274819198, 
    0.394463667820069, 0.360735532720389, 0.36911240382811)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", 
"data.frame"))

I was thinking about a loop that would split the unique(PatientID) values in three, with 60% in the train set, and repeat the split again and again as long as the outcome is not balanced across the sets. I was thinking of accepting a split once the proportions fall within some interval...
How would you do this?
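
The retry loop described above could be sketched in base R as follows. This is only an illustration under stated assumptions: the function name split_by_patient, the tolerance tol, the cap max_tries and the simulated demo data are hypothetical and not part of the original post, and the outcome is assumed to be coded "0"/"1" as in the dput output.

```r
# Hypothetical helper: split patients 60/20/20 and retry until the share of
# outcome == "1" in every set is within `tol` of the overall share.
split_by_patient <- function(data, id = "PatientID", outcome = "LesionResponse",
                             props = c(train = 0.6, validation = 0.2, test = 0.2),
                             tol = 0.05, max_tries = 1000) {
  ids <- unique(as.character(data[[id]]))
  target <- mean(data[[outcome]] == "1")
  # fixed number of patients per set; the remainder goes to the first set
  sizes <- round(length(ids) * props[-1])
  sizes <- c(length(ids) - sum(sizes), sizes)
  for (i in seq_len(max_tries)) {
    # shuffle the set labels and attach one label to each patient
    label <- setNames(sample(rep(names(props), sizes)), ids)
    sets <- split(data, factor(label[as.character(data[[id]])],
                               levels = names(props)))
    share <- sapply(sets, function(s) mean(s[[outcome]] == "1"))
    # isTRUE() also rejects splits where a set is empty (share is NaN)
    if (isTRUE(all(abs(share - target) <= tol))) {
      return(sets)
    }
  }
  stop("no split within tolerance after ", max_tries, " tries")
}

# Simulated data (assumption): 50 patients, 500 rows, about 70% of outcomes are 1
set.seed(1)
demo <- data.frame(
  PatientID = sample(sprintf("P%d", 1:50), 500, replace = TRUE),
  LesionResponse = factor(rbinom(500, 1, 0.7), levels = c(0, 1))
)
sets <- split_by_patient(demo)
sapply(sets, function(s) mean(s$LesionResponse == "1"))
```

With few patients or a tight tolerance, no acceptable split may exist, which is why the loop is capped and fails loudly instead of running forever.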

Answer 1

Score: 2

Edit: I misunderstood how you wished to handle PatientIDs. The original answer is kept at the bottom, but note that stratifying on PatientID aims to put equivalent proportions of each PatientID in each split, which is not what you want here. You should use the group_ splitting functions indicated by @Rui Barradas.

library(tidymodels)

set.seed(217)
df_split <- group_initial_split(df, PatientID, prop = 4/5)
df_training <- training(df_split)
df_testing <- testing(df_split)
# note: this returns an rset object holding one training/validation split, not a data frame
df_validation <- group_validation_split(df_training, PatientID, prop = 3/4)

Original reply
In the tidymodels framework, you can opt to stratify the sampling using your PatientID variable. The resulting resamples will then contain equivalent proportions of each PatientID.

To create your desired splits you could first split the data 80:20 training:testing, then split the training set 75:25 into training:validation.

library(tidymodels)

set.seed(217)
df_split <- initial_split(df, prop = 4/5, strata = PatientID)
df_training <- training(df_split)
df_testing <- testing(df_split)
df_validation <- validation_split(df_training, prop = 3/4, strata = PatientID)
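
The two-step proportions used above (80:20, then 75:25 of the training part) multiply out to the requested 60/20/20 split. A small base R check of that arithmetic, added here only as an illustration (the names first, second and final are made up for this sketch):

```r
# Fractions of the full data implied by an 80:20 split followed by 75:25
first  <- c(training = 4/5, testing = 1/5)
second <- c(training = 3/4, validation = 1/4)

final <- c(
  training   = first[["training"]] * second[["training"]],
  validation = first[["training"]] * second[["validation"]],
  testing    = first[["testing"]]
)
final
```

The same reasoning gives the prop values for any other target: the second prop is the desired training share divided by the share kept in the first step.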

Answer 2

Score: 1

Here is a way with the rsample package.
First split the data into test and the remaining data (named train in the code below), keeping all rows of each PatientID in the same subset, then split train again into train and validation.

library(rsample)

set.seed(2023)
g <- group_initial_split(df1, group = PatientID, prop = 0.8)
train <- training(g)
test <- testing(g)
g <- group_initial_split(train, group = PatientID, prop = 3/4)
train <- training(g)
validation <- testing(g)

# check data split proportions
df_list <- list(train = train, validation = validation, test = test)
sapply(df_list, nrow)
#>      train validation       test 
#>        600        199        201

# this shows that all groups belong to one subset only
lapply(df_list, \(x) unique(x[[1]]))
#> $train
#> [1] "P5"  "P9"  "P8"  "P3"  "P10" "P4" 
#> 
#> $validation
#> [1] "P2" "P7"
#> 
#> $test
#> [1] "P1" "P6"

Created on 2023-02-17 with reprex v2.0.2


Test data

set.seed(2023)
p <- sprintf("P%d", 1:10)
n <- 1e3
df1 <- data.frame(
  PatientID = sample(p, n, TRUE),
  x = rnorm(n)
)


Answer 3

Score: 1

You could use a one-liner that samples one of 1:3 for each unique patient ID and splits df by that.

set.seed(42)
res <- split(df, with(df, ave(id, id, FUN=\(x) sample.int(3, 1, prob=c(.6, .2, .2)))))

Tests:

## test proportions (should approx. be [.6, .2, .2])
proportions(sapply(res, \(x) length(unique(x$id)))) |> round(2)
#    1    2    3 
# 0.53 0.25 0.22 

## test uniqueness
stopifnot(length(Reduce(intersect, lapply(res, `[[`, 'id'))) == 0)

Update

To get more stable proportions, we could use fixed group sizes by repeating each of 1:3 the number of times given in vector p.

len <- length(u <- unique(df$id))
p1 <- c(.2, .2)
rlp <- round(len*p1)
p <- c(len - sum(rlp), rlp)
set.seed(42)
a <- setNames(rep.int(1:3, p), sample(u))

res <- split(df, a[match(df$id, names(a))])  ## this line splits the df

proportions(sapply(res, \(x) length(unique(x$id))))
#   1   2   3 
# 0.6 0.2 0.2 

## test uniqueness
stopifnot(length(Reduce(intersect, lapply(res, `[[`, 'id'))) == 0)
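
The fixed-size assignment above could also be applied within outcome strata to keep the class balance closer across the three sets. The sketch below is an extension, not part of the original answer, and rests on assumptions: each patient is given a single stratum by majority vote over their rows (les == 1 in at least half of them), and the helper name assign_sets is hypothetical. It balances patients per stratum, so row-level proportions are only approximately preserved.

```r
# Same simulated data as in the answer's "Data" block
set.seed(42)
n <- 200; np <- 100
df <- data.frame(id = paste0("P", as.integer(as.factor(sort(sample.int(np, n, replace = TRUE))))),
                 les = sample(0:1, n, replace = TRUE),
                 pyr = runif(n))

# One stratum per patient: 1 if at least half of the patient's rows have les == 1
maj <- tapply(df$les, df$id, function(x) as.integer(mean(x) >= 0.5))

# Within one stratum, shuffle the patients and hand out labels 1:3
# in fixed 60/20/20 counts (same rounding rule as in the update above)
assign_sets <- function(ids, p1 = c(.2, .2)) {
  rlp <- round(length(ids) * p1)
  p <- c(length(ids) - sum(rlp), rlp)
  setNames(rep.int(1:3, p), ids[sample.int(length(ids))])
}

set.seed(42)
a <- unlist(unname(lapply(split(names(maj), maj), assign_sets)))

res <- split(df, a[match(df$id, names(a))])

# share of les == 1 per set, for inspection
sapply(res, function(x) mean(x$les))
```

If every patient had a single outcome value, the majority vote would be exact; with mixed patients it is a compromise, and a tolerance check on the resulting shares is still worth doing.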

Data:

set.seed(42)
n <- 200; np <- 100
df <- data.frame(id=paste0('P', as.integer(as.factor(sort(sample.int(np, n, replace=TRUE))))),
                 les=sample(0:1, n, replace=TRUE),
                 pyr=runif(n))

huangapple
  • Posted on 2023-02-18 01:48:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75487663.html