What is the quickest way to draw a random sample from a SAS dataset?

Question

I have a very large SAS data set (sas7bdat, 136 billion rows, 636 columns). I just need a random sample of 10k rows to test code; it does not need to be statistically representative.

The code below seems to work, but it takes far too long to run.

libname input "/some_linux_path/";

proc surveyselect data=input.large_data
	out=random_sample
	method=srs
	sampsize=10000;
run;

Is there a quicker way? Perhaps just take the first 10k observations?

Answer 1

Score: 3

The easiest way to take the first 10000 observations is through a data step:

data sequential_sample;
    set have(obs=10000);
run;
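
If the first rows are unrepresentative for your test (for example, the file is sorted by date), a contiguous block from elsewhere in the file can be read the same way. This is only a sketch: the starting row of 1,000,001 is an arbitrary choice for illustration, and have stands in for your dataset as in the code above.

```sas
/* Sketch: read a 10,000-row block that starts at row 1,000,001.
   The starting row is an arbitrary, illustrative choice.
   Note that obs= is the number of the LAST row to read, not a row count. */
data block_sample;
    set have(firstobs=1000001 obs=1010000);
run;
```

Because obs= names the last row rather than a count, the block above covers rows 1,000,001 through 1,010,000, which is exactly 10,000 rows.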

If you want to efficiently sample 10000 random rows, use direct access with the point= option of the set statement. We'll draw a random number from 1 to n, where n is the number of observations in the dataset, and read the row with that number. This method samples with replacement, so the same row can be drawn more than once; that is unlikely to be a problem when the sample is a tiny fraction of the dataset, as with 10000 rows out of 136 billion.

data random_sample;
    do i = 1 to 10000;
        rand = ceil(rand('uniform', 0, n) );
        set have point=rand nobs=n;
        output;
    end;

    stop;
    drop i;
run;

Just be sure your dataset has no variables named i, n, or rand. If it does, use other names for these in the data step.

This code looks a little strange because n is used before the set statement that defines it. That works because of how the nobs= option behaves: SAS reads the observation count from the dataset's header when it compiles the data step, so n already holds its value before the first statement executes.
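
To see the compile-time behavior of nobs= in isolation, here is a minimal sketch. It assumes have is a native SAS dataset (not a view), and n_rows is a hypothetical macro variable name chosen for this example.

```sas
/* Sketch: show that nobs= is populated at compile time.
   The SET statement below never executes because of "if 0",
   yet n already holds the row count. */
data _null_;
    if 0 then set have nobs=n;
    put 'Rows in have: ' n;
    call symputx('n_rows', n);  /* n_rows is a hypothetical macro variable name */
    stop;
run;

%put NOTE: have contains &n_rows. rows;
```

No rows are read here, so this step should finish almost immediately even on a very large dataset.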

Now let's say you don't want to have the possibility of a repeat draw when randomly sampling from a dataset. We can modify the above code to continue to choose a random number until we get one that we have not seen yet.

/* Set sample size */
%let s = 10000;

data random_sample;
    array rand_nums[&s.] _TEMPORARY_;

    do i = 1 to &s.;

        /* Initialize a flag that keeps a loop going until it's 0 */
        flag_repeat_draw = 1;

        /* Keep generating a random number until we find one that we 
           haven't seen yet */
        do while(flag_repeat_draw);
            rand = ceil(rand('uniform', 0, n) ); 

            /* If we haven't seen this number, save it and set the
               repeat draw flag to 0 to force the loop to end */
            if(rand NOT IN rand_nums) then do;
                rand_nums[i]     = rand;
                flag_repeat_draw = 0;
            end;
        end;

        set have point=rand nobs=n;
        output;
    end;

    stop;
    drop i flag_repeat_draw;
run;

Note that this method becomes increasingly inefficient the closer the sample size is to the actual number of observations.
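
The array lookup above also rescans every saved number on each draw. One possible variation, sketched here and not part of the original approach, keeps the drawn row numbers in a hash object so each duplicate check is a single lookup. The names seen, draw, and rc are introduced for this example, and the seed 12345 is arbitrary; it is only there so that repeated runs pick the same rows. It assumes the sample size is well below the number of rows, otherwise the redraw loop can run for a very long time (or forever if the sample size exceeds n).

```sas
/* Set sample size */
%let s = 10000;

data random_sample;
    /* Hash object that remembers every row number drawn so far */
    if _n_ = 1 then do;
        declare hash seen();
        seen.defineKey('draw');
        seen.defineDone();
    end;

    /* Fix the seed so the same rows are drawn on every run (12345 is arbitrary) */
    call streaminit(12345);

    do i = 1 to &s.;
        /* Redraw until the row number is not already in the hash;
           check() returns 0 when the key is found */
        do until (seen.check() ne 0);
            draw = ceil(rand('uniform', 0, n));
        end;
        rc = seen.add();   /* remember this row number */

        rand = draw;       /* rand is the point= variable, as above */
        set have point=rand nobs=n;
        output;
    end;

    stop;
    drop i rc draw;
run;
```

As with the earlier steps, make sure have contains no variables named i, rc, draw, n, or rand.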

huangapple
  • Posted on March 7, 2023, 02:28:22
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75654523.html