March 7, 2023, 02:28:22


What is the quickest way to draw a random sample from SAS dataset?

Question

I have a very large SAS data set (sas7bdat, 136 billion rows, 636 columns). I just need a random sample of 10k rows to test code - does not need to be statistically representative.

I am running code which I guess works but takes so long to run.

libname input "/some_linux_path/";

proc surveyselect data=input.large_data
	out=random_sample
	method=srs
	sampsize=10000;
run;

Is there a quicker way? Perhaps just take the first 10k observations?

Answer 1

Score: 3

The easiest way to take the first 10000 observations is through a data step:

data sequential_sample;
    set have(obs=10000);
run;

If you want to efficiently sample 10000 random rows, use direct access with the point= option of the set statement. We'll draw a random number from 1 to n, where n is the number of observations in the dataset. The random number corresponds to a row in the dataset, and SAS reads only that row instead of scanning the whole file. This method samples with replacement, so the same row can be drawn twice, but that is unlikely to be a problem when the sample is a tiny fraction of the dataset.

data random_sample;
    do i = 1 to 10000;
        rand = ceil(rand('uniform', 0, n) );
        set have point=rand nobs=n;
        output;
    end;

    stop;
    drop i;
run;

Just be sure i, n, and rand are not in your dataset. If they are, use other variable names.
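
If you need the same test rows on every run, one option is to seed the random number stream first. The sketch below is a small variation on the step above; the seed value 12345 is an arbitrary choice for illustration, and the dataset name have is carried over from the example.

```sas
data random_sample;
    /* Assumption: 12345 is an arbitrary seed; any positive integer works.
       It must be set before the first call to rand(). */
    call streaminit(12345);

    do i = 1 to 10000;
        /* Pick a row number between 1 and n (with replacement) */
        rand = ceil(rand('uniform', 0, n) );
        set have point=rand nobs=n;
        output;
    end;

    stop;
    drop i;
run;
```

Without a seed, SAS initializes the stream from the system clock, so each run selects different rows.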

This code is a little strange because n is used before we define it. This is a feature of the nobs= option: SAS assigns the number of observations to n when the data step is compiled, before any statement runs and before the first row is read in.
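
A quick way to see this behavior is a step that never actually reads a row. This is a minimal sketch, again assuming a dataset named have; the set statement sits behind a condition that is always false, yet n is still populated.

```sas
data _null_;
    /* The set statement never executes, but nobs= is filled in
       at compile time, so n already holds the row count */
    if 0 then set have nobs=n;

    put 'Observations in HAVE: ' n;
    stop;
run;
```

The row count is written to the log without any data being read.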

Now let's say you don't want to have the possibility of a repeat draw when randomly sampling from a dataset. We can modify the above code to continue to choose a random number until we get one that we have not seen yet.

/* Set sample size */
%let s = 10000;

data random_sample;
    array rand_nums[&s.] _TEMPORARY_;

    do i = 1 to &s.;

        /* Initialize a flag that keeps a loop going until it's 0 */
        flag_repeat_draw = 1;

        /* Keep generating a random number until we find one that we 
           haven't seen yet */
        do while(flag_repeat_draw);
            rand = ceil(rand('uniform', 0, n) ); 

            /* If we haven't seen this number, save it and set the
               repeat draw flag to 0 to force the loop to end */
            if(rand NOT IN rand_nums) then do;
                rand_nums[i]     = rand;
                flag_repeat_draw = 0;
            end;
        end;

        set have point=rand nobs=n;
        output;
    end;

    stop;
    /* Drop the helper variables so they don't end up in the sample */
    drop i flag_repeat_draw;
run;

Note that this method becomes increasingly inefficient the closer the sample size is to the actual number of observations.
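
To put a rough number on that, here is a back-of-the-envelope estimate (not part of the original answer). On the i-th draw, i - 1 row numbers are already taken, so a fresh one comes up with probability (n - i + 1)/n and takes n/(n - i + 1) attempts on average. Summing over a sample of size s:

```latex
\mathbb{E}[\text{attempts}]
  = \sum_{i=1}^{s} \frac{n}{n - i + 1}
  \approx s + \frac{s(s-1)}{2n}
  \qquad (s \ll n)
```

For s = 10,000 and n = 136 billion, the extra term is about 0.0004, so a redraw almost never happens and the simpler with-replacement version will almost always return 10,000 distinct rows anyway. At the other extreme, when s = n, the sum grows to roughly n ln n attempts.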
