Storing PROC MEANS data and using mean in another dataset
Question
Is there a way to store the output of PROC MEANS so that the mean and standard deviation can be used in another dataset? I want to use the mean and standard deviation to filter the outliers out of the dataset. So far I have just copied and pasted the values from the PROC MEANS output, but is there a way to do this automatically?
This is what I have been trying, but I cannot use the variables orgMean and orgSTD in the outliers data step.
DATA original;
PROC MEANS DATA = original;
VAR importantVar;
OUTPUT OUT=originalMeans;
MEAN = orgMean
STD = orgSTD;
Now I want to use orgMean and orgSTD to make a new dataset of outliers from the original data:
DATA outliers;
SET original;
IF (importantVar > orgMean + 2*STD)
Is there an easier way to do this? I am very new to SAS and cannot find the answer by searching online. I would prefer to use the variables from PROC MEANS, as that could be useful in the other code I need to write.
Answer 1
Score: 1
You are asking how to combine an existing dataset with one that has only one observation. An easy way to do that is to SET the one-observation dataset only once, on the first iteration of the data step.
data outliers;
set original;
if _n_=1 then set originalMeans;
if importantVar > (orgMean + 2*orgSTD);
run;
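For completeness, here is a minimal end-to-end sketch that also includes the summary step. It assumes the dataset and variable names from the question (original, importantVar, orgMean, orgSTD); the NOPRINT option and the DROP statement are additions for tidiness, not something the question asked for.

```sas
/* Summarize once. The OUTPUT statement is a single statement,
   so there is no semicolon until after the last statistic. */
proc means data=original noprint;
  var importantVar;
  output out=originalMeans mean=orgMean std=orgSTD;
run;

data outliers;
  set original;
  /* Read the one-row summary on the first iteration only. Variables
     read with SET are retained, so orgMean and orgSTD stay available
     for every later observation. */
  if _n_ = 1 then set originalMeans;
  /* Subsetting IF: keep only rows more than 2 SD above the mean */
  if importantVar > orgMean + 2*orgSTD;
  /* _TYPE_ and _FREQ_ come along from the PROC MEANS output dataset */
  drop _type_ _freq_;
run;
```

Because the summary values are ordinary dataset variables here, the same pattern works for any later data step that needs them.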
Answer 2
Score: 0
The easiest way is to store them in macro variables with symputx() in a subsequent data step. Macros and macro variables are an advanced but important concept in SAS. I recommend reading about them in Introduction to SAS Macro Language.
In short, the SAS macro language is basically a very fancy copy/paste where you can store text in variables and use them in open code. You call those variables with an &, like this: &foo. Here &foo is a macro variable that could store some text. You can give it text in a number of ways, such as:
%let foo = bar;
If you type %put &foo; it will write bar to the log. Another way to store text in macro variables is through the call symputx() routine in the data step. We're going to use a data step to save your mean and std into two macro variables, then pass those into your final data step. Here's what that looks like.
PROC MEANS DATA = original;
VAR importantVar;
OUTPUT OUT=originalMeans
MEAN = orgMean
STD = orgSTD
;
run;
data _null_;
set originalMeans;
call symputx('orgMean', orgMean);
call symputx('std', orgStd);
run;
data want;
set original;
where importantVar > &orgMean. + 2*&std.;
run;
What's happening here?
call symputx() saves the variables orgMean and orgStd into the macro variables &orgMean and &std respectively. We can use these wherever we want in open code.
Note that our where statement looks like this:
data want;
set original;
where importantVar > &orgMean + 2*&std;
run;
Suppose your mean is 2 and std is 5. As far as SAS is concerned, it looks like this when the macro variables resolve:
data want;
set original;
where importantVar > 2 + 2*5;
run;
In other words, &orgMean resolves to 2 and &std resolves to 5. This happens before the data step is compiled. If you take a SAS macro language course, you'll learn all about these concepts and how powerful they are.
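As an alternative to the data _null_ step, PROC SQL can write the statistics straight into macro variables with its INTO clause. This is a sketch rather than part of the answer above: it assumes the dataset and variable names from the question, and the TRIMMED keyword and the &= form of %put assume SAS 9.3 or later. The explicit best32. format is there because INTO otherwise applies a short default numeric format and can lose precision.

```sas
proc sql noprint;
  /* One row of summary statistics, stored as text in two macro variables */
  select mean(importantVar) format=best32.,
         std(importantVar)  format=best32.
    into :orgMean trimmed, :std trimmed
    from original;
quit;

/* Write the stored values to the log to check them */
%put &=orgMean &=std;

data want;
  set original;
  where importantVar > &orgMean + 2*&std;
run;
```

Either route ends with the same two macro variables, so the final data step is unchanged.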
Answer 3
Score: 0
The PROC MEANS approach followed by a DATA step is a good one. Since you mention you're new to SAS, if you have experience with SQL, you might consider a PROC SQL approach. PROC SQL has a feature that is not part of the SQL standard: it will automatically "remerge" summary statistics back onto row data. This makes it straightforward to write a query that selects the outliers, e.g.:
proc sql ;
create table HighOutliers
as select *
from sashelp.cars
having MSRP > (mean(MSRP) + 2*std(MSRP))
;
quit ;
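Adapted to the question's own data, the same remerging query might look like the sketch below. The dataset and variable names (original, importantVar) are taken from the question; catching outliers on both sides with abs() is an assumption about what you want, not something the question specified.

```sas
proc sql;
  create table outliers as
  select *
  from original
  /* Summary functions mixed with row values trigger the remerge:
     each row is compared against the overall mean and std */
  having abs(importantVar - mean(importantVar)) > 2*std(importantVar);
quit;
```

When this runs, the log shows a NOTE saying the query requires remerging summary statistics back with the original data. That note is expected here and is not an error.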