Concatenate three variables into a single variable and avoid duplicates
Question
I have three variables. I want to concatenate them into one new variable and avoid duplicate values.
UPDATE:
The values of the variables are not always the same. In the last row of the example, I want to keep both values, separated by a semicolon.
Initial data:
Var_1 | Var_2 | Var_3 |
-------------------------------------|
Bucket_001 | | Bucket_001 |
Bucket_002 | Bucket_002 | |
Bucket_003 | Bucket_003 | Bucket_003 |
| Bucket_004 | |
| Bucket_005 | Bucket_005 |
| | Bucket_006 |
| Bucket_007 | Bucket_008 |
Result:
Var
----------
Bucket_001
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_006
Bucket_007;Bucket_008
Any help is much appreciated!
Answer 1
Score: 2
Given your additional requirement that "Values of the variables are not always the same", the code presented by Stu Sztukowski needs only a bit of polishing: use the curobs= option and then Proc Transpose.
Input data:
data have;
infile cards dlm='|' missover;
input (Var_1 Var_2 Var_3 ) (:$20.);
cards;
Bucket_001 | | Bucket_001 |
Bucket_002 | Bucket_002 | |
Bucket_003 | Bucket_003 | Bucket_003 |
| Bucket_004 | |
| Bucket_005 | Bucket_005 |
| | Bucket_006 |
| Bucket_007 | Bucket_008 |
;
proc print data=have;
run;
Code based on Stu's idea:
data want;
set
have(keep=var_1 rename=(var_1 = var) )
have(keep=var_2 rename=(var_2 = var) )
have(keep=var_3 rename=(var_3 = var) )
curobs=co
;
where NOT missing(var);
curobs=co;
run;
proc sort data=want nodupkey;
by curobs var;
run;
proc transpose data=want out=want(drop=_name_ curobs);
by curobs;
var var;
run;
data want;
set want;
var = catx(';',of col:);
keep var;
run;
proc print data=want;
run;
[EDIT:]
My version for "lazy programmers":
data want;
set have
curobs=co
;
curobs=co;
run;
proc transpose data=want out=want(drop=_name_);
by curobs;
var var:;
run;
proc sort data=want(where=(col1 is not null)) nodupkey;
by curobs col1;
run;
proc transpose data=want out=want(drop=_name_ curobs);
by curobs;
var col1;
run;
data want;
set want;
var = catx(';',of col:);
keep var;
run;
proc print data=want;
run;
The output in both cases is the same.
Answer 2
Score: 1
Stack your initial data on top of itself three times, keeping only one variable in each copy of the data. The first keeps var_1, the second var_2, and the third var_3. Rename each variable to var so that they stack into a single variable. Remove missing values with a where statement, then remove duplicates with proc sort.
data want;
set have(keep=var_1 rename=(var_1 = var) )
have(keep=var_2 rename=(var_2 = var) )
have(keep=var_3 rename=(var_3 = var) )
;
where NOT missing(var);
run;
We get this:
var
Bucket_001
Bucket_002
Bucket_003
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_001
Bucket_003
Bucket_005
Bucket_006
Now we'll remove all the duplicates, and we're done.
proc sort data=want nodupkey;
by var;
run;
var
Bucket_001
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_006
If you want to get really advanced, you can do this all in a single step with a hash object. By default a hash object rejects a key that is already stored, so each value is kept only once; declaring it with ordered: 'yes' makes Output() write the keys in sorted order.
data _null_;
length var $10.;
set have end=eof;
array v[*] var_:;
if(_N_ = 1) then do;
dcl hash h(ordered: 'yes');
h.defineKey('var');
h.defineDone();
call missing(var);
end;
/* Note: when specifying a key in the Add() method with a key-only
hash table, the data argument is required even though it does nothing */
do i = 1 to dim(v);
if(NOT missing(v[i]) ) then rc = h.Add(key: v[i], data: v[i]);
end;
if(eof) then h.Output(dataset: 'want');
run;
Answer 3
Score: 1
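For the rows where all non-missing values agree, the coalescec() function would also work. It returns only the first non-missing value, so it will not keep both values in a row such as Bucket_007 | Bucket_008; if your actual data contains such rows, use Stu's answer.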
data want;
set have;
var = coalescec(var_1, var_2, var_3);
keep var;
run;
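Because coalescec() returns only the first non-missing argument, a row with two different values would lose the second one. Below is a minimal sketch of a single DATA step that keeps every distinct value in a row. It assumes the input data set is have with character variables Var_1 to Var_3 of length $20, as in the question. The helper names i, j, and dup are illustrative.

```sas
data want;
  set have;
  /* Room for three $20 values plus two semicolons (assumed lengths) */
  length var $62;
  array v[*] var_1-var_3;
  do i = 1 to dim(v);
    /* dup = 1 when the same value already appeared earlier in this row */
    dup = 0;
    do j = 1 to i - 1;
      if v[j] = v[i] then dup = 1;
    end;
    /* catx() skips blank arguments, so the first value gets no leading ';' */
    if not missing(v[i]) and not dup then var = catx(';', var, v[i]);
  end;
  keep var;
run;
```

This keeps one output row per input row and needs no sorting or transposing, at the cost of a nested loop over the three variables.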
Here is a SQL option, since I haven't seen one posted.
proc sql;
create table want as select * from
(select distinct var_1 as var from have
union
select distinct var_2 as var from have
union
select distinct var_3 as var from have
) as a
where not missing(var);
quit;
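Since UNION (without ALL) already removes duplicate rows, the DISTINCT keywords and the outer query are not strictly needed. A shorter sketch, assuming the input table is named have as in the question:

```sas
proc sql;
  create table want as
  /* Column name var comes from the first query of the union */
  select var_1 as var from have where not missing(var_1)
  union
  select var_2 from have where not missing(var_2)
  union
  select var_3 from have where not missing(var_3);
quit;
```

Like the query above, this returns a stacked list with one distinct value per row; it does not build the semicolon-separated pair for rows that hold two different values.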
Answer 4
Score: 0
There is another method: using a temporary array.
The temporary array stores the unique values: if a value has not appeared before, it is added to the array, so at the end of the file the array holds every unique value. After that, you just output the values of the array, which is the result you want.
data temperatures;
infile cards dlm='|';
informat Var_1 Var_2 Var_3 $42.;
input Var_1 Var_2 Var_3 $;
datalines;
Bucket_001 | | Bucket_001 |
Bucket_002 | Bucket_002 | |
Bucket_003 | Bucket_004 | Bucket_005 |
| Bucket_006 | |
| Bucket_007 | Bucket_008 |
| | Bucket_006 |
;
run;
data want;
set temperatures end=eof;
array _var_ Var_1 Var_2 Var_3;
array _par_[%eval(&sysnobs.*3)]$42. _temporary_;
do i=1 to dim(_var_);
if _var_[i] not in _par_ then _par_[_n_*3+i-3]=_var_[i];
end;
if eof then do i=1 to dim(_par_);
if not missing(_par_[i]) then do;
var=_par_[i];
output;
end;
end;
keep var; /* write only the stacked variable, matching the result shown */
run;
The result is:
var
Bucket_001
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_006
Bucket_007
Bucket_008