将三个变量连接成一个变量并避免重复。

huangapple go评论54阅读模式
英文:

Concatenate three variables into a single variable and avoid duplicates

问题

我有三个变量。我的意图是将这三个变量连接成一个新变量,并避免重复值。

更新:
变量的值不总是相同的。对于最后一个示例,我希望保留两个值,用分号分隔。

初始数据:

变量1      | 变量2      | 变量3      |
-------------------------------------|
Bucket_001 |            | Bucket_001 |
Bucket_002 | Bucket_002 |            |
Bucket_003 | Bucket_003 | Bucket_003 |
           | Bucket_004 |            |
           | Bucket_005 | Bucket_005 |
           |            | Bucket_006 |
           | Bucket_007 | Bucket_008 |

结果:

变量
----------
Bucket_001
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_006
Bucket_007;Bucket_008

非常感谢任何帮助!

英文:

I have three variables. My intention is to concatenate three variables into one new variable and avoid duplicate values.

UPDATE:
Values of the variables are not always the same. For the last example i want to keep both values separated by a semicolon.

Initial data:

Var_1      | Var_2      | Var_3      |
-------------------------------------|
Bucket_001 |            | Bucket_001 |
Bucket_002 | Bucket_002 |            |
Bucket_003 | Bucket_003 | Bucket_003 |
           | Bucket_004 |            |
           | Bucket_005 | Bucket_005 |
           |            | Bucket_006 |
           | Bucket_007 | Bucket_008 |

Result:

Var
----------
Bucket_001
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_006
Bucket_007;Bucket_008

Any help much appreciated!

答案1

得分: 2

以下是您要翻译的内容:

With your additional requirement for "Values of the variables are not always the same." the code presented by Stu Sztukowski requires only a bit of polishing with use of the curobs= option and ther Proc Transpose.

Input data:

data have;
infile cards dlm='|' missover;
input (Var_1 Var_2 Var_3 ) (:$20.);
cards;
Bucket_001 |            | Bucket_001 |
Bucket_002 | Bucket_002 |            |
Bucket_003 | Bucket_003 | Bucket_003 |
           | Bucket_004 |            |
           | Bucket_005 | Bucket_005 |
           |            | Bucket_006 |
           | Bucket_007 | Bucket_008 |
;

proc print data=have;
run;

Code based on Stu's idea:

data want;
  set 
    have(keep=var_1 rename=(var_1 = var) )
    have(keep=var_2 rename=(var_2 = var) )
    have(keep=var_3 rename=(var_3 = var) )
  curobs=co
  ;
  where NOT missing(var);
  curobs=co;
run;

proc sort data=want nodupkey;
  by curobs var;
run;

proc transpose data=want out=want(drop=_name_ curobs);
  by curobs;
  var var;
run;

data want;
  set want;
  var = catx(';',of col:);
  keep var;
run;

proc print data=want;
run;

[EDIT:]

My version for "lazy programmers":

data want;
  set have
  curobs=co
  ;
  curobs=co;
run;

proc transpose data=want out=want(drop=_name_);
  by curobs;
  var var:;
run;

proc sort data=want(where=(col1 is not null)) nodupkey;
  by curobs col1;
run;

proc transpose data=want out=want(drop=_name_ curobs);
  by curobs;
  var col1;
run;

data want;
  set want;
  var = catx(';',of col:);
  keep var;
run;

proc print data=want;
run;

The output in both cases is the same:

将三个变量连接成一个变量并避免重复。

英文:

With your additional requirement for "Values of the variables are not always the same." the code presented by Stu Sztukowski requires only a bit of polishing with use of the curobs= option and ther Proc Transpose.

Input data:

data have;
infile cards dlm='|' missover;
input (Var_1 Var_2 Var_3 ) (:$20.);
cards;
Bucket_001 |            | Bucket_001 |
Bucket_002 | Bucket_002 |            |
Bucket_003 | Bucket_003 | Bucket_003 |
           | Bucket_004 |            |
           | Bucket_005 | Bucket_005 |
           |            | Bucket_006 |
           | Bucket_007 | Bucket_008 |
;

proc print data=have;
run;

Code based on Stu's idea:

data want;
  set 
    have(keep=var_1 rename=(var_1 = var) )
    have(keep=var_2 rename=(var_2 = var) )
    have(keep=var_3 rename=(var_3 = var) )
  curobs=co
  ;
  where NOT missing(var);
  curobs=co;
run;

proc sort data=want nodupkey;
  by curobs var;
run;

proc transpose data=want out=want(drop=_name_ curobs);
  by curobs;
  var var;
run;

data want;
  set want;
  var = catx(';',of col:);
  keep var;
run;

proc print data=want;
run;

[EDIT:]

My version for "lazy programmers":

data want;
  set have
  curobs=co
  ;
  curobs=co;
run;

proc transpose data=want out=want(drop=_name_);
  by curobs;
  var var:;
run;

proc sort data=want(where=(col1 is not null)) nodupkey;
  by curobs col1;
run;

proc transpose data=want out=want(drop=_name_ curobs);
  by curobs;
  var col1;
run;

data want;
  set want;
  var = catx(';',of col:);
  keep var;
run;

proc print data=want;
run;

The output in both cases is the same :

将三个变量连接成一个变量并避免重复。

答案2

得分: 1

将你的初始数据在顶部重复三次,每组数据中只保留一个变量。第一个组保留 var_1,第二个保留 var_2,第三个保留 var_3。将每个变量重命名为 var,以便它们堆叠到一个单一的变量中。使用 where 语句删除缺失值,然后使用 proc sort 删除重复项。

data want;
    set have(keep=var_1 rename=(var_1 = var) )
        have(keep=var_2 rename=(var_2 = var) )
        have(keep=var_3 rename=(var_3 = var) )
    ;
    where NOT missing(var);
run;

我们得到以下结果:

var
Bucket_001
Bucket_002
Bucket_003
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_001
Bucket_003
Bucket_005
Bucket_006

现在我们将删除所有重复项,完成了。

proc sort data=want nodupkey;
    by var;
run;

如果你想变得更高级,你可以在一个步骤中使用哈希表来完成所有这些操作。哈希表在读取数据时会自动排序并删除重复的键。

data _null_;
    length var $10.;
    
    set have end=eof;
    array v[*] var_:;
    
    if(_N_ = 1) then do;
        dcl hash h();
            h.defineKey('var');
        h.defineDone();
        call missing(var);
    end;
    
    /* 注意:在使用仅包含键的哈希表的Add()方法中指定键时,尽管它不起作用,但仍需要数据参数 */
    do i = 1 to dim(v);
        if(NOT missing(v[i]) ) then rc = h.Add(key: v[i], data: v[i]);
    end;
    
    if(eof) then h.Output(dataset: 'want');
run;
英文:

Stack your initial data on top of itself three times, keeping only one variable in each set of the data. The first will keep var_1, second var_2, and third var_3. Rename each variable to be var so that they stack into a single variable. Remove missing values with a where statement, then remove duplicates with proc sort.

data want;
    set have(keep=var_1 rename=(var_1 = var) )
        have(keep=var_2 rename=(var_2 = var) )
        have(keep=var_3 rename=(var_3 = var) )
    ;
    where NOT missing(var);
run;

We get this:

var
Bucket_001
Bucket_002
Bucket_003
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_001
Bucket_003
Bucket_005
Bucket_006

Now we'll remove all the duplicates, and we're done.

proc sort data=want nodupkey;
    by var;
run;
var
Bucket_001
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_006

If you want to get really advanced, you can do this all in a single step with a hash table. Hash tables automatically sort and remove duplicate keys when you read data into them.

data _null_;
    length var $10.;

    set have end=eof;
    array v[*] var_:;

    if(_N_ = 1) then do;
        dcl hash h();
            h.defineKey('var');
        h.defineDone();
        call missing(var);
    end;

    /* Note: when specifying a key in the Add() method with a key-only 
       hash table, the data argument is required even though it does nothing */
    do i = 1 to dim(v);
        if(NOT missing(v[i]) ) then rc = h.Add(key: v[i], data: v[i]);
    end;

    if(eof) then h.Output(dataset: 'want');
run;

答案3

得分: 1

在你的特定示例中,coalescec() 函数也可以使用。不确定它是否适用于你的实际数据,在那种情况下,请使用 Stu 的答案。

data want;
set have;
var = coalescec(var1, var2, var3);
keep var;
run;

SQL 选项,因为我没有看到其中之一发布过。

proc sql;
    create table want as select * from
    (select distinct var_1 as var from temperatures 
     union 
     select distinct var_2 as var from temperatures 
     union 
     select distinct var_3 as var from temperatures
     ) as a 
     where not missing(var);
quit;
英文:

In your particular example, the coalescec() function would also work. Not sure it would work in your actual data, in that case use Stu's answer.

data want;
set have;
var = coalescec(var1, var2, var3);
keep var;
run;

A SQL option as I haven't seen one of those posted.

proc sql;
    create table want as select * from
    (select distinct var_1 as var from temperatures 
     union 
     select distinct var_2 as var from temperatures 
     union 
     select distinct var_3 as var from temperatures
     ) as a 
     where not missing(var);
quit;

答案4

得分: 0

有另一种方法:使用一个临时数组。
临时数组用于存储唯一的值:如果一个值以前从未出现过,则将其添加到数组中,因此在文件末尾,该数组拥有每个唯一值。之后,您只需输出此数组的值,这就是您想要的。

data temperatures;
  infile cards dlm='|';
  informat Var_1 Var_2 Var_3 $42.;
  input Var_1 Var_2 Var_3 $;
  datalines;
Bucket_001 |            | Bucket_001 |
Bucket_002 | Bucket_002 |            |
Bucket_003 | Bucket_004 | Bucket_005 |
           | Bucket_006 |            |
           | Bucket_007 | Bucket_008 |
           |            | Bucket_006 |
;
run;

data want;
  set temperatures end=eof;
  array _var_ Var_1 Var_2 Var_3;
  array _par_[%eval(&sysnobs.*3)]$42. _temporary_;
  do i=1 to dim(_var_);
    if _var_[i] not in _par_ then _par_[_n_*3+i-3]=_var_[i];
  end;

  if eof then do i=1 to dim(_par_);
    if not missing(_par_[i]) then do;
      var=_par_[i];
      output;
    end;
  end;
run;

结果是:

var
Bucket_001
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_006
Bucket_007
Bucket_008
英文:

There is another method: Using a temporary array.
The temporary array is used to store unique value: If a value never showed before, add it to the array, so this array owns every unique value at the end of file. After that, you just output the values of this array, that is what you want.

data temperatures;
  infile cards dlm='|';
  informat Var_1 Var_2 Var_3 $42.;
  input Var_1 Var_2 Var_3 $;
  datalines;
Bucket_001 |            | Bucket_001 |
Bucket_002 | Bucket_002 |            |
Bucket_003 | Bucket_004 | Bucket_005 |
           | Bucket_006 |            |
           | Bucket_007 | Bucket_008 |
           |            | Bucket_006 |
;
run;

data want;
  set temperatures end=eof;
  array _var_ Var_1 Var_2 Var_3;
  array _par_[%eval(&sysnobs.*3)]$42. _temporary_;
  do i=1 to dim(_var_);
    if _var_[i] not in _par_ then _par_[_n_*3+i-3]=_var_[i];
  end;

  if eof then do i=1 to dim(_par_);
    if not missing(_par_[i]) then do;
      var=_par_[i];
      output;
    end;
  end;
run;

The result is:

var
Bucket_001
Bucket_002
Bucket_003
Bucket_004
Bucket_005
Bucket_006
Bucket_007
Bucket_008

huangapple
  • 本文由 发表于 2023年6月5日 22:09:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/76407281.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定