Is there a SAS function to flag a word that repeats in order across columns?

Question

Is there a way to flag rows where the word 'Add' is in sequence without any other word or missing in between?

I tried the array statement with the find function, but no luck!

Answer 1

Score: 1

This code finds every run of at least two consecutive 'Add' values in a row and saves the ranges of all such runs in a single comma-separated variable.

Sample data:

data have;
    input t1$ t2$ t3$ t4$ t5$ t6$ t7$ t8$ t9$ t10$;
    datalines;
Add Add No Add No Add . No Add .
Add No Add Add Add Add . . No .
Add Add Add No Add Add Add Add . .
;
run;

Code:

data want;
    set have;

    array t[*] t:;
    array col[10] $;
    length sequences $50.;

    /* If the current and previous values are both 'Add', record both variable names in col */
    do i = 1 to dim(t);
        if(i > 1 AND t[i] = 'Add' AND t[i-1] = 'Add') then do;
            col[i]   = vname(t[i]);
            col[i-1] = vname(t[i-1]);
        end;
    end;

    /* Create a comma-separated list of sequence ranges. For example:
       t1-t3,t5-t8
       t3-t6
       etc.
    */
    flag_start = 0;

    do i = 1 to dim(col);
        
        /* Find the start of the sequence */
        if(col[i] NE ' ' AND NOT flag_start) then do;
            seq_start  = col[i];
            flag_start = 1;
        end;

        /* Find the end of the sequence */
        if(col[i] = ' ' AND flag_start) then do;
            seq_end    = col[i-1];
            flag_start = 0;
        end;

        /* If we are between sequences, calculate the sequence range and save it */
        if(i > 1 AND col[i] = ' ' AND col[i-1] NE ' ') then do;
            seq_range = cats(seq_start, '-', seq_end);
            sequences = catx(',', sequences, seq_range);
        end;
    end;

    drop i flag_start seq_start seq_end seq_range col:;
run;

Output:

t1	t2	t3	t4	t5	t6	t7	t8	t9	t10	sequences
Add	Add	No	Add	No	Add		No	Add		t1-t2
Add	No	Add	Add	Add	Add			No		t3-t6
Add	Add	Add	No	Add	Add	Add	Add			t1-t3,t5-t8
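
If only a yes/no flag per row is needed, as the question asks, the same neighbour comparison can be reduced to a single loop. The following is a minimal sketch rather than part of the answer above; the names want_flag and add_run are illustrative assumptions.

```sas
data want_flag;
    set have;

    array t[*] t1-t10;

    /* add_run = 1 when any two adjacent columns both hold 'Add' */
    add_run = 0;

    /* Stop scanning as soon as the first adjacent pair is found */
    do i = 2 to dim(t) until (add_run);
        if t[i] = 'Add' and t[i-1] = 'Add' then add_run = 1;
    end;

    drop i;
run;
```

With the sample data, every row contains at least one adjacent pair of 'Add' values, so add_run should be 1 for all three rows.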

Answer 2

Score: 1

The presence of a target word in a T<index> column can be recorded in a single numeric flag by setting bit index-1 whenever that column holds the word.

This approach can flag up to 32 columns. For more than 32 columns you would need additional flag variables and some extra bookkeeping when calculating the flag value.

Example:

data have;
    input (t1-t10) ($);
    datalines;
Add Add No Add No Add . No Add .
Add No Add Add Add Add . . No .
Add Add Add No Add Add Add Add . .
;


data want;
  set have;
  array ts t1-t10;
  flag = 0;
  do over ts;
    flag = BOR (flag, BLSHIFT(ts='Add', _i_-1));
  end;

  format flag binary32.;
run;
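
The bit flag records where 'Add' occurs but does not by itself say whether two occurrences are adjacent. One possible way to derive that, sketched here as an assumption and not as part of the original answer, is to AND the flag with a copy of itself shifted right by one bit: any bit left set marks a pair of neighbouring columns that both hold 'Add'. The names want_run, mask and has_run are illustrative.

```sas
data want_run;
  set have;
  array ts[*] t1-t10;

  /* Build the same bit mask as above: bit i-1 is set when t<i> = 'Add' */
  mask = 0;
  do i = 1 to dim(ts);
    mask = bor(mask, blshift(ts[i] = 'Add', i - 1));
  end;

  /* A set bit in mask AND (mask >> 1) means t<k> and t<k+1> are both 'Add' */
  has_run = (band(mask, brshift(mask, 1)) > 0);

  drop i;
  format mask binary10.;
run;
```

For the first sample row the mask is 299 (columns 1, 2, 4, 6 and 9), the shifted copy is 149, and their bitwise AND is 1, so has_run is 1.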

Answer 3

Score: 0

I have a few solutions depending on when and how many sequences are allowed.

First, a sequence is defined as 2 or more consecutive time periods with 'Add'. For my solutions I used Richard's sample data.

Solution 1: Valid sequences begin at T1 until a break

* valid sequence begins at t1 until a break;
data want1;
set have;
length sequence $20;
if t1 = 'Add' and t2 = 'Add';  * if either T1 or T2 <> 'Add' then move on to next obs;

array t(*) t1-t10;  

do i = 3 to dim(t);  * start loop at t3 since we know t1 & t2 = 'Add';
    if t[i] ne 'Add' then do;
        sequence = cats('T1-T', put(i-1, 2.));
        output;
        leave;  * exit loop. move to next obs;
    end;
end;
drop i;
run;

Result:

t1	t2	t3	t4	t5	t6	t7	t8	t9	t10	sequence
Add	Add	No	Add	No	Add		No	Add		T1-T2
Add	Add	Add	No	Add	Add	Add	Add			T1-T3

Solution 2: The next solution still detects valid sequences beginning at T1, but allows breaks and other sequences beyond the first one.

* sequence begins at t1 with a break and another sequence occurs on same row;
data want2;
set have;
length sequence $20;
if t1 = 'Add' and t2 = 'Add';  * if either T1 or T2 <> 'Add' then move to next obs;

array t(*) t1-t10;

seq_strt = 1;  * start of sequence. start at 1 because of subsetting if;
break = 0;     * flag for break in sequence. start at 0 because of subsetting if;
sequence = '';

do i = 3 to dim(t);  * start loop at t3 since we know t1 & t2 = 'Add';
    * start of sequence - 2 consecutive 'Add' during break;
    if break = 1 and t[i] = 'Add' and t[i-1] = 'Add' then do;  * start of new sequence;
        break = 0;
        seq_strt = i-1;
    end;
    * end of sequence;
    else if break = 0 and t[i] ne 'Add' then do;
        break = 1;  * flag a break;
        sequence = catx(',', sequence, cats('T', put(seq_strt, 2.), '-T', put(i-1, 2.)));
    end;
end;
drop i seq_strt break;
run;

Result:

t1	t2	t3	t4	t5	t6	t7	t8	t9	t10	sequence
Add	Add	No	Add	No	Add		No	Add		T1-T2
Add	Add	Add	No	Add	Add	Add	Add			T1-T3,T5-T8

Solution 3: Finally, the last solution detects any sequence in any time period.

* capture any sequence at any period of time;
data want3;
set have;
length sequence $20;

array t(*) t1-t10;

seq_strt = 0;  * start of sequence;
break = 1;     * flag for break in sequence. start with break until new seq is found;
sequence = '';

do i = 2 to dim(t);  * start loop at t2 to compare with t1;
    * start of sequence - 2 consecutive 'Add' during break;
    if break = 1 and t[i] = 'Add' and t[i-1] = 'Add' then do;  * start of new sequence;
        break = 0;
        seq_strt = i-1;
    end;
    * end of sequence;
    else if break = 0 and t[i] ne 'Add' then do;
        break = 1;  * flag a break;
        sequence = catx(',', sequence, cats('T', put(seq_strt, 2.), '-T', put(i-1, 2.)));
    end;
end;
drop i seq_strt break;
run;

Result:

t1	t2	t3	t4	t5	t6	t7	t8	t9	t10	sequence
Add	Add	No	Add	No	Add		No	Add		T1-T2
Add	No	Add	Add	Add	Add			No		T3-T6
Add	Add	Add	No	Add	Add	Add	Add			T1-T3,T5-T8
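
As far as I can tell from the logic, the loops above record a sequence only when a non-'Add' value follows it, so a run that continues through t10 would not be listed. The sample data never hits this case. A minimal sketch of one way to cover it, assuming the same have data set, is to loop one position past the last column and treat that position as a break. The names want4 and is_add are illustrative, not from the answer.

```sas
data want4;
    set have;
    length sequence $50;

    array t[*] t1-t10;

    seq_strt = 0;   /* 0 means no run is currently open */
    sequence = '';

    do i = 1 to dim(t) + 1;
        /* The position past the last column always counts as a break */
        if i <= dim(t) then is_add = (t[i] = 'Add');
        else is_add = 0;

        /* A run opens at the first 'Add' after a break */
        if is_add and seq_strt = 0 then seq_strt = i;
        /* A run closes at i-1; keep it only if it spans 2 or more columns */
        else if not is_add and seq_strt > 0 then do;
            if i - seq_strt >= 2 then
                sequence = catx(',', sequence, cats('T', seq_strt, '-T', i - 1));
            seq_strt = 0;
        end;
    end;

    drop i is_add seq_strt;
run;
```

On the three sample rows this should give the same sequence values as Solution 3, and a row ending in Add Add at t9 and t10 would additionally yield T9-T10.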

huangapple
  • Posted on February 10, 2023, 06:03:51
  • Source link: https://go.coder-hub.com/75404907.html