Is there a SAS function to flag a word that repeats in order across columns?
Question
Is there a way to flag rows where the word 'Add' appears in consecutive columns, with no other word or missing value in between?
I tried an array statement with the find function, but had no luck.
Answer 1
Score: 1
This code finds every sequence of at least two 'Add' values in a row and saves all of the sequences to a single comma-separated variable.
Sample data:
data have;
    input t1$ t2$ t3$ t4$ t5$ t6$ t7$ t8$ t9$ t10$;
    datalines;
Add Add No Add No Add . No Add .
Add No Add Add Add Add . . No .
Add Add Add No Add Add Add Add . .
;
run;
Code:
data want;
    set have;
    array t[*] t:;
    array col[10] $;
    length sequences $50;

    /* Mark every column that belongs to a pair of adjacent 'Add' values */
    do i = 2 to dim(t);
        if t[i] = 'Add' and t[i-1] = 'Add' then do;
            col[i]   = vname(t[i]);
            col[i-1] = vname(t[i-1]);
        end;
    end;

    /* Create a comma-separated list with one range per sequence. For example:
         t1-t3,t5-t8
         t3-t6
       etc.
    */
    flag_start = 0;
    do i = 1 to dim(col);
        /* Find the start of the sequence */
        if col[i] ne ' ' and not flag_start then do;
            seq_start  = col[i];
            flag_start = 1;
        end;
        /* Find the end of the sequence, then calculate its range and save it */
        else if col[i] = ' ' and flag_start then do;
            seq_end    = col[i-1];
            flag_start = 0;
            sequences  = catx(',', sequences, cats(seq_start, '-', seq_end));
        end;
    end;

    /* A sequence that runs through the last column never reaches a blank,
       so it has to be closed after the loop */
    if flag_start then
        sequences = catx(',', sequences, cats(seq_start, '-', col[dim(col)]));

    drop i flag_start seq_start seq_end col:;
run;
Output:
t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  sequences
Add  Add  No   Add  No   Add       No   Add       t1-t2
Add  No   Add  Add  Add  Add            No        t3-t6
Add  Add  Add  No   Add  Add  Add  Add            t1-t3,t5-t8
Answer 2
Score: 1
The presence of a target word in each T<index> column can be recorded in a single binary flag value by setting one bit per column.
This approach flags up to 32 columns. For more than 32 columns you would need additional flag variables and some extra bookkeeping when calculating the flag value.
Example:
data have;
    input (t1-t10) ($);
    datalines;
Add Add No Add No Add . No Add .
Add No Add Add Add Add . . No .
Add Add Add No Add Add Add Add . .
;

data want;
    set have;
    array ts t1-t10;
    flag = 0;
    /* _i_ is the implicit index of DO OVER; column i sets bit i-1 when it holds 'Add' */
    do over ts;
        flag = BOR(flag, BLSHIFT(ts = 'Add', _i_ - 1));
    end;
    format flag binary32.;
run;
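The answer stops at building the bit flag. As a minimal sketch of how that flag could answer the original question, the step below ANDs the flag with a copy of itself shifted right by one bit, which leaves a bit set only where two adjacent columns both hold 'Add'. The data set name want_runs and the variable has_run are hypothetical names introduced for illustration; they are not part of the answer.

```sas
data have;
    input (t1-t10) ($);
    datalines;
Add Add No Add No Add . No Add .
Add No Add Add Add Add . . No .
Add Add Add No Add Add Add Add . .
;

data want_runs;  /* hypothetical name */
    set have;
    array ts t1-t10;
    flag = 0;
    do over ts;
        flag = bor(flag, blshift(ts = 'Add', _i_ - 1));
    end;
    /* Bit k of brshift(flag, 1) is bit k+1 of flag, so the AND keeps bit k
       only when columns k+1 and k+2 are both 'Add'. */
    has_run = (band(flag, brshift(flag, 1)) > 0);  /* hypothetical variable */
    format flag binary10.;
run;
```

With has_run in place, rows that contain at least two consecutive 'Add' values can be kept with a simple where has_run = 1 filter. Locating where each run starts and ends would still need a loop over the columns.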
Answer 3
Score: 0
I have a few solutions depending on when and how many sequences are allowed.
First, a sequence is defined as 2 or more consecutive time periods with 'Add'. For my solutions I used Richard's sample data.
Solution 1: Valid sequences begin at T1 until a break
* valid sequence begins at t1 until a break;
data want1;
    set have;
    length sequence $20;
    if t1 = 'Add' and t2 = 'Add';  * if either T1 or T2 <> 'Add' then move on to next obs;
    array t(*) t1-t10;
    do i = 3 to dim(t);  * start loop at t3 since we know t1 and t2 = 'Add';
        if t[i] ne 'Add' then do;
            sequence = cats('T1-T', put(i-1, 2.));
            output;
            leave;  * exit loop. move to next obs;
        end;
    end;
    drop i;
run;
Result:
t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  sequence
Add  Add  No   Add  No   Add       No   Add       T1-T2
Add  Add  Add  No   Add  Add  Add  Add            T1-T3
Solution 2: The next solution still detects valid sequences beginning at T1, but allows breaks and other sequences beyond the first one.
* sequence begins at t1 with a break and another sequence occurs on same row;
data want2;
    set have;
    length sequence $20;
    if t1 = 'Add' and t2 = 'Add';  * if either T1 or T2 <> 'Add' then move to next obs;
    array t(*) t1-t10;
    seq_strt = 1;  * start of sequence. start at 1 because of subsetting if;
    break = 0;     * flag for break in sequence. start at 0 because of subsetting if;
    sequence = '';
    do i = 3 to dim(t);  * start loop at t3 since we know t1 and t2 = 'Add';
        * start of sequence - 2 consecutive 'Add' during break;
        if break = 1 and t[i] = 'Add' and t[i-1] = 'Add' then do;  * start of new sequence;
            break = 0;
            seq_strt = i-1;
        end;
        * end of sequence;
        else if break = 0 and t[i] ne 'Add' then do;
            break = 1;  * flag a break;
            sequence = catx(',', sequence, cats('T', put(seq_strt, 2.), '-T', put(i-1, 2.)));
        end;
    end;
    * a sequence still open after the loop runs through the last column;
    if break = 0 then
        sequence = catx(',', sequence, cats('T', put(seq_strt, 2.), '-T', put(dim(t), 2.)));
    drop i seq_strt break;
run;
Result:
t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  sequence
Add  Add  No   Add  No   Add       No   Add       T1-T2
Add  Add  Add  No   Add  Add  Add  Add            T1-T3,T5-T8
Solution 3: Finally, the last solution detects any sequence in any time period.
* capture any sequence at any period of time;
data want3;
    set have;
    length sequence $20;
    array t(*) t1-t10;
    seq_strt = 0;  * start of sequence;
    break = 1;     * flag for break in sequence. start with break until new seq is found;
    sequence = '';
    do i = 2 to dim(t);  * start loop at t2 to compare with t1;
        * start of sequence - 2 consecutive 'Add' during break;
        if break = 1 and t[i] = 'Add' and t[i-1] = 'Add' then do;  * start of new sequence;
            break = 0;
            seq_strt = i-1;
        end;
        * end of sequence;
        else if break = 0 and t[i] ne 'Add' then do;
            break = 1;  * flag a break;
            sequence = catx(',', sequence, cats('T', put(seq_strt, 2.), '-T', put(i-1, 2.)));
        end;
    end;
    * a sequence still open after the loop runs through the last column;
    if break = 0 then
        sequence = catx(',', sequence, cats('T', put(seq_strt, 2.), '-T', put(dim(t), 2.)));
    drop i seq_strt break;
run;
Result:
t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  sequence
Add  Add  No   Add  No   Add       No   Add       T1-T2
Add  No   Add  Add  Add  Add            No        T3-T6
Add  Add  Add  No   Add  Add  Add  Add            T1-T3,T5-T8