英文:
Need to create a column in a dataset based on the value in a different dataset
问题
我有一个数据集,其中包含客户ID和每个ID的均匀随机变量,看起来像这样:
客户ID | 随机均匀变量 |
---|---|
1234 | 0.38223456 |
5678 | 0.7456789 |
*客户ID是唯一的,充当此表的主键
然后我有另一个数据集,其中有一个“结果数字段”和一个累积概率列,看起来像这样:
结果数 | 累积概率 |
---|---|
1 | 0.543678 |
2 | 0.756789 |
我的目标是将客户ID表中的均匀随机变量与第二个表中的累积概率列进行比较,以创建客户表中的新列。
因此,例如,客户表的第一行将在新列中具有值“1”,因为它小于或等于第一个累积概率。第二行将在新列中具有“2”,因为它大于第一个累积概率但小于第二个累积概率,依此类推。
我尝试过使用do循环和SQL查询,但都没有成功。
英文:
I have a dataset that has a client id, and a uniform random variable for each id that looks something like this:
ClientId | RandUnif |
---|---|
1234 | 0.38223456 |
5678 | 0.7456789 |
*client ids are unique and serve as the primary key for this table
I then have another data set that has a “number of outcomes field” and a cumulative probability column that looks like this:
NumOutcomes | CumulativeProb |
---|---|
1 | 0.543678 |
2 | 0.756789 |
My goal is to compare the uniform random variable from my client id table to the cumulative probability column in the second table in order to create a new column in the client table.
So, for instance the first row of the client table would have a value in the new column of “1” because it is less than our equal to the first cumulative probability. The second row would have a “2” in the new column because it’s greater than the first cumulative probability but less than the second, and so on.
I have tried using a do loop, and a sql query to no avail
答案1
得分: 1
您可以执行一个交叉连接(cross-join),其中RandomUnif <= CumulativeProb
。这将为您获得以下表格:
ClientID NumOutcomes RandUnif CumulativeProb
1234 1 0.38223456 0.543678
1234 2 0.38223456 0.756789
5678 2 0.7456789 0.756789
然后,您将按每个ClientID的CumulativeProb
的最小值进行筛选,因为最小值小于或等于第一个累积概率。您可以使用SQL的HAVING
子句来实现这一点:
proc sql;
create table want as
select ClientID
, NumOutcomes
, RandUnif
, CumulativeProb
from have, lookup
where RandUnif LE CumulativeProb
group by ClientID
having CumulativeProb = min(CumulativeProb)
;
quit;
最终结果如下:
ClientID NumOutcomes RandUnif CumulativeProb
1234 1 0.38223456 0.543678
5678 2 0.7456789 0.756789
英文:
You can do a cross-join where RandomUnif <= CumulativeProb
. This would get you a table like this:
ClientID NumOutcomes RandUnif CumulativeProb
1234 1 0.38223456 0.543678
1234 2 0.38223456 0.756789
5678 2 0.7456789 0.756789
From there, you'll take the minimum value of CumulativeProb
by each ClientID since the minimum value is less than or equal to the first cumulative probability. You can do this with a sql having
clause.
proc sql;
create table want as
select ClientID
, NumOutcomes
, RandUnif
, CumulativeProb
from have, lookup
where RandUnif LE CumulativeProb
group by ClientID
having CumulativeProb = min(CumulativeProb)
;
quit;
ClientID NumOutcomes RandUnif CumulativeProb
1234 1 0.38223456 0.543678
5678 2 0.7456789 0.756789
答案2
得分: 0
使用 Stu 的回答,以下是最终对我有效的代码:
proc sql;
create table output as
select clientid
, col2
, other_cols_needed
from (select *
from client_table t1
inner join cum_prob_table t2
on t1.rand_unif <= t2.cumulative_probability)
group by clientid
having cumulative_probability = min(cumulative_probability);
quit;
再次感谢!
英文:
Using the answer from Stu this is what ended up working for me:
proc sql;
create table output as
select clientid
, col2
, other_cols_needed
from (select *
from client_table t1
inner join cum_prob_table t2
on t1.rand_unif <= t2.cumulative_probability)
group by clientid
having cumulative_probability = min(cumulative_probability);
quit;
thanks again!
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论