How to map values to new column with upper and lower bounds

Question

Given a dataframe with two columns that describe a record type and the phase it is in, what is the most pythonic way of assigning unique values to a new column based on the phase and type? For example:

import pandas as pd

d = {'type': ['a','b','c','a','b','c','a','b','c'], 'phase': [-10,3,2,1,-7,-3,-1,-5,4]}
df_ = pd.DataFrame(data=d)
df_

  type  phase
0    a    -10
1    b      3
2    c      2
3    a      1
4    b     -7
5    c     -3
6    a     -1
7    b     -5
8    c      4

Contrived example mappings with upper and lower bounds:

a_phase = {
('<', -9):0.001,
-9:0.010,
-8:0.022,
-7:0.026,
-6:0.092,
-5:0.091,
-4:0.082,
-3:0.121,
-2:0.060,
-1:0.105,
0:0.018,
1:0.092,
2:0.092,
3:0.092,
4:0.092,
('>', 4):0.000,
}

b_phase = {
('<', -9):0.016,
-9:0.011,
-8:0.021,
-7:0.028,
-6:0.052,
-5:0.075,
-4:0.057,
-3:0.102,
-2:0.238,
-1:0.270,
0:0.034,
1:0.014,
2:0.061,
('>', 2):0.000,
}

c_phase = {
('<', -9):0.016,
-9:0.016,
-8:0.011,
-7:0.010,
-6:0.038,
-5:0.015,
-4:0.099,
-3:0.117,
-2:0.216,
-1:0.213,
0:0.008,
1:0.008,
2:0.008,
('>', 2):0.000,
}

I could use a bunch of np.where clauses, but that feels extremely inefficient, so I wanted to ask the community for suggestions. I also realize I can't use the comparison operators > and < as keys in my dictionary mappings; they are there for representation purposes only.

The ideal output would look like:

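  type  phase  weight
0    a    -10   0.001
1    b      3   0.000
2    c      2   0.008
3    a      1   0.092
4    b     -7   0.028
5    c     -3   0.117
6    a     -1   0.105
7    b     -5   0.075
8    c      4   0.000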

Answer 1

Score: 2

One option is to set up a nested dictionary keyed by type and map the values per group, possibly using pd.cut to bin the out-of-range phases. With many dictionaries, that might not be very efficient.
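The answer does not show this first option, so the following is only a minimal sketch of one way it could look. The `nested` layout (keys `below`, `start`, `exact`, `above`) and the helper `weights_for` are hypothetical names chosen for illustration, and the sketch assumes the exact phases are consecutive integers and that `phase` has no missing values.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({'type': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'],
                    'phase': [-10, 3, 2, 1, -7, -3, -1, -5, 4]})

# Hypothetical nested layout (not from the question): per type, the weight
# below the first exact phase, the weights for consecutive integer phases
# beginning at 'start', and the weight above the last exact phase.
nested = {
    'a': {'below': 0.001, 'start': -9, 'above': 0.000,
          'exact': [0.010, 0.022, 0.026, 0.092, 0.091, 0.082, 0.121,
                    0.060, 0.105, 0.018, 0.092, 0.092, 0.092, 0.092]},
    'b': {'below': 0.016, 'start': -9, 'above': 0.000,
          'exact': [0.011, 0.021, 0.028, 0.052, 0.075, 0.057,
                    0.102, 0.238, 0.270, 0.034, 0.014, 0.061]},
    'c': {'below': 0.016, 'start': -9, 'above': 0.000,
          'exact': [0.016, 0.011, 0.010, 0.038, 0.015, 0.099,
                    0.117, 0.216, 0.213, 0.008, 0.008, 0.008]},
}

def weights_for(phases, spec):
    """Bin the phases with pd.cut and look the bin number up in a weight table."""
    n = len(spec['exact'])
    # One bin per exact integer phase, plus an open-ended bin on each side.
    edges = [-np.inf] + [spec['start'] - 0.5 + i for i in range(n + 1)] + [np.inf]
    codes = pd.cut(phases, bins=edges, labels=False)  # integer bin index per row
    table = np.array([spec['below'], *spec['exact'], spec['above']])
    return table[codes.to_numpy()]

# Map each group with the specification of its own type.
df_['weight'] = np.nan
for t, grp in df_.groupby('type'):
    df_.loc[grp.index, 'weight'] = weights_for(grp['phase'], nested[t])
```

The loop runs once per type, which is why this approach may scale poorly when there are many dictionaries.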

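Another option, if you change your dictionaries a bit (here the lower bound becomes -Inf and the upper bound becomes a key just above the last exact phase), is to build a reference DataFrame and use merge_asof:

import pandas as pd

Inf = float('inf')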

a_phase = {
-Inf:0.001,
-9:0.010,
-8:0.022,
-7:0.026,
-6:0.092,
-5:0.091,
-4:0.082,
-3:0.121,
-2:0.060,
-1:0.105,
0:0.018,
1:0.092,
2:0.092,
3:0.092,
4:0.092,
4.1:0.000,
}
b_phase = {
-Inf:0.016,
-9:0.011,
-8:0.021,
-7:0.028,
-6:0.052,
-5:0.075,
-4:0.057,
-3:0.102,
-2:0.238,
-1:0.270,
0:0.034,
1:0.014,
2:0.061,
2.1:0.000,
}
c_phase = {
-Inf:0.016,
-9:0.016,
-8:0.011,
-7:0.010,
-6:0.038,
-5:0.015,
-4:0.099,
-3:0.117,
-2:0.216,
-1:0.213,
0:0.008,
1:0.008,
2:0.008,
2.1:0.000,
}

# Long-format lookup table: one row per (type, phase) pair with its weight
ref = (pd.DataFrame({'a': a_phase, 'b': b_phase, 'c': c_phase})
         .melt(var_name='type', value_name='weight', ignore_index=False)
         .rename_axis('phase').reset_index()
         .sort_values(by='phase')
         .dropna(subset=['weight'])
       )

# merge_asof needs both frames sorted on the key and matching key dtypes;
# the default direction='backward' picks the last ref phase <= the row's phase
out = (pd.merge_asof(df_.reset_index()
                        .astype({'phase': 'float64'})
                        .sort_values(by='phase'),
                     ref, by='type', on='phase')
         .set_index('index').reindex(df_.index)
       )

Output:

  type  phase  weight
0    a  -10.0   0.001
1    b    3.0   0.000
2    c    2.0   0.008
3    a    1.0   0.092
4    b   -7.0   0.028
5    c   -3.0   0.117
6    a   -1.0   0.105
7    b   -5.0   0.075
8    c    4.0   0.000
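To see why the -Inf key and the "just above the last phase" key work, here is a small sketch of how merge_asof matches rows with its default backward direction: each left row takes the last reference row of the same type whose phase is less than or equal to its own. The frames `left` and `small_ref` are made-up toy data, not the question's full tables.

```python
import pandas as pd

# Toy data: both frames must already be sorted by the 'on' key.
left = pd.DataFrame({'type': ['a', 'a', 'a', 'a'],
                     'phase': [-10.0, 2.0, 2.05, 7.0]})
small_ref = pd.DataFrame({'type': ['a', 'a', 'a'],
                          'phase': [float('-inf'), 2.0, 2.1],
                          'weight': [0.001, 0.092, 0.000]})

# Backward search: last small_ref.phase <= left.phase within the same type.
demo = pd.merge_asof(left, small_ref, on='phase', by='type')
print(demo)
```

Note that 2.05 still matches the key 2.0, so the 4.1 and 2.1 keys only act as a strict upper bound when the phases are integers.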

Answer 2

Score: 1

First, define the dictionaries with tuple keys for the lower and upper bounds:

a_phase = {
('<', -9):0.001,
-9:0.010,
-8:0.022,
-7:0.026,
-6:0.092,
-5:0.091,
-4:0.082,
-3:0.121,
-2:0.060,
-1:0.105,
0:0.018,
1:0.092,
2:0.092,
3:0.092,
4:0.092,
('>', 4):0.000,
}

b_phase = {
('<', -9):0.016,
-9:0.011,
-8:0.021,
-7:0.028,
-6:0.052,
-5:0.075,
-4:0.057,
-3:0.102,
-2:0.238,
-1:0.270,
0:0.034,
1:0.014,
2:0.061,
('>', 2):0.000,
}

c_phase = {
('<', -9):0.016,
-9:0.016,
-8:0.011,
-7:0.010,
-6:0.038,
-5:0.015,
-4:0.099,
-3:0.117,
-2:0.216,
-1:0.213,
0:0.008,
1:0.008,
2:0.008,
('>', 2):0.000,
}

Then create a DataFrame from the dictionaries:

d1 = {'a': a_phase,
      'b': b_phase,
      'c': c_phase}

df = pd.DataFrame(((k, *x) for k, v in d1.items() for x in v.items()),
                  columns=['type', 'phase', 'weight'])
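As a quick illustration of what that generator expression produces, the sketch below flattens a shortened, made-up version of the nested dictionary (`tiny` is a hypothetical name) into (type, phase, weight) rows. The tuple keys are kept as single values in the phase position.

```python
# Shortened, made-up mapping in the same shape as d1.
tiny = {'a': {('<', -9): 0.001, -9: 0.010},
        'b': {2: 0.061, ('>', 2): 0.000}}

# (k, *x) prepends the type to each (phase, weight) item of the inner dict.
rows = [(k, *x) for k, v in tiny.items() for x in v.items()]
for row in rows:
    print(row)
```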

Then left join to pick up the exact matches; phases below or above the bounds stay unmatched:

df_ = df_.merge(df, how='left')

Only the unmatched rows (those with a missing weight) need further processing. First extract the bound rows from the lookup table:

# Rows of df_ that found no exact match
mask = df_['weight'].isna()

# Keep only the tuple-keyed rows and split each tuple into operator and bound
df1 = (df[pd.to_numeric(df['phase'], errors='coerce').isna()]
            .assign(op=lambda x: x['phase'].str[0], phase=lambda x: x['phase'].str[1]))

print(df1)

   type  phase  weight op
0     a     -9   0.001  <
15    a      4   0.000  >
16    b     -9   0.016  <
29    b      2   0.000  >
30    c     -9   0.016  <
43    c      2   0.000  >

Then, for each operator, filter and merge by type, concatenate the two results, and assign them to the weight column for the rows selected by mask:

# Unmatched rows whose phase is below the lower bound of their type
s1 = (df_[mask].reset_index().merge(df1[df1['op'].eq('<')], how='left', on='type')
                .assign(weight = lambda x: x['weight_y'].where(x['phase_x'].lt(x['phase_y'])))
                .set_index('index')['weight'].dropna())

# Unmatched rows whose phase is above the upper bound of their type
s2 = (df_[mask].reset_index().merge(df1[df1['op'].eq('>')], how='left', on='type')
                .assign(weight = lambda x: x['weight_y'].where(x['phase_x'].gt(x['phase_y'])))
                .set_index('index')['weight'].dropna())

df_.loc[mask, 'weight'] = pd.concat([s1, s2])
print(df_)
  type phase  weight
0    a   -10   0.001
1    b     3   0.000
2    c     2   0.008
3    a     1   0.092
4    b    -7   0.028
5    c    -3   0.117
6    a    -1   0.105
7    b    -5   0.075
8    c     4   0.000
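For comparison, the same tuple-key dictionaries can also be read by a plain row-wise lookup instead of merges. This is a different technique from the one above and will be slower on large frames; it is only a sketch, with `lookup`, `small` and `demo` as hypothetical names and shortened made-up mappings.

```python
import pandas as pd

def lookup(mapping, phase):
    """Exact key first, then the ('<', n) and ('>', n) bound keys, else NaN."""
    if phase in mapping:
        return mapping[phase]
    for key, weight in mapping.items():
        if isinstance(key, tuple):
            op, bound = key
            if (op == '<' and phase < bound) or (op == '>' and phase > bound):
                return weight
    return float('nan')

# Shortened, made-up mappings in the same shape as a_phase and b_phase.
small = {'a': {('<', -9): 0.001, -9: 0.010, 4: 0.092, ('>', 4): 0.000},
         'b': {('<', -9): 0.016, 2: 0.061, ('>', 2): 0.000}}

demo = pd.DataFrame({'type': ['a', 'a', 'b', 'b', 'a'],
                     'phase': [-10, 4, 3, 2, 0]})

# One Python-level lookup per row, using the mapping of that row's type.
demo['weight'] = [lookup(small[t], p) for t, p in zip(demo['type'], demo['phase'])]
print(demo)
```

The last row shows what happens when a phase falls inside the bounds but has no exact key in the shortened mapping: it stays NaN.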
