August 9, 2023, 11:35:57

How to map values to new column with upper and lower bounds

Question

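Given a dataframe with two columns that describe a record's type and the phase it is in, what is the most Pythonic way to assign values to a new column based on the phase and type? For example: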

import pandas as pd

d = {'type': ['a','b','c','a','b','c','a','b','c'], 'phase': [-10,3,2,1,-7,-3,-1,-5,4]}
df_ = pd.DataFrame(data=d)
df_

Contrived example mappings with upper and lower bounds:

a_phase = {
('<', -9):0.001,
-9:0.010,
-8:0.022,
-7:0.026,
-6:0.092,
-5:0.091,
-4:0.082,
-3:0.121,
-2:0.060,
-1:0.105,
0:0.018,
1:0.092,
2:0.092,
3:0.092,
4:0.092,
('>',4):0.000,
}
b_phase = {
('<', -9):0.016,
-9:0.011,
-8:0.021,
-7:0.028,
-6:0.052,
-5:0.075,
-4:0.057,
-3:0.102,
-2:0.238,
-1:0.270,
0:0.034,
1:0.014,
2:0.061,
('>',2):0.000,
}
c_phase = {
('<', -9):0.016,
-9:0.016,
-8:0.011,
-7:0.010,
-6:0.038,
-5:0.015,
-4:0.099,
-3:0.117,
-2:0.216,
-1:0.213,
0:0.008,
1:0.008,
2:0.008,
('>',2):0.000,
}

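I could use a chain of np.where clauses, but that feels extremely inefficient, so I wanted to ask the community for suggestions. I also realize that I can't use the logical operators > and < as keys in my dictionary mappings; they are there for representation purposes only.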

The ideal output would look like:

Answer 1

Score: 2

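One option is to set up a nested dictionary and map the values per group, using cut to bin the ranges. With many dictionaries this might not be very efficient.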
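The first option is only described in words, so here is a minimal sketch of how it might look. The `tables` layout (lowest exact phase, highest exact phase, then a flat list of weights) is an assumption made for this illustration and is not the dictionary format from the question. It also assumes integer phases.

```python
import numpy as np
import pandas as pd

df_ = pd.DataFrame({'type': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'],
                    'phase': [-10, 3, 2, 1, -7, -3, -1, -5, 4]})

# Hypothetical layout per type: (lowest exact phase, highest exact phase, weights),
# where weights = [below-range weight, one weight per phase lo..hi, above-range weight].
tables = {
    'a': (-9, 4, [0.001, 0.010, 0.022, 0.026, 0.092, 0.091, 0.082, 0.121,
                  0.060, 0.105, 0.018, 0.092, 0.092, 0.092, 0.092, 0.000]),
    'b': (-9, 2, [0.016, 0.011, 0.021, 0.028, 0.052, 0.075, 0.057, 0.102,
                  0.238, 0.270, 0.034, 0.014, 0.061, 0.000]),
    'c': (-9, 2, [0.016, 0.016, 0.011, 0.010, 0.038, 0.015, 0.099, 0.117,
                  0.216, 0.213, 0.008, 0.008, 0.008, 0.000]),
}

parts = []
for typ, phase in df_.groupby('type')['phase']:
    lo, hi, weights = tables[typ]
    # Right-closed bins: (-inf, lo-1], (lo-1, lo], ..., (hi-1, hi], (hi, inf)
    edges = [-np.inf, *range(lo - 1, hi + 1), np.inf]
    # labels=False returns the integer position of the bin for each phase
    codes = pd.cut(phase, bins=edges, labels=False).to_numpy()
    parts.append(pd.Series(np.asarray(weights)[codes], index=phase.index))

# Assignment aligns on the original index, so row order is preserved
df_['weight'] = pd.concat(parts)
print(df_)
```

Each type needs its own pass through `cut`, which is why this approach scales poorly when there are many lookup tables.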

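Another option, if you change your dictionaries a bit (replace the '<' bound with -Inf and the '>' bound with a key just above the last exact phase), is to build a DataFrame and use merge_asof: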

Inf = float('inf')
a_phase = {
    -Inf: 0.001,
    -9: 0.010,
    -8: 0.022,
    -7: 0.026,
    -6: 0.092,
    -5: 0.091,
    -4: 0.082,
    -3: 0.121,
    -2: 0.060,
    -1: 0.105,
    0: 0.018,
    1: 0.092,
    2: 0.092,
    3: 0.092,
    4: 0.092,
    4.1: 0.000,
}
b_phase = {
    -Inf: 0.016,
    -9: 0.011,
    -8: 0.021,
    -7: 0.028,
    -6: 0.052,
    -5: 0.075,
    -4: 0.057,
    -3: 0.102,
    -2: 0.238,
    -1: 0.270,
    0: 0.034,
    1: 0.014,
    2: 0.061,
    2.1: 0.000,
}
c_phase = {
    -Inf: 0.016,
    -9: 0.016,
    -8: 0.011,
    -7: 0.010,
    -6: 0.038,
    -5: 0.015,
    -4: 0.099,
    -3: 0.117,
    -2: 0.216,
    -1: 0.213,
    0: 0.008,
    1: 0.008,
    2: 0.008,
    2.1: 0.000,
}
ref = (pd.DataFrame({'a': a_phase, 'b': b_phase, 'c': c_phase})
       .melt(var_name='type', value_name='weight', ignore_index=False)
       .rename_axis('phase').reset_index()
       .sort_values(by='phase')
       .dropna(subset='weight')
     )
out = (pd.merge_asof(df_.reset_index()
                    .astype({'phase': 'float64'})
                    .sort_values(by='phase'),
                 ref, by='type', on='phase')
       .set_index('index').reindex(df_.index)
     )

Output:

  type  phase  weight
0    a  -10.0   0.001
1    b    3.0   0.000
2    c    2.0   0.008
3    a    1.0   0.092
4    b   -7.0   0.028
5    c   -3.0   0.117
6    a   -1.0   0.105
7    b   -5.0   0.075
8    c    4.0   0.000
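The 4.1 and 2.1 keys work because merge_asof searches backward by default: each row receives the last reference row whose phase is less than or equal to its own. That is fine for integer phases, but a phase such as 4.05 would still pick up the weight for 4. A small sketch of a tighter upper bound, assuming a hypothetical single-type table and using np.nextafter to get the smallest float above 4:

```python
import numpy as np
import pandas as pd

# Smallest float strictly greater than 4 (stands in for the 4.1 key)
upper = np.nextafter(4.0, np.inf)

# Hypothetical, shortened lookup table for a single type
ref = pd.DataFrame({'phase': [-np.inf, 3.0, 4.0, upper],
                    'weight': [0.001, 0.092, 0.092, 0.0]})
obs = pd.DataFrame({'phase': [-10.0, 4.0, 4.05, 7.0]})

# direction='backward' (the default): last ref row with ref.phase <= obs.phase
out = pd.merge_asof(obs, ref, on='phase')
print(out)
```

Here 4.05 and 7.0 both fall into the "greater than 4" bucket, while exactly 4.0 keeps its own weight.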

Answer 2

Score: 1

First, write the dictionaries with tuples as keys for the lower and upper bounds:

a_phase = {
    ('<', -9): 0.001,
    -9: 0.010,
    -8: 0.022,
    -7: 0.026,
    -6: 0.092,
    -5: 0.091,
    -4: 0.082,
    -3: 0.121,
    -2: 0.060,
    -1: 0.105,
    0: 0.018,
    1: 0.092,
    2: 0.092,
    3: 0.092,
    4: 0.092,
    ('>', 4): 0.000,
}
b_phase = {
    ('<', -9): 0.016,
    -9: 0.011,
    -8: 0.021,
    -7: 0.028,
    -6: 0.052,
    -5: 0.075,
    -4: 0.057,
    -3: 0.102,
    -2: 0.238,
    -1: 0.270,
    0: 0.034,
    1: 0.014,
    2: 0.061,
    ('>', 2): 0.000,
}
c_phase = {
    ('<', -9): 0.016,
    -9: 0.016,
    -8: 0.011,
    -7: 0.010,
    -6: 0.038,
    -5: 0.015,
    -4: 0.099,
    -3: 0.117,
    -2: 0.216,
    -1: 0.213,
    0: 0.008,
    1: 0.008,
    2: 0.008,
    ('>', 2): 0.000,
}

Then create a DataFrame from the dictionaries:

d1 = {'a': a_phase,
      'b': b_phase,
      'c': c_phase}
df = pd.DataFrame(((k, *x) for k, v in d1.items() for x in v.items()),
                  columns=['type', 'phase', 'weight'])

Then left join to pick up the exactly matched values; phases below or above the bounds stay unmatched and get a missing weight:

df_ = df_.merge(df, how='left')

Only the unmatched rows (those with a missing weight) need further processing. First extract the bound entries from the lookup table:

mask = df_['weight'].isna()
df1 = (df[pd.to_numeric(df['phase'], errors='coerce').isna()]
        .assign(op=lambda x: x['phase'].str[0], phase=lambda x: x['phase'].str[1]))
print(df1)
   type phase  weight op
0     a    -9   0.001  <
15    a     4   0.000  >
16    b    -9   0.016  <
29    b     2   0.000  >
30    c    -9   0.016  <
43    c     2   0.000  >

Then filter by operator and merge on type for each bound, concatenate the two results, and assign them to the weight column for the rows selected by the mask:

s1 = (df_[mask].reset_index().merge(df1[df1['op'].eq('<')], how='left', on='type')
        .assign(weight=lambda x: x['weight_y'].where(x['phase_x'].lt(x['phase_y'])))
        .set_index('index')['weight'].dropna())
s2 = (df_[mask].reset_index().merge(df1[df1['op'].eq('>')], how='left', on='type')
        .assign(weight=lambda x: x['weight_y'].where(x['phase_x'].gt(x['phase_y'])))
        .set_index('index')['weight'].dropna())
df_.loc[mask, 'weight'] = pd.concat([s1, s2])
print(df_)
  type  phase  weight
0    a    -10   0.001
1    b      3   0.000
2    c      2   0.008
3    a      1   0.092
4    b     -7   0.028
5    c     -3   0.117
6    a     -1   0.105
7    b     -5   0.075
8    c      4   0.000
