Vectorization of column search by dynamic value in pandas
Question
I am just starting to learn pandas, and for days now I have been unable to find the fastest way to compute this: how do I get, for each row, the value of the column whose name is built from that row's values in columns 'a' and 'b'?
Below is an example of the initial data.
index | a | b | a1_b1_name | a1_b1_foo_bar | a2_b1_name | a2_b1_foo_bar | a1_b2_name | a1_b2_foo_bar | a2_b2_name | a2_b2_foo_bar | a1_b3_name | a1_b3_foo_bar | a2_b3_name | a2_b3_foo_bar |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | value1 | value2 | value3 | value4 | value5 | value6 | value7 | value8 | value9 | value10 | value11 | value12 |
1 | 2 | 1 | value13 | value14 | value15 | value16 | value17 | value18 | value19 | value20 | value21 | value22 | value23 | value24 |
2 | 2 | 2 | value25 | value26 | value27 | value28 | value29 | value30 | value31 | value32 | value33 | value34 | value35 | value36 |
3 | 1 | 1 | value37 | value38 | value39 | value40 | value41 | value42 | value43 | value44 | value45 | value46 | value47 | value48 |
4 | 2 | 3 | value49 | value50 | value51 | value52 | value53 | value54 | value55 | value56 | value57 | value58 | value59 | value60 |
The number of columns named like "aN_bM_name" is planned to be much larger, about 40. The number of rows will be in the tens of thousands.
I need to create a new column 'name' based on the data of the table as quickly as possible and preferably without loops, using the power of pandas vectorization.
Like this one:
index | name | foo_bar |
---|---|---|
0 | value5 | value6 |
1 | value15 | value16 |
2 | value31 | value32 |
3 | value37 | value38 |
4 | value59 | value60 |
I was only able to do this by looping through the columns. But it takes more time than I'd like:
    for col in df.columns:
        df['name'] = np.where(
            col == 'a' + df['a'].astype('Int16').astype(str) + '_b' + df['b'].astype('Int16').astype(str) + '_name',
            df[col].values,
            df['name'],
        )
Answer 1
Score: 1
Original question
(See the first version of the question, which this part addresses.)
This is a variant of an indexing lookup: you first need to pre-process the input columns a/b so that they match the column names:
    import numpy as np
    import pandas as pd

    # name of the column to read from, one per row
    target = 'a' + df['a'].astype(str) + '_b' + df['b'].astype(str) + '_name'
    # idx: position of each row's target among the unique names in cols
    idx, cols = pd.factorize(target)
    out = pd.DataFrame({'index': df['index'],
                        'values': df.reindex(cols, axis=1).to_numpy()
                                    [np.arange(len(df)), idx],
                        })

    # or, for a new column in the original DataFrame
    # df['new'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Output:
       index   values
    0      0   value3
    1      1   value8
    2      2  value16
    3      3  value19
    4      4  value30
Intermediate target:
    0    a1_b2_name
    1    a2_b1_name
    2    a2_b2_name
    3    a1_b1_name
    4    a2_b3_name
    dtype: object
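To check the lookup against the table shown in the question, here is a minimal, self-contained sketch. The way the sample frame is rebuilt (the `pairs` and `suffixes` helper lists, and keeping the row labels as the default index rather than an 'index' column) is an assumption made for illustration; the lookup itself is the factorize/reindex approach described above.

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample table: value1..value60, 12 values per row.
# `pairs` and `suffixes` are illustrative helpers, not part of the original post.
suffixes = ["name", "foo_bar"]
pairs = [(1, 1), (2, 1), (1, 2), (2, 2), (1, 3), (2, 3)]  # column order in the question
columns = [f"a{a}_b{b}_{s}" for a, b in pairs for s in suffixes]
values = np.array([f"value{i}" for i in range(1, 61)]).reshape(5, 12)
df = pd.DataFrame(values, columns=columns)
df.insert(0, "a", [1, 2, 2, 1, 2])
df.insert(1, "b", [2, 1, 2, 1, 3])

# Name of the column to read from, one per row.
target = "a" + df["a"].astype(str) + "_b" + df["b"].astype(str) + "_name"

# idx[i] is the position of row i's target among the unique names in cols.
idx, cols = pd.factorize(target)

# Keep only the columns that are actually needed, then pick one cell per row
# with NumPy integer-array indexing (row i, column idx[i]).
df["name"] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

print(df["name"].tolist())
# ['value5', 'value15', 'value31', 'value37', 'value59']
```

Because the columns are selected with reindex rather than df[cols], a target name that has no matching column yields NaN for that row instead of raising a KeyError.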
Multiple columns
One option is to reshape and merge:
    target = 'a' + df['a'].astype(str) + '_b' + df['b'].astype(str)
    tmp = df.drop(columns=['index', 'a', 'b'])
    # split e.g. 'a1_b1_name' on its last underscore into ('a1_b1', 'name')
    tmp.columns = tmp.columns.str.rsplit('_', n=1, expand=True)
    out = (df
           .reset_index()
           .merge(tmp.stack(level=0), left_on=['index', target], right_index=True)
           .set_index('index')[['name', 'foo']]
          )
Output:
              name      foo
    index
    0       value5   value6
    1      value15  value16
    2      value31  value32
    3      value37  value38
    4      value59  value60
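Note that the snippet above splits each column name on its last underscore, which fits suffixes such as name and foo but not the foo_bar columns in the question's table ('a1_b1_foo_bar' would split into 'a1_b1_foo' and 'bar'). One possible workaround, sketched here rather than taken from the original answer, is to skip the reshape and reuse the factorize lookup once per suffix. The helper name `lookup_suffix` and the rebuilt sample frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample table (illustrative construction).
suffixes = ["name", "foo_bar"]
pairs = [(1, 1), (2, 1), (1, 2), (2, 2), (1, 3), (2, 3)]
columns = [f"a{a}_b{b}_{s}" for a, b in pairs for s in suffixes]
values = np.array([f"value{i}" for i in range(1, 61)]).reshape(5, 12)
df = pd.DataFrame(values, columns=columns)
df.insert(0, "a", [1, 2, 2, 1, 2])
df.insert(1, "b", [2, 1, 2, 1, 3])

# Per-row prefix such as 'a1_b2', shared by every suffix.
key = "a" + df["a"].astype(str) + "_b" + df["b"].astype(str)


def lookup_suffix(frame, key, suffix):
    """Hypothetical helper: per-row value of the column '<key>_<suffix>'."""
    idx, cols = pd.factorize(key + "_" + suffix)
    return frame.reindex(cols, axis=1).to_numpy()[np.arange(len(frame)), idx]


# The only Python-level loop runs over the suffixes, not over rows or columns.
out = pd.DataFrame({s: lookup_suffix(df, key, s) for s in suffixes}, index=df.index)
print(out)
```

With the sample data this gives value5/value6, value15/value16, value31/value32, value37/value38 and value59/value60 for rows 0 to 4, matching the result requested in the question.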