Pandas 中根据动态值进行列搜索的向量化处理

huangapple go评论68阅读模式
英文:

Vectorization of column search by dynamic value in pandas

问题

I understand your request. Here is the translated code without any additional information:

for col in df.columns:
     df['name'] = np.where(col == 'a' + (df['a'].astype('Int16').astype(str)) + '_b' + (df['b'].astype('Int16').astype(str)) + '_name', df[col].values, df['name'])
英文:

I am starting to learn Pandas. And which day I can not solve the fastest way to calculate. How to get for each row the value of a column by a unique name, composed of column 'a', 'b' values?

Below is an example of the initial data.

index a b a1_b1_name a1_b1_foo_bar a2_b1_name a2_b1_foo_bar a1_b2_name a1_b2_foo_bar a2_b2_name a2_b2_foo_bar a1_b3_name a1_b3_foo_bar a2_b3_name a2_b3_foo_bar
0 1 2 value1 value2 value3 value4 value5 value6 value7 value8 value9 value10 value11 value12
1 2 1 value13 value14 value15 value16 value17 value18 value19 value20 value21 value22 value23 value24
2 2 2 value25 value26 value27 value28 value29 value30 value31 value32 value33 value34 value35 value36
3 1 1 value37 value38 value39 value40 value41 value42 value43 value44 value45 value46 value47 value48
4 2 3 value49 value50 value51 value52 value53 value54 value55 value56 value57 value58 value59 value60

The number of columns with the values "a _b _name " is planned to be much larger, about 40. The number of rows will be in the tens of thousands.

I need to create a new column 'name' based on the data of the table as quickly as possible and preferably without loops, using the power of pandas vectorization.

Like this one:

index name foo_bar
0 value5 value6
1 value15 value16
2 value31 value32
3 value37 value38
4 value59 value60

I was only able to do this by looping through the columns. But it takes more time than I'd like:

for col in df.columns:
     df['name'] = np.where(col == 'a' + (df['a'].astype('Int16').astype(str)) + '_b' + (df['b'].astype('Int16').astype(str)) + '_name', df[col].values, df['name'])

答案1

得分: 1

这是关于索引查找的一种变体,首先需要预处理输入列a/b以匹配列名:

target = 'a' + df['a'].astype(str) + '_b' + df['b'].astype(str) + '_name'

idx, cols = pd.factorize(target)

out = pd.DataFrame({'index': df['index'],
                    'values': df.reindex(cols, axis=1).to_numpy()
                              [np.arange(len(df)), idx],
                    })

# 或者,对于原始DataFrame中的新列
# df['new'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

输出:

   index   values
0      0   value3
1      1   value8
2      2  value16
3      3  value19
4      4  value30

中间的target:

0    a1_b2_name
1    a2_b1_name
2    a2_b2_name
3    a1_b1_name
4    a2_b3_name
dtype: object

多列的情况:
一种选项是重新塑造和合并:

target = 'a' + df['a'].astype(str) + '_b' + df['b'].astype(str)

tmp = df.drop(columns=['index', 'a', 'b'])
tmp.columns = tmp.columns.str.rsplit('_', n=1, expand=True)

out = (df
   .reset_index()
   .merge(tmp.stack(level=0), left_on=['index', target], right_index=True)
   .set_index('index')[['name', 'foo']]
)

输出:

          name      foo
index                  
0       value5   value6
1      value15  value16
2      value31  value32
3      value37  value38
4      value59  value60

请注意,这些代码示例中包含了链接,可以点击查看原始问题或了解更多信息。

英文:

original question

Cf. first version of the question

This is a variant on an indexing lookup, you first need to pre-process your input columns a/b to match the column names:

target = 'a'+df['a'].astype(str)+'_b'+df['b'].astype(str)+'_name'

idx, cols = pd.factorize(target)

out = pd.DataFrame({'index': df['index'],
                    'values': df.reindex(cols, axis=1).to_numpy()
                              [np.arange(len(df)), idx],
                    })

# or, for a new column in the original DataFrame
# df['new'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

Output:

   index   values
0      0   value3
1      1   value8
2      2  value16
3      3  value19
4      4  value30

Intermediate target:

0    a1_b2_name
1    a2_b1_name
2    a2_b2_name
3    a1_b1_name
4    a2_b3_name
dtype: object

multiple columns:

One option is to reshape and merge:

target = 'a'+df['a'].astype(str)+'_b'+df['b'].astype(str)

tmp = df.drop(columns=['index', 'a', 'b'])
tmp.columns = tmp.columns.str.rsplit('_', n=1, expand=True)

out = (df
   .reset_index()
   .merge(tmp.stack(level=0), left_on=['index', target], right_index=True)
   .set_index('index')[['name', 'foo']]
)

Output:

          name      foo
index                  
0       value5   value6
1      value15  value16
2      value31  value32
3      value37  value38
4      value59  value60

huangapple
  • 本文由 发表于 2023年6月26日 19:26:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/76556232.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定