How to group a pandas.DataFrame's columns based on the index and values of another pandas.Series?

Question

I'm trying to group a DataFrame's columns based on the values and index of another pandas.Series. The Series' index refers to the DataFrame's columns, but it may contain additional entries that are not columns of the DataFrame. What is the most Pythonic way to do this?

For further clarity, here's the unit test I'm trying to resolve (using pytest):

import pandas as pd

# `clb` is the module under test; it is not shown in this post.
def test_sum_weights_by_classification_labels_default_arguments():
    portfolio_weights = pd.DataFrame([[0.1, 0.3, 0.4, 0.2],
                                      [0.25, 0.3, 0.25, 0.2],
                                      [0.2, 0.3, 0.1, 0.4]],
                                     index=['2001-01-02', '2001-01-03', '2001-01-04'],
                                     columns=['ABC', 'DEF', 'UVW', 'XYZ'])

    security_classification = pd.Series(['Consumer', 'Energy', 'Consumer', 'Materials', 'Financials', 'Energy'],
                                        index=['ABC', 'DEF', 'GHI', 'RST', 'UVW', 'XYZ'],
                                        name='Classification')

    result_sector_weights = pd.DataFrame([[0.1, 0.5, 0.4],
                                          [0.25, 0.5, 0.25],
                                          [0.2, 0.7, 0.1]],
                                         index=['2001-01-02', '2001-01-03', '2001-01-04'],
                                         columns=['Consumer', 'Energy', 'Financials'])

    pd.testing.assert_frame_equal(clb.sum_weights_by_classification_labels(portfolio_weights, security_classification),
                                  result_sector_weights)

Many thanks in advance!

Answer 1

Score: 0

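After further research, I found a solution. Here is what I came up with, using pandas.Index.map on the DataFrame's columns to translate each column name into its classification label and then summing the columns that share a label: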

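def sum_weights_by_classification_labels(security_weights, security_classification):
    # Work on a copy so the caller's DataFrame is left untouched.
    classification_weights = security_weights.copy()
    # Replace each column name (a security) with its classification label.
    classification_weights.columns = classification_weights.columns.map(security_classification)
    # Sum the columns that share a label. Note: groupby(axis=1) is deprecated since pandas 2.1.
    classification_weights = classification_weights.groupby(classification_weights.columns, axis=1).sum()

    return classification_weights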
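Since `groupby(..., axis=1)` is deprecated in recent pandas releases, the same idea can be sketched with a transpose instead. This is only a minimal variant of the approach above, not a replacement endorsed by the original post; the function name `sum_weights_by_labels_transposed` is made up for illustration, and the sample data is taken from the question's unit test.

```python
import pandas as pd


def sum_weights_by_labels_transposed(security_weights, security_classification):
    """Hypothetical variant: group rows of the transposed frame instead of using axis=1."""
    # Look up the label of every column; the extra entries in the Series are simply ignored.
    labels = security_weights.columns.map(security_classification)
    # Transpose so securities become rows, sum the rows per label, then transpose back.
    return security_weights.T.groupby(labels).sum().T


# Sample data copied from the question's unit test.
portfolio_weights = pd.DataFrame([[0.1, 0.3, 0.4, 0.2],
                                  [0.25, 0.3, 0.25, 0.2],
                                  [0.2, 0.3, 0.1, 0.4]],
                                 index=['2001-01-02', '2001-01-03', '2001-01-04'],
                                 columns=['ABC', 'DEF', 'UVW', 'XYZ'])

security_classification = pd.Series(['Consumer', 'Energy', 'Consumer', 'Materials', 'Financials', 'Energy'],
                                    index=['ABC', 'DEF', 'GHI', 'RST', 'UVW', 'XYZ'],
                                    name='Classification')

sector_weights = sum_weights_by_labels_transposed(portfolio_weights, security_classification)
print(sector_weights)
```

The input frame is never modified here, so no explicit copy is needed.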

Alternatively, using pandas.DataFrame.merge:

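def sum_weights_by_classification_labels(security_weights, security_classification):
    # Transpose so that each security becomes a row that can be joined on its index.
    security_weights_transposed = security_weights.transpose()
    # Left join: keep only the securities in the portfolio and attach their labels.
    merged_data = security_weights_transposed.merge(security_classification, how='left', left_index=True,
                                                    right_index=True)
    # Sum the rows per label, then transpose back to dates x labels.
    classification_weights = merged_data.groupby(security_classification.name).sum().transpose()

    return classification_weights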

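For the second solution, the following line must be added to the unit test. The Series has to be named in order to be merged (the added column needs a name), and that name ends up as the name of the result's columns index, so the expected DataFrame must carry it as well: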

result_sector_weights.columns.name = security_classification.name
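To see where that name comes from, here is a small sketch of the merge approach on a single row of the question's data. It also shows one possible way to avoid touching the test, namely clearing the name with `rename_axis` before returning; that alternative is my own assumption and not part of the original answer.

```python
import pandas as pd

weights = pd.DataFrame([[0.1, 0.3, 0.4, 0.2]],
                       index=['2001-01-02'],
                       columns=['ABC', 'DEF', 'UVW', 'XYZ'])
labels = pd.Series(['Consumer', 'Energy', 'Financials', 'Energy'],
                   index=['ABC', 'DEF', 'UVW', 'XYZ'],
                   name='Classification')

# Same steps as the merge-based solution.
merged = weights.T.merge(labels, how='left', left_index=True, right_index=True)
by_label = merged.groupby(labels.name).sum().T

# The grouping column's name is carried over as the name of the columns index.
print(by_label.columns.name)   # Classification

# Assumed alternative to editing the test: drop the name from the columns index.
unnamed = by_label.rename_axis(None, axis=1)
print(unnamed.columns.name)    # None
```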

I'm keeping this post hoping it might help someone in the future.

This is the way...

Posted by huangapple on January 3, 2020 at 23:07:04
Source: https://go.coder-hub.com/59580895.html