Pivoting a DataFrame to convert certain rows into columns based on one specific column

Question

I have a DNS dataset whose records look like this:

    Query_Type  Query_name    Response_ttl  ip4_address  NS_Name           MX_Name
    A           google.com    400           1.1.1.1      null              null
    A           google.com    600           2.2.2.2      null              null
    NS          google.com    500           3.3.3.3      ns1.google.com    null
    MX          google.com    400           null         null              gmail.com
    A           facebook.com  400           4.4.4.4      null              null
    NS          facebook.com  500           5.5.5.5      ns1.facebook.com  null
    ...

I want the output table to have one row per Query_name, with the other fields spread into columns per query type, like this:

    Query_name    Query_type_A_ip   Query_type_A_ttl  Query_type_NS_Name  Query_type_NS_ttl  ...
    google.com    1.1.1.1,2.2.2.2   avg(400, 600)     ns1.google.com      500
    facebook.com  4.4.4.4           avg(TTLs of A)    ....

I know that the pandas library has a pivot function for this kind of thing, but I don't know how to use it here. Please help.

Answer 1

Score: 3

I think you need DataFrame.melt first, then remove the rows with missing values using DataFrame.dropna, reshape with DataFrame.pivot_table, and finally flatten the MultiIndex columns with map:

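# Keep the original df intact; store the reshaped result under a new name
df_wide = (df.melt(id_vars=['Query_Type', 'Query_name', 'Response_ttl'], value_name='ip')
             .dropna(subset=['ip'])  # drop the cells that were null for this record type
             .pivot_table(index='Query_name',
                          columns=['Query_Type', 'variable'],
                          aggfunc={'Response_ttl': 'mean',
                                   'ip': lambda x: ', '.join(x.astype(str))})
             .sort_index(axis=1, level=[1, 2]))
# Flatten the three-level column MultiIndex into plain strings
df_wide.columns = df_wide.columns.map(lambda x: f'{x[0]}_{x[1]}_{x[2]}')
df_wide = df_wide.reset_index()
print(df_wide)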
     Query_name  Response_ttl_A_ip4_address  ip_A_ip4_address  \
0  facebook.com                       400.0           4.4.4.4
1    google.com                       500.0  1.1.1.1, 2.2.2.2

   Response_ttl_MX_MX_Name ip_MX_MX_Name  Response_ttl_NS_NS_Name  \
0                      NaN           NaN                    500.0
1                    400.0     gmail.com                    500.0

      ip_NS_NS_Name  Response_ttl_NS_ip4_address ip_NS_ip4_address
0  ns1.facebook.com                        500.0           5.5.5.5
1    ns1.google.com                        500.0           3.3.3.3
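
To see what the first two steps produce before pivoting, here is a minimal sketch. It assumes the sample rows from the question, with the null cells represented as None. If they are literal "null" strings in your data, they would need converting to real missing values first, otherwise dropna keeps them. The name long_df is only illustrative.

```python
import pandas as pd

# Sample rows from the question; None stands in for the "null" cells (an assumption)
df = pd.DataFrame({
    "Query_Type": ["A", "A", "NS", "MX", "A", "NS"],
    "Query_name": ["google.com", "google.com", "google.com",
                   "google.com", "facebook.com", "facebook.com"],
    "Response_ttl": [400, 600, 500, 400, 400, 500],
    "ip4_address": ["1.1.1.1", "2.2.2.2", "3.3.3.3", None, "4.4.4.4", "5.5.5.5"],
    "NS_Name": [None, None, "ns1.google.com", None, None, "ns1.facebook.com"],
    "MX_Name": [None, None, None, "gmail.com", None, None],
})

# melt stacks ip4_address / NS_Name / MX_Name into two columns:
# "variable" (the source column name) and "ip" (the cell value).
# dropna then discards the cells that were empty for a given record.
long_df = (
    df.melt(id_vars=["Query_Type", "Query_name", "Response_ttl"], value_name="ip")
      .dropna(subset=["ip"])
)
print(long_df)
```

Each remaining row of long_df is one non-null value, which is what pivot_table then groups by Query_name, Query_Type and variable.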

Or, if you do not need to distinguish which source column each value came from, omit variable from the columns:

# Start again from the original (un-pivoted) df
df1 = (df.melt(id_vars=['Query_Type', 'Query_name', 'Response_ttl'], value_name='ip')
         .dropna(subset=['ip'])
         .pivot_table(index='Query_name',
                      columns='Query_Type',
                      aggfunc={'Response_ttl': 'mean',
                               'ip': lambda x: ', '.join(x.astype(str))})
         .sort_index(axis=1, level=1))
# Flatten the two-level column MultiIndex into plain strings
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.reset_index()
print(df1)
     Query_name  Response_ttl_A              ip_A  Response_ttl_MX      ip_MX  \
0  facebook.com           400.0           4.4.4.4              NaN        NaN   
1    google.com           500.0  1.1.1.1, 2.2.2.2            400.0  gmail.com   

   Response_ttl_NS                      ip_NS  
0            500.0  5.5.5.5, ns1.facebook.com  
1            500.0    3.3.3.3, ns1.google.com  
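
If column names closer to the ones in the question are wanted, one possible variation is to aggregate the TTLs and the values separately and then join the two tables. This is a sketch under assumptions, not part of the answer above: the Query_type_ prefix and the names long_df, ttl, vals and result are made up for illustration, and the sample frame uses None for the null cells.

```python
import pandas as pd

# Sample rows from the question; None stands in for the "null" cells (an assumption)
df = pd.DataFrame({
    "Query_Type": ["A", "A", "NS", "MX", "A", "NS"],
    "Query_name": ["google.com", "google.com", "google.com",
                   "google.com", "facebook.com", "facebook.com"],
    "Response_ttl": [400, 600, 500, 400, 400, 500],
    "ip4_address": ["1.1.1.1", "2.2.2.2", "3.3.3.3", None, "4.4.4.4", "5.5.5.5"],
    "NS_Name": [None, None, "ns1.google.com", None, None, "ns1.facebook.com"],
    "MX_Name": [None, None, None, "gmail.com", None, None],
})

# Mean TTL per (Query_name, Query_Type), taken from the original rows
ttl = df.pivot_table(index="Query_name", columns="Query_Type",
                     values="Response_ttl", aggfunc="mean")
ttl.columns = [f"Query_type_{t}_ttl" for t in ttl.columns]

# Comma-joined values per (Query_name, Query_Type, source column)
long_df = (
    df.melt(id_vars=["Query_Type", "Query_name", "Response_ttl"], value_name="value")
      .dropna(subset=["value"])
)
vals = long_df.pivot_table(index="Query_name", columns=["Query_Type", "variable"],
                           values="value",
                           aggfunc=lambda s: ", ".join(s.astype(str)))
vals.columns = [f"Query_type_{t}_{v}" for t, v in vals.columns]

# Both tables are indexed by Query_name, so a plain join lines them up
result = ttl.join(vals).reset_index()
print(result)
```

Taking the mean from the original frame rather than the melted one avoids counting a record twice when it has more than one non-null value, as the NS rows do here.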

huangapple
  • Posted on January 3, 2020, 21:01:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59579101.html