January 3, 2020, 21:01:15

Pivoting a DataFrame to convert certain rows into columns based on one specific column

Question

I have a DNS dataset whose records look like this:

    Query_Type  Query_name    Response_ttl  ip4_address  NS_Name           MX_Name
    A           google.com    400           1.1.1.1      null              null
    A           google.com    600           2.2.2.2      null              null
    NS          google.com    500           3.3.3.3      ns1.google.com    null
    MX          google.com    400           null         null              gmail.com
    A           facebook.com  400           4.4.4.4      null              null
    NS          facebook.com  500           5.5.5.5      ns1.facebook.com  null
    ...

I want the output table to have one row per Query_name, with the other values pivoted into separate columns for each query type, like this:

    Query_name    Query_type_A_ip   Query_type_A_ttl  Query_type_NS_Name  Query_type_NS_ttl  ...
    google.com    1.1.1.1,2.2.2.2   avg(400, 600)     ns1.google.com      500
    facebook.com  4.4.4.4           avg(TTLs of A)    ...

I know that pandas has a pivot function for this kind of reshaping, but I don't know how to apply it here. Please help.

Answer 1

Score: 3

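I think you need DataFrame.melt first to stack the address and name columns into one column, then DataFrame.dropna to remove the rows with missing values, then DataFrame.pivot_table to reshape, and finally map to flatten the resulting column MultiIndex: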

    # The result goes into a new name so the original df stays available.
    wide = (df.melt(['Query_Type', 'Query_name', 'Response_ttl'], value_name='ip')
              .dropna(subset=['ip'])  # drop the empty address/name cells
              .pivot_table(index='Query_name',
                           columns=['Query_Type', 'variable'],
                           aggfunc={'Response_ttl': 'mean',
                                    'ip': lambda x: ', '.join(x.astype(str))})
              .sort_index(axis=1, level=[1, 2]))
    # Flatten the three-level column MultiIndex into plain strings.
    wide.columns = wide.columns.map(lambda x: f'{x[0]}_{x[1]}_{x[2]}')
    wide = wide.reset_index()
    print(wide)

Output (from a sample frame whose name columns are called NS_Address and MX_Address rather than NS_Name and MX_Name):

         Query_name  Response_ttl_A_ip4_address  ip_A_ip4_address  \
    0  facebook.com                       400.0           4.4.4.4   
    1    google.com                       500.0  1.1.1.1, 2.2.2.2   
       Response_ttl_MX_MX_Address ip_MX_MX_Address  Response_ttl_NS_NS_Address  \
    0                         NaN              NaN                       500.0   
    1                       400.0        gmail.com                       500.0   
       ip_NS_NS_Address  Response_ttl_NS_ip4_address ip_NS_ip4_address  
    0  ns1.facebook.com                        500.0           5.5.5.5  
    1    ns1.google.com                        500.0           3.3.3.3
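
To make the first two steps easier to follow, here is a minimal sketch that rebuilds the sample data from the question and stops after melt and dropna. The frame construction and the name `long_df` are my own assumptions for illustration, and `None` stands in for the "null" cells; if your data holds the literal string "null", it would need converting to NaN first.

```python
import pandas as pd

# Sample frame rebuilt from the question; None stands in for the "null" cells.
df = pd.DataFrame({
    'Query_Type':   ['A', 'A', 'NS', 'MX', 'A', 'NS'],
    'Query_name':   ['google.com', 'google.com', 'google.com',
                     'google.com', 'facebook.com', 'facebook.com'],
    'Response_ttl': [400, 600, 500, 400, 400, 500],
    'ip4_address':  ['1.1.1.1', '2.2.2.2', '3.3.3.3', None, '4.4.4.4', '5.5.5.5'],
    'NS_Name':      [None, None, 'ns1.google.com', None, None, 'ns1.facebook.com'],
    'MX_Name':      [None, None, None, 'gmail.com', None, None],
})

# melt stacks ip4_address, NS_Name and MX_Name into one 'ip' column and
# records the source column name in 'variable'; dropna removes empty cells.
long_df = (df.melt(['Query_Type', 'Query_name', 'Response_ttl'], value_name='ip')
             .dropna(subset=['ip']))
print(long_df)
```

Each remaining row is one non-empty value together with its query type, query name and TTL, which is the shape pivot_table needs to build one column per (Query_Type, variable) pair.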

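Or, if it doesn't matter which address column a value came from, leave `variable` out of the columns; all values for a query type are then joined into a single column: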

    # df is the original (un-pivoted) frame here.
    df1 = (df.melt(['Query_Type', 'Query_name', 'Response_ttl'], value_name='ip')
             .dropna(subset=['ip'])
             .pivot_table(index='Query_name',
                          columns='Query_Type',
                          aggfunc={'Response_ttl': 'mean',
                                   'ip': lambda x: ', '.join(x.astype(str))})
             .sort_index(axis=1, level=1))
    # Flatten the two-level column MultiIndex into plain strings.
    df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
    df1 = df1.reset_index()
    print(df1)

         Query_name  Response_ttl_A              ip_A  Response_ttl_MX      ip_MX  \
    0  facebook.com           400.0           4.4.4.4              NaN        NaN   
    1    google.com           500.0  1.1.1.1, 2.2.2.2            400.0  gmail.com   
       Response_ttl_NS                      ip_NS  
    0            500.0  5.5.5.5, ns1.facebook.com  
    1            500.0    3.3.3.3, ns1.google.com
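
If you would rather have column names close to the ones in the question (Query_type_A_ip, Query_type_A_ttl, ...), one possible variation is groupby with named aggregation followed by unstack. This is a different technique from the melt/pivot_table approach above, shown only as a sketch: the sample frame, the name `summary` and the exact column naming are assumptions, and it covers only the IP and TTL columns.

```python
import pandas as pd

# Sample frame rebuilt from the question (IP and TTL columns only).
df = pd.DataFrame({
    'Query_Type':   ['A', 'A', 'NS', 'MX', 'A', 'NS'],
    'Query_name':   ['google.com', 'google.com', 'google.com',
                     'google.com', 'facebook.com', 'facebook.com'],
    'Response_ttl': [400, 600, 500, 400, 400, 500],
    'ip4_address':  ['1.1.1.1', '2.2.2.2', '3.3.3.3', None, '4.4.4.4', '5.5.5.5'],
})

# One row per (Query_name, Query_Type): join the IPs, average the TTLs,
# then move Query_Type from the row index into the columns.
summary = (df.groupby(['Query_name', 'Query_Type'])
             .agg(ip=('ip4_address', lambda s: ','.join(s.dropna())),
                  ttl=('Response_ttl', 'mean'))
             .unstack('Query_Type'))

# Columns are (field, query type) tuples; flatten them to Query_type_<type>_<field>.
summary.columns = [f'Query_type_{qtype}_{field}' for field, qtype in summary.columns]
summary = summary.reset_index()
print(summary)
```

Query types that never occur for a name (MX for facebook.com here) come out as NaN, the same as in the pivot_table results.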
