Pivoting a dataframe to convert certain rows into columns based on one specific column

Question

I have a DNS dataset containing records that look like this:

  Query_Type  Query_name    Response_ttl  ip4_address  NS_Name           MX_Name
  A           google.com    400           1.1.1.1      null              null
  A           google.com    600           2.2.2.2      null              null
  NS          google.com    500           3.3.3.3      ns1.google.com    null
  MX          google.com    400           null         null              gmail.com
  A           facebook.com  400           4.4.4.4      null              null
  NS          facebook.com  500           5.5.5.5      ns1.facebook.com  null
  .
  .

I want the output table to merge all records by Query_name, with the values of the other columns spread into separate columns per query type, like this:

  Query_name    Query_type_A_ip  Query_type_A_ttl  Query_type_NS_Name  Query_type_NS_ttl  ...
  google.com    1.1.1.1,2.2.2.2  avg(400, 600)     ns1.google.com      400
  facebook.com  4.4.4.4          avg(ttls of A)    ....

I know that the pandas library has a pivot function for this kind of thing, but I don't know how to use it here. Please help.

Answer 1

Score: 3

I think you need DataFrame.melt first, then remove the rows with missing values using DataFrame.dropna, reshape with DataFrame.pivot_table, and finally flatten the MultiIndex columns with map:

  # Melt the address columns into long form, drop empty values, then pivot.
  df = (df.melt(['Query_Type', 'Query_name', 'Response_ttl'], value_name='ip')
          .dropna(subset=['ip'])
          .pivot_table(index='Query_name',
                       columns=['Query_Type', 'variable'],
                       aggfunc={'Response_ttl': 'mean',
                                'ip': lambda x: ', '.join(x.astype(str))})
          .sort_index(axis=1, level=[1, 2]))
  # Flatten the three-level column MultiIndex into single strings.
  df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}_{x[2]}')
  df = df.reset_index()
  print(df)

       Query_name  Response_ttl_A_ip4_address  ip_A_ip4_address  \
  0  facebook.com                       400.0           4.4.4.4
  1    google.com                       500.0  1.1.1.1, 2.2.2.2

     Response_ttl_MX_MX_Address ip_MX_MX_Address  Response_ttl_NS_NS_Address  \
  0                         NaN              NaN                       500.0
  1                       400.0        gmail.com                       500.0

     ip_NS_NS_Address  Response_ttl_NS_ip4_address ip_NS_ip4_address
  0  ns1.facebook.com                        500.0           5.5.5.5
  1    ns1.google.com                        500.0           3.3.3.3

Note that the labels in this output read NS_Address and MX_Address, so the frame used for this run apparently named those columns differently from the question's NS_Name and MX_Name.
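
To try the steps above end to end, here is a minimal sketch that rebuilds the question's six sample rows by hand. It assumes the missing entries are real missing values (None/NaN) rather than the literal text "null", since dropna only removes the former; the names long_df and wide are made up for this example.

```python
import pandas as pd

# Sample rows copied from the question; None marks a missing value.
df = pd.DataFrame({
    'Query_Type':   ['A', 'A', 'NS', 'MX', 'A', 'NS'],
    'Query_name':   ['google.com', 'google.com', 'google.com',
                     'google.com', 'facebook.com', 'facebook.com'],
    'Response_ttl': [400, 600, 500, 400, 400, 500],
    'ip4_address':  ['1.1.1.1', '2.2.2.2', '3.3.3.3', None, '4.4.4.4', '5.5.5.5'],
    'NS_Name':      [None, None, 'ns1.google.com', None, None, 'ns1.facebook.com'],
    'MX_Name':      [None, None, None, 'gmail.com', None, None],
})

# Step 1: long form. Each address/name column becomes a row labelled by
# 'variable'; rows whose value is missing are dropped.
long_df = (df.melt(id_vars=['Query_Type', 'Query_name', 'Response_ttl'],
                   value_name='ip')
             .dropna(subset=['ip']))
print(long_df)

# Step 2: wide form. One row per Query_name, one column per
# (aggregated field, Query_Type, source column) combination.
wide = (long_df.pivot_table(index='Query_name',
                            columns=['Query_Type', 'variable'],
                            aggfunc={'Response_ttl': 'mean',
                                     'ip': lambda x: ', '.join(x.astype(str))})
               .sort_index(axis=1, level=[1, 2]))

# Step 3: flatten the three-level column index into plain strings.
wide.columns = wide.columns.map(lambda x: f'{x[0]}_{x[1]}_{x[2]}')
wide = wide.reset_index()
print(wide)
```

Printing long_df before pivoting is a handy way to check that only the populated address values survive the dropna step.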

Or, if it does not matter which address column a value came from, leave the variable level out of the pivot:

  # Same reshape, but pivot on Query_Type only.
  df1 = (df.melt(['Query_Type', 'Query_name', 'Response_ttl'], value_name='ip')
           .dropna(subset=['ip'])
           .pivot_table(index='Query_name',
                        columns='Query_Type',
                        aggfunc={'Response_ttl': 'mean',
                                 'ip': lambda x: ', '.join(x.astype(str))})
           .sort_index(axis=1, level=1))
  # Flatten the two-level column MultiIndex into single strings.
  df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
  df1 = df1.reset_index()
  print(df1)

       Query_name  Response_ttl_A              ip_A  Response_ttl_MX      ip_MX  \
  0  facebook.com           400.0           4.4.4.4              NaN        NaN
  1    google.com           500.0  1.1.1.1, 2.2.2.2            400.0  gmail.com

     Response_ttl_NS                      ip_NS
  0            500.0  5.5.5.5, ns1.facebook.com
  1            500.0    3.3.3.3, ns1.google.com
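
The question asked for headers such as Query_type_A_ip and Query_type_A_ttl, which differ from the labels produced above only in how the column tuples are joined. The sketch below shows one way to get that naming from the two-level pivot. The suffix mapping is a hypothetical choice based on the question's header row, and the sample frame is rebuilt by hand with None for missing values.

```python
import pandas as pd

# Sample rows copied from the question; None marks a missing value.
df = pd.DataFrame({
    'Query_Type':   ['A', 'A', 'NS', 'MX', 'A', 'NS'],
    'Query_name':   ['google.com', 'google.com', 'google.com',
                     'google.com', 'facebook.com', 'facebook.com'],
    'Response_ttl': [400, 600, 500, 400, 400, 500],
    'ip4_address':  ['1.1.1.1', '2.2.2.2', '3.3.3.3', None, '4.4.4.4', '5.5.5.5'],
    'NS_Name':      [None, None, 'ns1.google.com', None, None, 'ns1.facebook.com'],
    'MX_Name':      [None, None, None, 'gmail.com', None, None],
})

wide = (df.melt(['Query_Type', 'Query_name', 'Response_ttl'], value_name='ip')
          .dropna(subset=['ip'])
          .pivot_table(index='Query_name',
                       columns='Query_Type',
                       aggfunc={'Response_ttl': 'mean',
                                'ip': lambda x: ', '.join(x.astype(str))}))

# Hypothetical short suffixes, chosen to resemble the question's headers.
suffix = {'Response_ttl': 'ttl', 'ip': 'ip'}

# Each column is a (field, query_type) tuple; put the query type first.
wide.columns = [f'Query_type_{qtype}_{suffix[field]}'
                for field, qtype in wide.columns]
wide = wide.reset_index()
print(wide.columns.tolist())
```

With this variant the NS column still holds both the IP and the name server in one string; keeping them apart requires the three-level pivot that includes variable.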

huangapple
  • Posted on January 3, 2020, 21:01:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59579101.html