Pivoting a DataFrame to convert certain rows into columns based on one specific column
Question
I have a DNS dataset whose records look like this:

    Query_Type  Query_name    Response_ttl  ip4_address  NS_Name           MX_Name
    A           google.com    400           1.1.1.1      null              null
    A           google.com    600           2.2.2.2      null              null
    NS          google.com    500           3.3.3.3      ns1.google.com    null
    MX          google.com    400           null         null              gmail.com
    A           facebook.com  400           4.4.4.4      null              null
    NS          facebook.com  500           5.5.5.5      ns1.facebook.com  null
    ...

I want the output table to merge all records by Query_name, with the remaining values spread across columns per query type, like this:

    Query_name    Query_type_A_ip  Query_type_A_ttl  Query_type_NS_Name  Query_type_NS_ttl  ...
    google.com    1.1.1.1,2.2.2.2  avg(400, 600)     ns1.google.com      500
    facebook.com  4.4.4.4          avg(TTLs of A)    ...

I know that the pandas library has a pivot function for this kind of reshaping, but I don't know how to use it here. Please help.
Answer 1
Score: 3
I think you need DataFrame.melt first, then remove the missing rows with DataFrame.dropna, reshape with DataFrame.pivot_table, and finally flatten the MultiIndex in the columns with map:
    # melt the address columns into long form, drop the empty cells, then pivot
    df = (df.melt(['Query_Type', 'Query_name', 'Response_ttl'], value_name='ip')
            .dropna(subset=['ip'])
            .pivot_table(index='Query_name',
                         columns=['Query_Type', 'variable'],
                         aggfunc={'Response_ttl': 'mean',
                                  'ip': lambda x: ', '.join(x.astype(str))})
            .sort_index(axis=1, level=[1, 2]))
    # flatten the 3-level column MultiIndex into plain strings
    df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}_{x[2]}')
    df = df.reset_index()
    print(df)
         Query_name  Response_ttl_A_ip4_address  ip_A_ip4_address  \
    0  facebook.com                       400.0           4.4.4.4
    1    google.com                       500.0  1.1.1.1, 2.2.2.2

       Response_ttl_MX_MX_Name ip_MX_MX_Name  Response_ttl_NS_NS_Name  \
    0                      NaN           NaN                    500.0
    1                    400.0     gmail.com                    500.0

          ip_NS_NS_Name  Response_ttl_NS_ip4_address ip_NS_ip4_address
    0  ns1.facebook.com                        500.0           5.5.5.5
    1    ns1.google.com                        500.0           3.3.3.3
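For anyone who wants to reproduce the steps above, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample rows in the question, with None standing in for the null cells; that is an assumption about how the data was loaded. If your data holds the literal string "null" instead, dropna will not treat it as missing and it would need converting first. The names long_df and wide are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question; None stands in for the "null" cells
# (an assumption about how the data was loaded).
df = pd.DataFrame({
    "Query_Type": ["A", "A", "NS", "MX", "A", "NS"],
    "Query_name": ["google.com", "google.com", "google.com",
                   "google.com", "facebook.com", "facebook.com"],
    "Response_ttl": [400, 600, 500, 400, 400, 500],
    "ip4_address": ["1.1.1.1", "2.2.2.2", "3.3.3.3", None, "4.4.4.4", "5.5.5.5"],
    "NS_Name": [None, None, "ns1.google.com", None, None, "ns1.facebook.com"],
    "MX_Name": [None, None, None, "gmail.com", None, None],
})

# Step 1: melt the three address/name columns into long form.
# The original column name goes to "variable", its value to "ip".
long_df = df.melt(["Query_Type", "Query_name", "Response_ttl"], value_name="ip")

# Step 2: drop the rows whose melted value is missing.
long_df = long_df.dropna(subset=["ip"])

# Step 3: pivot back to wide form, averaging the TTLs and joining the values.
wide = long_df.pivot_table(
    index="Query_name",
    columns=["Query_Type", "variable"],
    aggfunc={"Response_ttl": "mean",
             "ip": lambda x: ", ".join(x.astype(str))},
).sort_index(axis=1, level=[1, 2])

# Step 4: flatten the 3-level column MultiIndex into plain strings.
wide.columns = wide.columns.map(lambda x: f"{x[0]}_{x[1]}_{x[2]}")
wide = wide.reset_index()

print(long_df)
print(wide.columns.tolist())
```

Printing long_df before the pivot is a handy way to see what melt and dropna leave behind: one row per non-missing value, tagged with the column it came from.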
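Or, if it does not matter which address column a value came from, leave variable out of columns so that all values for a query type are joined together (run this on the original df, since the first snippet reassigns it):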
    # same melt/dropna, but pivot on Query_Type only
    df1 = (df.melt(['Query_Type', 'Query_name', 'Response_ttl'], value_name='ip')
             .dropna(subset=['ip'])
             .pivot_table(index='Query_name',
                          columns='Query_Type',
                          aggfunc={'Response_ttl': 'mean',
                                   'ip': lambda x: ', '.join(x.astype(str))})
             .sort_index(axis=1, level=1))
    # flatten the 2-level column MultiIndex into plain strings
    df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
    df1 = df1.reset_index()
    print(df1)
         Query_name  Response_ttl_A              ip_A  Response_ttl_MX      ip_MX  \
    0  facebook.com           400.0           4.4.4.4              NaN        NaN
    1    google.com           500.0  1.1.1.1, 2.2.2.2            400.0  gmail.com

       Response_ttl_NS                      ip_NS
    0            500.0  5.5.5.5, ns1.facebook.com
    1            500.0    3.3.3.3, ns1.google.com
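The flattening step can also produce names closer to the ones asked for in the question. The sketch below only illustrates the map call on a hand-built MultiIndex that stands in for the pivot_table columns; the suffix mapping and the Query_type_ prefix are hypothetical choices, not something the answer above prescribes.

```python
import pandas as pd

# Hypothetical stand-in for the 2-level columns produced by pivot_table:
# level 0 is the aggregated value name, level 1 is the Query_Type.
cols = pd.MultiIndex.from_tuples([
    ("Response_ttl", "A"), ("ip", "A"),
    ("Response_ttl", "NS"), ("ip", "NS"),
])

# Hypothetical mapping from value name to the suffix used in the question.
suffix = {"Response_ttl": "ttl", "ip": "ip"}

# map receives each column as a tuple and returns the flattened name.
flat = cols.map(lambda x: f"Query_type_{x[1]}_{suffix[x[0]]}")
print(list(flat))
```

Assigning the result back with df1.columns = ... before reset_index would give the reshaped frame these names instead of Response_ttl_A, ip_A and so on.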