Optimal way to take a network of nodes and interpolate missing values
Question
I have an example data frame below:
import numpy as np
import pandas as pd

data = [
    [1, 2, 100, 4342],
    [3, 4, 100, 999],
    [5, 6, 500, 4339],
    [4, 5, 300, 999],
    [12, 13, 100, 4390],
    [6, 7, 600, 4335],
    [2, 3, 200, 4341],
    [10, 11, 100, 4400],
    [11, 12, 200, 999],
    [7, 8, 200, 4332]
]
df = pd.DataFrame(data, columns=['Node', 'Dwn_Node', 'Dwn_Length', 'Elevation'])
# 999 is a placeholder for a missing elevation
df = df.replace(999, np.nan)
Here the Node column holds the name of the current node and Dwn_Node holds the name of the node immediately downstream. Elevation is the elevation of the current node and Dwn_Length is the length to the downstream node. I am not sure of the best way to do this, but the goal is to interpolate the missing elevations using slope. I suspect networkx may have a function or some better capability for this, but I am very unfamiliar with that library.
The data above is only an example, but it is realistic in that the rows are not in node order.
One way I thought of would be to separate out the nodes immediately before and after each unknown node, i.e.
data1 = [
    [12, 13, 100, 4390],
    [10, 11, 100, 4400],
    [11, 12, 200, 999]
]
I would then calculate the slope from data1 by dividing the difference in elevation between nodes 10 and 12 by the sum of Dwn_Length for nodes 10 and 11, and apply that slope over the Dwn_Length of node 10 to interpolate the elevation of node 11. This seems very tedious, though, for a data set with many groups of missing node values within a network.
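Worked through for the three rows in data1, that hand calculation looks roughly like the sketch below. The variable names are only illustrative, and it assumes Dwn_Length is the distance from a node to its downstream node.

```python
# Known elevations at the two ends of the gap (from data1)
elev_10, elev_12 = 4400.0, 4390.0
# Dwn_Length of node 10 (10 -> 11) and of node 11 (11 -> 12)
len_10_11, len_11_12 = 100.0, 200.0

# Slope: elevation drop per unit length between nodes 10 and 12
slope = (elev_10 - elev_12) / (len_10_11 + len_11_12)

# Walk that slope down from node 10 over its own Dwn_Length
elev_11 = elev_10 - slope * len_10_11
print(round(elev_11, 2))  # 4396.67
```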
Answer 1
Score: 2
You can probably speed this up, but it should work:
import networkx as nx
import numpy as np

# Create network from your dataframe
G = nx.from_pandas_edgelist(df, source='Node', target='Dwn_Node',
                            edge_attr='Dwn_Length', create_using=nx.DiGraph)
# nx.set_node_attributes(G, df.set_index('Node')[['Elevation']].to_dict())

# For each subgraph (each separate network of connected nodes)
for nbunch in nx.connected_components(G.to_undirected()):
    H = nx.subgraph(G, nbunch)
    # Roots have no upstream node, leaves have no downstream node
    roots = [n for n, d in H.in_degree if d == 0]
    leaves = [n for n, d in H.out_degree if d == 0]
    for root in roots:
        for leaf in leaves:
            for path in nx.all_simple_paths(H, root, leaf):
                # Extract the rows on this path and sort them in path order
                # (a position lookup also works when node names are not ascending)
                order = {node: i for i, node in enumerate(path)}
                df1 = df[df['Node'].isin(path)].sort_values('Node', key=lambda s: s.map(order))
                df1['Distance'] = df1['Dwn_Length'].cumsum()
                # Piecewise linear interpolation
                m = df1['Elevation'].isna()
                x = df1.loc[m, 'Distance']
                xp = df1.loc[~m, 'Distance']
                yp = df1.loc[~m, 'Elevation']
                y = np.interp(x, xp, yp)
                df1.loc[m, 'Elevation'] = y
                # Update missing values
                df['Elevation'] = df['Elevation'].fillna(df1['Elevation'])
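To see what the loops iterate over, here is a small self-contained sketch that rebuilds the graph from the example edges and collects the root-to-leaf paths. It hard-codes the rows from the question, and the names edges and paths are only illustrative.

```python
import networkx as nx
import pandas as pd

# (Node, Dwn_Node, Dwn_Length) rows from the example data
edges = pd.DataFrame(
    [[1, 2, 100], [3, 4, 100], [5, 6, 500], [4, 5, 300], [12, 13, 100],
     [6, 7, 600], [2, 3, 200], [10, 11, 100], [11, 12, 200], [7, 8, 200]],
    columns=['Node', 'Dwn_Node', 'Dwn_Length'])

G = nx.from_pandas_edgelist(edges, source='Node', target='Dwn_Node',
                            edge_attr='Dwn_Length', create_using=nx.DiGraph)

paths = []
# Each weakly connected piece of the graph is one separate network
for nbunch in nx.connected_components(G.to_undirected()):
    H = G.subgraph(nbunch)
    roots = [n for n, d in H.in_degree if d == 0]    # no upstream node
    leaves = [n for n, d in H.out_degree if d == 0]  # no downstream node
    for root in roots:
        for leaf in leaves:
            paths.extend(nx.all_simple_paths(H, root, leaf))

for path in sorted(paths):
    print(path)
```

For the example data this gives two paths, one running from node 1 to node 8 and one from node 10 to node 13, so the two networks are interpolated independently of each other.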
IIUC, I'm not sure you really need networkx for this task. The missing values have to be interpolated, but not with an evenly spaced linear fill, because the distance between consecutive points varies. You need piecewise linear interpolation over distance, which numpy provides through the interp function. First, though, you have to modify your input dataframe: sort it by node (in this example the node numbers increase downstream) and compute the cumulative sum of Dwn_Length. After that you have all the x (Distance) and y (Elevation) values needed to interpolate the missing values:
import matplotlib.pyplot as plt
import numpy as np

# Preparation: sort by node and accumulate the lengths
df1 = df.sort_values('Node').assign(Distance=lambda x: x['Dwn_Length'].cumsum())

# Piecewise linear interpolation
m = df1['Elevation'].isna()  # mask of rows with a missing elevation
x = df1.loc[m, 'Distance']
xp = df1.loc[~m, 'Distance']
yp = df1.loc[~m, 'Elevation']
y = np.interp(x, xp, yp)
df1.loc[m, 'Elevation'] = y

# Visualization
df1.plot(x='Distance', y='Elevation', ylabel='Elevation', marker='o', legend=False)
plt.show()
Output:
>>> df1
   Node  Dwn_Node  Dwn_Length    Elevation  Distance
0     1         2         100  4342.000000       100
6     2         3         200  4341.000000       300
1     3         4         100  4340.777778       400
3     4         5         300  4340.111111       700
2     5         6         500  4339.000000      1200
5     6         7         600  4335.000000      1800
9     7         8         200  4332.000000      2000
7    10        11         100  4400.000000      2100
8    11        12         200  4393.333333      2300
4    12        13         100  4390.000000      2400
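Note that Distance here is the running total including each row's own Dwn_Length, so it marks where the row's downstream node sits rather than the node itself. If Dwn_Length is read as the distance from Node to Dwn_Node, as the question describes, a node's own position would be the running total before its length is added. The sketch below is not part of the original answer. It compares the two conventions on the 10 -> 11 -> 12 chain with np.interp, and the variable names are only illustrative.

```python
import numpy as np

# Chain 10 -> 11 -> 12 from the example, in downstream order
dwn_length = np.array([100.0, 200.0, 100.0])    # Dwn_Length of nodes 10, 11, 12
elevation = np.array([4400.0, np.nan, 4390.0])  # node 11 is missing

known = ~np.isnan(elevation)

# Convention used above: cumulative sum including the node's own length
dist_incl = dwn_length.cumsum()                 # [100, 300, 400]
elev_incl = np.interp(dist_incl[~known], dist_incl[known], elevation[known])

# Shifted convention: position of the node itself, measured from node 10
dist_node = dist_incl - dwn_length              # [0, 100, 300]
elev_node = np.interp(dist_node[~known], dist_node[known], elevation[known])

print(elev_incl.round(2), elev_node.round(2))   # [4393.33] [4396.67]
```

The shifted convention gives about 4396.67 for node 11, which is the value the slope calculation described in the question produces, so it is worth checking which reading of Dwn_Length applies to your data.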
Since the index is unchanged from df to df1, you can fill the missing values in df from df1:
df['Elevation'] = df['Elevation'].fillna(df1['Elevation'])
Output:
>>> df
   Node  Dwn_Node  Dwn_Length    Elevation
0     1         2         100  4342.000000
1     3         4         100  4340.777778
2     5         6         500  4339.000000
3     4         5         300  4340.111111
4    12        13         100  4390.000000
5     6         7         600  4335.000000
6     2         3         200  4341.000000
7    10        11         100  4400.000000
8    11        12         200  4393.333333
9     7         8         200  4332.000000
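The fill works even though df1 is sorted differently from df because Series.fillna, when given another Series, matches values by index label rather than by row position. A tiny sketch with made-up values (the names original and filled_sorted are only for illustration):

```python
import numpy as np
import pandas as pd

# Original order, with one missing value at index label 1
original = pd.Series([10.0, np.nan, 30.0], index=[0, 1, 2])

# Same labels in a different row order, with the gap filled in
filled_sorted = pd.Series([30.0, 10.0, 20.0], index=[2, 0, 1])

# fillna aligns on the index, so label 1 receives 20.0
result = original.fillna(filled_sorted)
print(result.tolist())  # [10.0, 20.0, 30.0]
```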