Optimal way to take a network of nodes and interpolate missing values

Question

I have an example data frame below:

import numpy as np
import pandas as pd

data = [
    [1, 2, 100, 4342],
    [3, 4, 100, 999],
    [5, 6, 500, 4339],
    [4, 5, 300, 999],
    [12, 13, 100, 4390],
    [6, 7, 600, 4335],
    [2, 3, 200, 4341],
    [10, 11, 100, 4400],
    [11, 12, 200, 999],
    [7, 8, 200, 4332]
]
df = pd.DataFrame(data, columns=['Node', 'Dwn_Node', 'Dwn_Length', 'Elevation'])
df = df.replace(999, np.nan)  # 999 marks a missing elevation

The Node column holds the name of the current node and Dwn_Node holds the name of the downstream node. Elevation is the elevation of the current node and Dwn_Length is the length to the downstream node. I am not sure of the best way to do this, but the goal is to interpolate the missing elevations using slope. I suspect networkx has a function or better capability for this, but I am very unfamiliar with that library.

The above is an example data set, but it is realistic in that the rows are not in node order.

One approach I thought of is to separate out the nodes immediately before and after each unknown node, i.e.:

data1 = [
    [12, 13, 100, 4390],
    [10,11, 100, 4400],
    [11,12, 200, 999]
]

Calculate the slope from data1 by dividing the difference in elevation between nodes 10 and 12 by the sum of the Dwn_Length values of nodes 10 and 11, then apply that slope over the Dwn_Length of node 10 to interpolate the elevation of node 11. This seems very tedious, though, for a data set that has many sets of missing node values within a network.
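For concreteness, the manual calculation described above can be sketched as follows for nodes 10, 11 and 12. This is only an illustration of the arithmetic: the variable names are made up for the example, and it assumes Dwn_Length is the distance from a node to its downstream node.

```python
# Known elevations at the two bracketing nodes (values from the example data)
elev_10 = 4400.0
elev_12 = 4390.0

# Dwn_Length of node 10 (10 -> 11) and of node 11 (11 -> 12)
len_10_11 = 100.0
len_11_12 = 200.0

# Slope = elevation difference over the total length between the known nodes
slope = (elev_12 - elev_10) / (len_10_11 + len_11_12)

# Node 11 lies len_10_11 downstream of node 10
elev_11 = elev_10 + slope * len_10_11
print(round(elev_11, 2))  # 4396.67
```

Repeating this by hand for every gap is the tedious part, which is what the question is trying to avoid.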

Answer 1

Score: 2


You can probably improve the speed of this process, but this should work:

import networkx as nx
import numpy as np

# Create network from your dataframe
G = nx.from_pandas_edgelist(df, source='Node', target='Dwn_Node',
                            edge_attr='Dwn_Length', create_using=nx.DiGraph)
# nx.set_node_attributes(G, df.set_index('Node')[['Elevation']].to_dict())

# For each subgraph
for nbunch in nx.connected_components(G.to_undirected()):
    H = nx.subgraph(G, nbunch)
    roots = [n for n, d in H.in_degree if d == 0]
    leaves = [n for n, d in H.out_degree if d == 0]

    for root in roots:
        for leaf in leaves:
            for path in nx.all_simple_paths(H, root, leaf):
    
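                # Extract the rows on this path and sort them in path order
                # (a position lookup, so node names need not be ascending)
                order = {node: i for i, node in enumerate(path)}
                df1 = df[df['Node'].isin(path)].sort_values(
                    'Node', key=lambda s: s.map(order))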
                df1['Distance'] = df1['Dwn_Length'].cumsum()

                # Piecewise linear interpolation
                m = df1['Elevation'].isna()
                x = df1.loc[m, 'Distance']
                xp = df1.loc[~m, 'Distance']
                yp = df1.loc[~m, 'Elevation']
                y = np.interp(x, xp, yp)
                df1.loc[m, 'Elevation'] = y
    
                # Update missing values
                df['Elevation'] = df['Elevation'].fillna(df1['Elevation'])
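To see what the loops above iterate over, here is a small sketch that only lists the root-to-leaf paths for the example edges. It uses nx.weakly_connected_components as a stand-in for connected_components on the undirected copy; the edges and paths names are made up for illustration.

```python
import networkx as nx
import pandas as pd

# Edge list from the example data: (Node, Dwn_Node, Dwn_Length)
edges = pd.DataFrame(
    [[1, 2, 100], [3, 4, 100], [5, 6, 500], [4, 5, 300], [12, 13, 100],
     [6, 7, 600], [2, 3, 200], [10, 11, 100], [11, 12, 200], [7, 8, 200]],
    columns=['Node', 'Dwn_Node', 'Dwn_Length'])

G = nx.from_pandas_edgelist(edges, source='Node', target='Dwn_Node',
                            edge_attr='Dwn_Length', create_using=nx.DiGraph)

paths = []
for nbunch in nx.weakly_connected_components(G):
    H = G.subgraph(nbunch)
    roots = [n for n, d in H.in_degree if d == 0]    # no upstream node
    leaves = [n for n, d in H.out_degree if d == 0]  # no downstream node
    for root in roots:
        for leaf in leaves:
            paths.extend(nx.all_simple_paths(H, root, leaf))

# Convert node labels to plain ints so the printout is stable
for path in sorted([int(n) for n in p] for p in paths):
    print(path)
# [1, 2, 3, 4, 5, 6, 7, 8]
# [10, 11, 12, 13]
```

Each path is one chain of nodes in downstream order, which is the order the interpolation needs regardless of how the rows are arranged in the dataframe.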

IIUC, I'm not really sure you need networkx for this task. Interpolation is required to fill the missing values, but not plain index-based linear interpolation, because the points are not evenly spaced. You have to use piecewise linear interpolation over distance, and NumPy provides the interp function for that. However, you have to modify your input dataframe first: sort it by node (which works when sorting by Node reproduces the downstream order, as in the example) and compute the cumulative sum of Dwn_Length. After that you have all the x (Distance) and y (Elevation) values needed to interpolate the missing values:

import matplotlib.pyplot as plt
import numpy as np

# Preparation
df1 = df.sort_values('Node').assign(Distance=lambda x: x['Dwn_Length'].cumsum())

# Piecewise linear interpolation
m = df1['Elevation'].isna()
x = df1.loc[m, 'Distance']
xp = df1.loc[~m, 'Distance']
yp = df1.loc[~m, 'Elevation']
y = np.interp(x, xp, yp)

# Visualization
df1.loc[m, 'Elevation'] = y
df1.plot(x='Distance', y='Elevation', ylabel='Elevation', marker='o', legend=False)
plt.show()

Output:

>>> df1
   Node  Dwn_Node  Dwn_Length    Elevation  Distance
0     1         2         100  4342.000000       100
6     2         3         200  4341.000000       300
1     3         4         100  4340.777778       400
3     4         5         300  4340.111111       700
2     5         6         500  4339.000000      1200
5     6         7         600  4335.000000      1800
9     7         8         200  4332.000000      2000
7    10        11         100  4400.000000      2100
8    11        12         200  4393.333333      2300
4    12        13         100  4390.000000      2400

Because the index is unchanged between df and df1, you can fill the missing values in df from df1:

df['Elevation'] = df['Elevation'].fillna(df1['Elevation'])

Output:

>>> df
   Node  Dwn_Node  Dwn_Length    Elevation
0     1         2         100  4342.000000
1     3         4         100  4340.777778
2     5         6         500  4339.000000
3     4         5         300  4340.111111
4    12        13         100  4390.000000
5     6         7         600  4335.000000
6     2         3         200  4341.000000
7    10        11         100  4400.000000
8    11        12         200  4393.333333
9     7         8         200  4332.000000
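One thing worth checking: in the code above, Distance is the cumulative sum of Dwn_Length including the current row, so each node is placed at the far end of its own downstream segment. If Dwn_Length is read as the distance from a node to its downstream node, as the question describes, the node's own position would be the cumulative sum of the preceding rows instead. A minimal sketch of that variant under that assumption, using one chain of the example (the chain frame and the Position column are made up for illustration):

```python
import numpy as np
import pandas as pd

# One chain from the example, already in downstream order
chain = pd.DataFrame(
    [[10, 11, 100, 4400.0], [11, 12, 200, np.nan], [12, 13, 100, 4390.0]],
    columns=['Node', 'Dwn_Node', 'Dwn_Length', 'Elevation'])

# Position of each node = total length of the segments upstream of it
# (assumption: Dwn_Length is measured from the node to its downstream node)
chain['Position'] = chain['Dwn_Length'].cumsum().shift(fill_value=0)

# Piecewise linear interpolation over Position instead of Distance
m = chain['Elevation'].isna()
chain.loc[m, 'Elevation'] = np.interp(chain.loc[m, 'Position'],
                                      chain.loc[~m, 'Position'],
                                      chain.loc[~m, 'Elevation'])
print(chain[['Node', 'Position', 'Elevation']])
```

With positions 0, 100 and 300, node 11 comes out at about 4396.67, which matches the manual slope calculation in the question, rather than the 4393.33 obtained with the unshifted cumulative sum.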

huangapple
  • Posted on June 1, 2023 at 11:33:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76378524.html