英文:
Is it possible to speed up this pandas data extraction?
问题
X = []
Y = []
for i in cell_Data['ID']:
    A = face_Data.query(f"Cell_0 == {i} or Cell_1 == {i}")
    X.append(np.average(A['X']))
    Y.append(np.average(A['Y']))
cell_Data['X'] = X
cell_Data['Y'] = Y
print("Cell Coordinates Obtained")
face_Data是一个包含两列Cell_0和Cell_1的数据框,它们都是每个面界定的细胞的ID。face_Data还有它自己的X和Y坐标列。我的目标是从中计算细胞的坐标。由于查询值本身每次都在变化,所以我无法直接进行矢量化。
cell_Data['ID'] = list(1, 2, .... N)
我可以获取细胞坐标,但循环遍历每个记录需要很长时间,因为face_Data的大小通常是5或6的数量级。
示例数据如下:
face_data = pd.DataFrame({
             'ID' = [1, 2, 3, 4, 5, 6],
             'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
             'Y': [1, 1, 1, 2, 2, 2],
             'Cell_0' = [1, 2, 3, 5, 6, 7],
             'Cell_1' = [2, 3, 4, 6, 7, 8]
            })
cell_data = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8 ...], columns = ['ID'])
数据网格如下所示:
(5) | (6) | (7) | (8) 
    4-----5-----6-----
(1) | (2) | (3) | (4)
    1-----2-----3-----
(cell_ID)显示在括号中,face_ID显示在垂直线下方。只是一个二维笛卡尔坐标系。
<details>
<summary>英文:</summary>
```python
X = []
Y = []
for i in cell_Data['ID']:
    A = face_Data.query(f"Cell_0 == {i} or Cell_1 == {i}")
    X.append(np.average(A['X']))
    Y.append(np.average(A['Y']))
cell_Data['X'] = X
cell_Data['Y'] = Y
print("Cell Coordinates Obtained")
face_Data is a dataframe containing two columns Cell_0 and Cell_1, both are IDs of cells that bound each face. face_Data also has its own coordinates in X and Y columns. My goal is to calculate coordinates of cell from that. Since the query value itself is changing everytime, I can't vectorize them directly.
cell_Data['ID'] = list(1,2,....N)
I could get the cell coordinates but looping through each record takes huge time as the size of face_Data is typically of order of 5 or 6.
Sample Data looks like this.
face_data = pd.DataFrame({
             'ID' = [1, 2, 3, 4, 5, 6],
             'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
             'Y': [1, 1, 1, 2, 2, 2],
             'Cell_0' = [1, 2, 3, 5, 6, 7],
             'Cell_1'] = [2, 3, 4, 6, 7, 8]
            })
cell_data = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8 ...], columns = ['ID'])
How the data grid looks like is shown below:
(5) | (6) | (7) | (8) 
    4-----5-----6-----
(1) | (2) | (3) | (4)
    1-----2-----3-----
The (cell_ID) is shown in brackets, the face_ID is shown below the vertical line. Just a 2D cartesian grid.
face_Data.head looks like this:

答案1
得分: 1
face_Data = pd.DataFrame({
             'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
             'Y': [1, 1, 1, 2, 2, 2],
             'Cell_0': [1, 2, 1, 3, 1, 2],
             'Cell_1': [2, 3, 3, 1, 4, 2]
            })
face_Data["face_id"] = face_Data.index + 1
cell_Data = face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
   .drop(columns=["variable"])\
   .groupby(["face_id", "ID"])\
   .first()\
   .groupby("ID")\
   .mean()
这个思路是创建一个只包含一个ID列的表,这是通过melt实现的:
>>> face_Data["face_id"] = face_Data.index + 1
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")
      X  Y  face_id variable  ID
0   0.0  1        1   Cell_0   1
1   0.1  1        2   Cell_0   2
2   0.2  1        3   Cell_0   1
3   0.0  2        4   Cell_0   3
4   0.1  2        5   Cell_0   1
5   0.2  2        6   Cell_0   2
6   0.0  1        1   Cell_1   2
7   0.1  1        2   Cell_1   3
8   0.2  1        3   Cell_1   3
9   0.0  2        4   Cell_1   1
10  0.1  2        5   Cell_1   4
11  0.2  2        6   Cell_1   2
但是,如果我们按ID进行分组(.drop(columns=["variable"]).groupby("ID").average()),我们将会将face_id为6的行计算两次(输出中的第11行和第5行)。这是因为在face_Data中,Cell_0 == Cell_1的行。
为了移除这些行,我们首先按face_id和ID进行分组,然后使用first():
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
...    .drop(columns=["variable"])\
...    .groupby(["face_id", "ID"])\
...    .first()
              X  Y
face_id ID        
1       1   0.0  1
        2   0.0  1
2       2   0.1  1
        3   0.1  1
3       1   0.2  1
        3   0.2  1
4       1   0.0  2
        3   0.0  2
5       1   0.1  2
        4   0.1  2
6       2   0.2  2
现在,我们可以对这个表按ID进行分组并计算相同ID的所有X和Y值的平均值:
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
...    .drop(columns=["variable"])\
...    .groupby(["face_id", "ID"])\
...    .first()\
...    .groupby("ID")\
...    .mean()
        X         Y
ID                 
1   0.075  1.500000
2   0.100  1.333333
3   0.100  1.333333
4   0.100  2.000000
英文:
With the help of melt and groupby we can archive this:
face_Data = pd.DataFrame({
             'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
             'Y': [1, 1, 1, 2, 2, 2],
             'Cell_0': [1, 2, 1, 3, 1, 2],
             'Cell_1': [2, 3, 3, 1, 4, 2]
            })
face_Data["face_id"] = face_Data.index + 1
cell_Data = face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
   .drop(columns=["variable"])\
   .groupby(["face_id", "ID"])\
   .first()\
   .groupby("ID")\
   .mean()
The idea is to create a table that has only one column for the ID, this is done with melt:
>>> face_Data["face_id"] = face_Data.index + 1
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")
      X  Y variable  ID
      X  Y  face_id variable  ID
0   0.0  1        1   Cell_0   1
1   0.1  1        2   Cell_0   2
2   0.2  1        3   Cell_0   1
3   0.0  2        4   Cell_0   3
4   0.1  2        5   Cell_0   1
5   0.2  2        6   Cell_0   2
6   0.0  1        1   Cell_1   2
7   0.1  1        2   Cell_1   3
8   0.2  1        3   Cell_1   3
9   0.0  2        4   Cell_1   1
10  0.1  2        5   Cell_1   4
11  0.2  2        6   Cell_1   2
But if we would do a group by ID (.drop(columns=["variable"]).groupby("ID").average()) we would count the row with face_id 6 two times (row 11 and 5 in the output above).
This happens for rows in face_Data where Cell_0 == Cell_1.
To remove these rows we do a group by face_id and ID followed by a first():
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
...    .drop(columns=["variable"])\
...    .groupby(["face_id", "ID"])\
...    .first()
              X  Y
face_id ID        
1       1   0.0  1
        2   0.0  1
2       2   0.1  1
        3   0.1  1
3       1   0.2  1
        3   0.2  1
4       1   0.0  2
        3   0.0  2
5       1   0.1  2
        4   0.1  2
6       2   0.2  2
Note that the point (X=0.2 Y=2) only appears once.
On this table we can now do a groupby("ID).average() to average all the X and Y values for the same ID:
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
...    .drop(columns=["variable"])\
...    .groupby(["face_id", "ID"])\
...    .first()\
...    .groupby("ID")\
...    .mean()
        X         Y
ID                 
1   0.075  1.500000
2   0.100  1.333333
3   0.100  1.333333
4   0.100  2.000000
答案2
得分: 0
感谢 @Runinho 提供的答案。
这对我有用。
face_Data['face_id'] = face_Data.index + 1
test = face_Data.melt(id_vars=["face_id", "X", "Y"], value_vars=["Cell_0", "Cell_1"], value_name="ID")
test = test.drop(columns=["variable"]).groupby("ID").mean()[1:]
cell_Data["X"] = test["X"]
cell_Data["Y"] = test["Y"]
英文:
Thanks @Runinho for the answer.
This worked for me.
face_Data['face_id'] = face_Data.index + 1
test = face_Data.melt(id_vars=["face_id", "X", "Y"], value_vars = ['Cell_0', 'Cell_1'], value_name = 'ID')
test = test.drop(columns = ['variable']).groupby('ID').mean()[1:]
cell_Data['X'] = test['X']
cell_Data['Y'] = test['Y']
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。


评论