Is it possible to speed up this pandas data extraction?
Question
```python
import numpy as np

X = []
Y = []
for i in cell_Data['ID']:
    # Select every face that touches cell i on either side
    A = face_Data.query(f"Cell_0 == {i} or Cell_1 == {i}")
    X.append(np.average(A['X']))
    Y.append(np.average(A['Y']))
cell_Data['X'] = X
cell_Data['Y'] = Y
print("Cell Coordinates Obtained")
```
`face_Data` is a DataFrame with two columns, `Cell_0` and `Cell_1`, that hold the IDs of the two cells bounding each face. `face_Data` also stores each face's own coordinates in its `X` and `Y` columns. My goal is to compute each cell's coordinates as the average of the coordinates of its faces. Because the query value changes on every iteration, I can't vectorize this directly.
`cell_Data['ID']` is simply the list of cell IDs 1, 2, ..., N.
This gives me the cell coordinates, but looping over every record takes a very long time, because `face_Data` typically has on the order of 10^5 or 10^6 rows.
Sample data looks like this:
```python
import pandas as pd

face_Data = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
    'Y': [1, 1, 1, 2, 2, 2],
    'Cell_0': [1, 2, 3, 5, 6, 7],
    'Cell_1': [2, 3, 4, 6, 7, 8]
})
cell_Data = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8], columns=['ID'])  # ... and so on up to N
```
The data grid looks like this:
(5) | (6) | (7) | (8)
4-----5-----6-----
(1) | (2) | (3) | (4)
1-----2-----3-----
Cell IDs are shown in parentheses, and face IDs are shown below the vertical lines. It is just a 2D Cartesian grid.
Answer 1
Score: 1
With the help of `melt` and `groupby` we can achieve this:
```python
import pandas as pd

face_Data = pd.DataFrame({
    'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
    'Y': [1, 1, 1, 2, 2, 2],
    'Cell_0': [1, 2, 1, 3, 1, 2],
    'Cell_1': [2, 3, 3, 1, 4, 2]
})
# Give every face an explicit ID so duplicates can be detected later
face_Data["face_id"] = face_Data.index + 1
cell_Data = face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
    .drop(columns=["variable"])\
    .groupby(["face_id", "ID"])\
    .first()\
    .groupby("ID")\
    .mean()
```
The idea is to build a table with a single column for the cell ID, which is what `melt` does:
```python
>>> face_Data["face_id"] = face_Data.index + 1
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")
      X  Y  face_id variable  ID
0   0.0  1        1   Cell_0   1
1   0.1  1        2   Cell_0   2
2   0.2  1        3   Cell_0   1
3   0.0  2        4   Cell_0   3
4   0.1  2        5   Cell_0   1
5   0.2  2        6   Cell_0   2
6   0.0  1        1   Cell_1   2
7   0.1  1        2   Cell_1   3
8   0.2  1        3   Cell_1   3
9   0.0  2        4   Cell_1   1
10  0.1  2        5   Cell_1   4
11  0.2  2        6   Cell_1   2
```
But if we grouped by ID directly (`.drop(columns=["variable"]).groupby("ID").mean()`), the row with `face_id` 6 would be counted twice (rows 5 and 11 in the output above).
This happens for every row of `face_Data` where `Cell_0 == Cell_1`.
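To make the double counting concrete, here is a small sketch using the sample `face_Data` from this answer. It computes the naive per-ID mean and counts how many melted rows each face contributes to cell 2, whose face 6 has `Cell_0 == Cell_1`. The names `melted`, `naive`, and `rows_per_face` are only illustrative.

```python
import pandas as pd

face_Data = pd.DataFrame({
    'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
    'Y': [1, 1, 1, 2, 2, 2],
    'Cell_0': [1, 2, 1, 3, 1, 2],
    'Cell_1': [2, 3, 3, 1, 4, 2],
})
face_Data["face_id"] = face_Data.index + 1

melted = face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")

# Naive grouping: no deduplication, so face 6 is averaged into cell 2 twice.
naive = melted.drop(columns=["variable"]).groupby("ID")[["X", "Y"]].mean()

# Number of melted rows each face contributes to cell 2.
rows_per_face = melted[melted["ID"] == 2].groupby("face_id").size()

print(naive.loc[2])
print(rows_per_face)
```

For cell 2 the naive mean is pulled toward face 6, because that face appears in both the `Cell_0` and the `Cell_1` half of the melted table.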
To remove these duplicates, we group by `face_id` and `ID` and then take `first()`:
```python
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
...     .drop(columns=["variable"])\
...     .groupby(["face_id", "ID"])\
...     .first()
              X  Y
face_id ID
1       1   0.0  1
        2   0.0  1
2       2   0.1  1
        3   0.1  1
3       1   0.2  1
        3   0.2  1
4       1   0.0  2
        3   0.0  2
5       1   0.1  2
        4   0.1  2
6       2   0.2  2
```
Note that the point (X=0.2, Y=2) now appears only once.
On this table we can now call `groupby("ID").mean()` to average all the X and Y values that share the same ID:
```python
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
...     .drop(columns=["variable"])\
...     .groupby(["face_id", "ID"])\
...     .first()\
...     .groupby("ID")\
...     .mean()
        X         Y
ID
1   0.075  1.500000
2   0.100  1.333333
3   0.100  1.333333
4   0.100  2.000000
```
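As a sanity check, the vectorized result can be compared against a loop like the one in the question. The sketch below is one way to do that, not part of the original answer: the helper names `cell_centers_loop` and `cell_centers_vectorized` are made up for this example, and it uses `drop_duplicates` in place of `groupby(...).first()`, which should keep the same rows for this data.

```python
import pandas as pd


def cell_centers_loop(face_Data, cell_ids):
    """Reference version: one boolean filter per cell (slow for large inputs)."""
    rows = []
    for i in cell_ids:
        A = face_Data[(face_Data["Cell_0"] == i) | (face_Data["Cell_1"] == i)]
        rows.append((i, A["X"].mean(), A["Y"].mean()))
    return pd.DataFrame(rows, columns=["ID", "X", "Y"]).set_index("ID")


def cell_centers_vectorized(face_Data):
    """melt + groupby version; each (face, cell) pair is counted once."""
    faces = face_Data.reset_index(drop=True)  # new frame, input is left untouched
    faces["face_id"] = faces.index + 1
    long = faces.melt(
        id_vars=["X", "Y", "face_id"],
        value_vars=["Cell_0", "Cell_1"],
        value_name="ID",
    )
    # A face with Cell_0 == Cell_1 would otherwise appear twice for that cell.
    long = long.drop_duplicates(subset=["face_id", "ID"])
    return long.groupby("ID")[["X", "Y"]].mean()


face_Data = pd.DataFrame({
    'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
    'Y': [1, 1, 1, 2, 2, 2],
    'Cell_0': [1, 2, 1, 3, 1, 2],
    'Cell_1': [2, 3, 3, 1, 4, 2],
})

expected = cell_centers_loop(face_Data, [1, 2, 3, 4])
result = cell_centers_vectorized(face_Data)

# Raises an AssertionError if the two approaches disagree.
pd.testing.assert_frame_equal(result, expected, check_dtype=False)
print(result)
```

The loop stays useful as a reference on small inputs, while the vectorized version is the one to run on the full `face_Data`.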
Answer 2
Score: 0
Thanks to @Runinho for the answer.
This worked for me:
```python
face_Data['face_id'] = face_Data.index + 1
# Stack Cell_0 and Cell_1 into a single ID column
test = face_Data.melt(id_vars=['face_id', 'X', 'Y'], value_vars=['Cell_0', 'Cell_1'], value_name='ID')
# Average per cell ID; [1:] skips the first (lowest) ID group
test = test.drop(columns=['variable']).groupby('ID').mean()[1:]
cell_Data['X'] = test['X']
cell_Data['Y'] = test['Y']
```
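One thing to watch in the snippet above: `test` is indexed by cell ID, and pandas aligns on the index when a Series is assigned as a column. If `cell_Data` has a default 0-based index, the values may end up on the wrong rows. The sketch below shows one possible way around this, looking the coordinates up through the `ID` column with `Series.map`. It assumes `cell_Data` has an `ID` column and a default `RangeIndex`, uses the sample data from the question, and omits the `[1:]` slice because this sample has no extra leading ID group. The name `aligned_by_index` is only illustrative.

```python
import pandas as pd

face_Data = pd.DataFrame({
    'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
    'Y': [1, 1, 1, 2, 2, 2],
    'Cell_0': [1, 2, 3, 5, 6, 7],
    'Cell_1': [2, 3, 4, 6, 7, 8],
})
cell_Data = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8]})

face_Data['face_id'] = face_Data.index + 1
test = (
    face_Data.melt(id_vars=['face_id', 'X', 'Y'],
                   value_vars=['Cell_0', 'Cell_1'], value_name='ID')
    .groupby('ID')[['X', 'Y']]
    .mean()
)

# Direct assignment aligns on the index: row label 0 has no match in test,
# so it becomes NaN, and every other row receives the previous cell's value.
aligned_by_index = cell_Data[['ID']].copy()
aligned_by_index['X'] = test['X']

# Mapping through the ID column looks each cell up by its ID instead.
cell_Data['X'] = cell_Data['ID'].map(test['X'])
cell_Data['Y'] = cell_Data['ID'].map(test['Y'])

print(aligned_by_index)
print(cell_Data)
```

If `cell_Data` is already indexed by cell ID, the direct assignment from the answer lines up correctly and the `map` step is not needed.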