有可能加速这个 pandas 数据提取吗?

huangapple go评论126阅读模式
英文:

Is it possible to speed up this pandas data extraction?

问题

X = []
Y = []
for i in cell_Data['ID']:
    A = face_Data.query(f"Cell_0 == {i} or Cell_1 == {i}")
    X.append(np.average(A['X']))
    Y.append(np.average(A['Y']))

cell_Data['X'] = X
cell_Data['Y'] = Y
print("Cell Coordinates Obtained")

face_Data是一个包含两列Cell_0和Cell_1的数据框,它们都是每个面界定的细胞的ID。face_Data还有它自己的X和Y坐标列。我的目标是从中计算细胞的坐标。由于查询值本身每次都在变化,所以我无法直接进行矢量化。

cell_Data['ID'] = list(1, 2, .... N)

我可以获取细胞坐标,但循环遍历每个记录需要很长时间,因为face_Data的大小通常是5或6的数量级。

示例数据如下:

face_data = pd.DataFrame({
             'ID' = [1, 2, 3, 4, 5, 6],
             'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
             'Y': [1, 1, 1, 2, 2, 2],
             'Cell_0' = [1, 2, 3, 5, 6, 7],
             'Cell_1' = [2, 3, 4, 6, 7, 8]
            })
cell_data = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8 ...], columns = ['ID'])

数据网格如下所示:

(5) | (6) | (7) | (8) 
    4-----5-----6-----
(1) | (2) | (3) | (4)
    1-----2-----3-----

(cell_ID)显示在括号中,face_ID显示在垂直线下方。只是一个二维笛卡尔坐标系。

face_Data.head看起来像这样:有可能加速这个 pandas 数据提取吗?


<details>
<summary>英文:</summary>

```python
X = []
Y = []
for i in cell_Data[&#39;ID&#39;]:
    A = face_Data.query(f&quot;Cell_0 == {i} or Cell_1 == {i}&quot;)
    X.append(np.average(A[&#39;X&#39;]))
    Y.append(np.average(A[&#39;Y&#39;]))

cell_Data[&#39;X&#39;] = X
cell_Data[&#39;Y&#39;] = Y
print(&quot;Cell Coordinates Obtained&quot;)

face_Data is a dataframe containing two columns Cell_0 and Cell_1, both are IDs of cells that bound each face. face_Data also has its own coordinates in X and Y columns. My goal is to calculate coordinates of cell from that. Since the query value itself is changing everytime, I can't vectorize them directly.

cell_Data['ID'] = list(1,2,....N)

I could get the cell coordinates but looping through each record takes huge time as the size of face_Data is typically of order of 5 or 6.

Sample Data looks like this.

face_data = pd.DataFrame({
             &#39;ID&#39; = [1, 2, 3, 4, 5, 6],
             &#39;X&#39;: [0, 0.1, 0.2, 0, 0.1, 0.2],
             &#39;Y&#39;: [1, 1, 1, 2, 2, 2],
             &#39;Cell_0&#39; = [1, 2, 3, 5, 6, 7],
             &#39;Cell_1&#39;] = [2, 3, 4, 6, 7, 8]
            })
cell_data = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8 ...], columns = [&#39;ID&#39;])

How the data grid looks like is shown below:

(5) | (6) | (7) | (8) 
    4-----5-----6-----
(1) | (2) | (3) | (4)
    1-----2-----3-----

The (cell_ID) is shown in brackets, the face_ID is shown below the vertical line. Just a 2D cartesian grid.

face_Data.head looks like this:
有可能加速这个 pandas 数据提取吗?

答案1

得分: 1

使用meltgroupby的帮助,我们可以实现以下目标:

face_Data = pd.DataFrame({
             'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
             'Y': [1, 1, 1, 2, 2, 2],
             'Cell_0': [1, 2, 1, 3, 1, 2],
             'Cell_1': [2, 3, 3, 1, 4, 2]
            })

face_Data["face_id"] = face_Data.index + 1
cell_Data = face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
   .drop(columns=["variable"])\
   .groupby(["face_id", "ID"])\
   .first()\
   .groupby("ID")\
   .mean()

这个思路是创建一个只包含一个ID列的表,这是通过melt实现的:

>>> face_Data["face_id"] = face_Data.index + 1
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")
      X  Y  face_id variable  ID
0   0.0  1        1   Cell_0   1
1   0.1  1        2   Cell_0   2
2   0.2  1        3   Cell_0   1
3   0.0  2        4   Cell_0   3
4   0.1  2        5   Cell_0   1
5   0.2  2        6   Cell_0   2
6   0.0  1        1   Cell_1   2
7   0.1  1        2   Cell_1   3
8   0.2  1        3   Cell_1   3
9   0.0  2        4   Cell_1   1
10  0.1  2        5   Cell_1   4
11  0.2  2        6   Cell_1   2

但是,如果我们按ID进行分组(.drop(columns=["variable"]).groupby("ID").average()),我们将会将face_id为6的行计算两次(输出中的第11行和第5行)。这是因为在face_Data中,Cell_0 == Cell_1的行。

为了移除这些行,我们首先按face_idID进行分组,然后使用first()

>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
...    .drop(columns=["variable"])\
...    .groupby(["face_id", "ID"])\
...    .first()
              X  Y
face_id ID        
1       1   0.0  1
        2   0.0  1
2       2   0.1  1
        3   0.1  1
3       1   0.2  1
        3   0.2  1
4       1   0.0  2
        3   0.0  2
5       1   0.1  2
        4   0.1  2
6       2   0.2  2

现在,我们可以对这个表按ID进行分组并计算相同ID的所有X和Y值的平均值:

>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
...    .drop(columns=["variable"])\
...    .groupby(["face_id", "ID"])\
...    .first()\
...    .groupby("ID")\
...    .mean()
        X         Y
ID                 
1   0.075  1.500000
2   0.100  1.333333
3   0.100  1.333333
4   0.100  2.000000
英文:

With the help of melt and groupby we can archive this:

face_Data = pd.DataFrame({
             &#39;X&#39;: [0, 0.1, 0.2, 0, 0.1, 0.2],
             &#39;Y&#39;: [1, 1, 1, 2, 2, 2],
             &#39;Cell_0&#39;: [1, 2, 1, 3, 1, 2],
             &#39;Cell_1&#39;: [2, 3, 3, 1, 4, 2]
            })

face_Data[&quot;face_id&quot;] = face_Data.index + 1
cell_Data = face_Data.melt(id_vars=[&quot;X&quot;, &quot;Y&quot;, &quot;face_id&quot;], value_name=&quot;ID&quot;)\
   .drop(columns=[&quot;variable&quot;])\
   .groupby([&quot;face_id&quot;, &quot;ID&quot;])\
   .first()\
   .groupby(&quot;ID&quot;)\
   .mean()

The idea is to create a table that has only one column for the ID, this is done with melt:

&gt;&gt;&gt; face_Data[&quot;face_id&quot;] = face_Data.index + 1
&gt;&gt;&gt; face_Data.melt(id_vars=[&quot;X&quot;, &quot;Y&quot;, &quot;face_id&quot;], value_name=&quot;ID&quot;)
      X  Y variable  ID
      X  Y  face_id variable  ID
0   0.0  1        1   Cell_0   1
1   0.1  1        2   Cell_0   2
2   0.2  1        3   Cell_0   1
3   0.0  2        4   Cell_0   3
4   0.1  2        5   Cell_0   1
5   0.2  2        6   Cell_0   2
6   0.0  1        1   Cell_1   2
7   0.1  1        2   Cell_1   3
8   0.2  1        3   Cell_1   3
9   0.0  2        4   Cell_1   1
10  0.1  2        5   Cell_1   4
11  0.2  2        6   Cell_1   2

But if we would do a group by ID (.drop(columns=[&quot;variable&quot;]).groupby(&quot;ID&quot;).average()) we would count the row with face_id 6 two times (row 11 and 5 in the output above).
This happens for rows in face_Data where Cell_0 == Cell_1.
To remove these rows we do a group by face_id and ID followed by a first():

&gt;&gt;&gt; face_Data.melt(id_vars=[&quot;X&quot;, &quot;Y&quot;, &quot;face_id&quot;], value_name=&quot;ID&quot;)\
...    .drop(columns=[&quot;variable&quot;])\
...    .groupby([&quot;face_id&quot;, &quot;ID&quot;])\
...    .first()
              X  Y
face_id ID        
1       1   0.0  1
        2   0.0  1
2       2   0.1  1
        3   0.1  1
3       1   0.2  1
        3   0.2  1
4       1   0.0  2
        3   0.0  2
5       1   0.1  2
        4   0.1  2
6       2   0.2  2

Note that the point (X=0.2 Y=2) only appears once.

On this table we can now do a groupby(&quot;ID).average() to average all the X and Y values for the same ID:

&gt;&gt;&gt; face_Data.melt(id_vars=[&quot;X&quot;, &quot;Y&quot;, &quot;face_id&quot;], value_name=&quot;ID&quot;)\
...    .drop(columns=[&quot;variable&quot;])\
...    .groupby([&quot;face_id&quot;, &quot;ID&quot;])\
...    .first()\
...    .groupby(&quot;ID&quot;)\
...    .mean()
        X         Y
ID                 
1   0.075  1.500000
2   0.100  1.333333
3   0.100  1.333333
4   0.100  2.000000

答案2

得分: 0

感谢 @Runinho 提供的答案。

这对我有用。

face_Data['face_id'] = face_Data.index + 1
test = face_Data.melt(id_vars=["face_id", "X", "Y"], value_vars=["Cell_0", "Cell_1"], value_name="ID")
test = test.drop(columns=["variable"]).groupby("ID").mean()[1:]
cell_Data["X"] = test["X"]
cell_Data["Y"] = test["Y"]
英文:

Thanks @Runinho for the answer.

This worked for me.

face_Data[&#39;face_id&#39;] = face_Data.index + 1
test = face_Data.melt(id_vars=[&quot;face_id&quot;, &quot;X&quot;, &quot;Y&quot;], value_vars = [&#39;Cell_0&#39;, &#39;Cell_1&#39;], value_name = &#39;ID&#39;)
test = test.drop(columns = [&#39;variable&#39;]).groupby(&#39;ID&#39;).mean()[1:]
cell_Data[&#39;X&#39;] = test[&#39;X&#39;]
cell_Data[&#39;Y&#39;] = test[&#39;Y&#39;]

huangapple
  • 本文由 发表于 2023年7月31日 21:46:14
  • 转载请务必保留本文链接:https://go.coder-hub.com/76804252.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定