2023年7月31日 21:46:14go评论162阅读模式

英文:

Is it possible to speed up this pandas data extraction?

问题

X = []
Y = []
for i in cell_Data['ID']:
    A = face_Data.query(f"Cell_0 == {i} or Cell_1 == {i}")
    X.append(np.average(A['X']))
    Y.append(np.average(A['Y']))

cell_Data['X'] = X
cell_Data['Y'] = Y
print("Cell Coordinates Obtained")

face_Data是一个包含两列Cell_0和Cell_1的数据框，它们都是每个面界定的细胞的ID。face_Data还有它自己的X和Y坐标列。我的目标是从中计算细胞的坐标。由于查询值本身每次都在变化，所以我无法直接进行矢量化。

cell_Data['ID'] = list(1, 2, .... N)

我可以获取细胞坐标，但循环遍历每个记录需要很长时间，因为face_Data的大小通常是5或6的数量级。

示例数据如下：

face_data = pd.DataFrame({
             'ID' = [1, 2, 3, 4, 5, 6],
             'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
             'Y': [1, 1, 1, 2, 2, 2],
             'Cell_0' = [1, 2, 3, 5, 6, 7],
             'Cell_1' = [2, 3, 4, 6, 7, 8]
            })
cell_data = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8 ...], columns = ['ID'])

数据网格如下所示：

(5) | (6) | (7) | (8) 
    4-----5-----6-----
(1) | (2) | (3) | (4)
    1-----2-----3-----

（cell_ID）显示在括号中，face_ID显示在垂直线下方。只是一个二维笛卡尔坐标系。

face_Data.head看起来像这样：


<details>
<summary>英文:</summary>

```python
X = []
Y = []
for i in cell_Data[&#39;ID&#39;]:
    A = face_Data.query(f&quot;Cell_0 == {i} or Cell_1 == {i}&quot;)
    X.append(np.average(A[&#39;X&#39;]))
    Y.append(np.average(A[&#39;Y&#39;]))

cell_Data[&#39;X&#39;] = X
cell_Data[&#39;Y&#39;] = Y
print(&quot;Cell Coordinates Obtained&quot;)

face_Data is a dataframe containing two columns Cell_0 and Cell_1, both are IDs of cells that bound each face. face_Data also has its own coordinates in X and Y columns. My goal is to calculate coordinates of cell from that. Since the query value itself is changing everytime, I can't vectorize them directly.

cell_Data['ID'] = list(1,2,....N)

I could get the cell coordinates but looping through each record takes huge time as the size of face_Data is typically of order of 5 or 6.

Sample Data looks like this.

face_data = pd.DataFrame({
             &#39;ID&#39; = [1, 2, 3, 4, 5, 6],
             &#39;X&#39;: [0, 0.1, 0.2, 0, 0.1, 0.2],
             &#39;Y&#39;: [1, 1, 1, 2, 2, 2],
             &#39;Cell_0&#39; = [1, 2, 3, 5, 6, 7],
             &#39;Cell_1&#39;] = [2, 3, 4, 6, 7, 8]
            })
cell_data = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8 ...], columns = [&#39;ID&#39;])

How the data grid looks like is shown below:

(5) | (6) | (7) | (8) 
    4-----5-----6-----
(1) | (2) | (3) | (4)
    1-----2-----3-----

The (cell_ID) is shown in brackets, the face_ID is shown below the vertical line. Just a 2D cartesian grid.

face_Data.head looks like this:

答案1

得分: 1

使用melt和groupby的帮助，我们可以实现以下目标：

face_Data = pd.DataFrame({
             'X': [0, 0.1, 0.2, 0, 0.1, 0.2],
             'Y': [1, 1, 1, 2, 2, 2],
             'Cell_0': [1, 2, 1, 3, 1, 2],
             'Cell_1': [2, 3, 3, 1, 4, 2]
            })

face_Data["face_id"] = face_Data.index + 1
cell_Data = face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
   .drop(columns=["variable"])\
   .groupby(["face_id", "ID"])\
   .first()\
   .groupby("ID")\
   .mean()

这个思路是创建一个只包含一个ID列的表，这是通过melt实现的：

>>> face_Data["face_id"] = face_Data.index + 1
>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")
      X  Y  face_id variable  ID
0   0.0  1        1   Cell_0   1
1   0.1  1        2   Cell_0   2
2   0.2  1        3   Cell_0   1
3   0.0  2        4   Cell_0   3
4   0.1  2        5   Cell_0   1
5   0.2  2        6   Cell_0   2
6   0.0  1        1   Cell_1   2
7   0.1  1        2   Cell_1   3
8   0.2  1        3   Cell_1   3
9   0.0  2        4   Cell_1   1
10  0.1  2        5   Cell_1   4
11  0.2  2        6   Cell_1   2

但是，如果我们按ID进行分组（.drop(columns=["variable"]).groupby("ID").average()），我们将会将face_id为6的行计算两次（输出中的第11行和第5行）。这是因为在face_Data中，Cell_0 == Cell_1的行。

为了移除这些行，我们首先按face_id和ID进行分组，然后使用first()：

>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
...    .drop(columns=["variable"])\
...    .groupby(["face_id", "ID"])\
...    .first()
              X  Y
face_id ID        
1       1   0.0  1
        2   0.0  1
2       2   0.1  1
        3   0.1  1
3       1   0.2  1
        3   0.2  1
4       1   0.0  2
        3   0.0  2
5       1   0.1  2
        4   0.1  2
6       2   0.2  2

现在，我们可以对这个表按ID进行分组并计算相同ID的所有X和Y值的平均值：

>>> face_Data.melt(id_vars=["X", "Y", "face_id"], value_name="ID")\
...    .drop(columns=["variable"])\
...    .groupby(["face_id", "ID"])\
...    .first()\
...    .groupby("ID")\
...    .mean()
        X         Y
ID                 
1   0.075  1.500000
2   0.100  1.333333
3   0.100  1.333333
4   0.100  2.000000

英文:

With the help of melt and groupby we can archive this:

face_Data = pd.DataFrame({
             &#39;X&#39;: [0, 0.1, 0.2, 0, 0.1, 0.2],
             &#39;Y&#39;: [1, 1, 1, 2, 2, 2],
             &#39;Cell_0&#39;: [1, 2, 1, 3, 1, 2],
             &#39;Cell_1&#39;: [2, 3, 3, 1, 4, 2]
            })

face_Data[&quot;face_id&quot;] = face_Data.index + 1
cell_Data = face_Data.melt(id_vars=[&quot;X&quot;, &quot;Y&quot;, &quot;face_id&quot;], value_name=&quot;ID&quot;)\
   .drop(columns=[&quot;variable&quot;])\
   .groupby([&quot;face_id&quot;, &quot;ID&quot;])\
   .first()\
   .groupby(&quot;ID&quot;)\
   .mean()

The idea is to create a table that has only one column for the ID, this is done with melt:

&gt;&gt;&gt; face_Data[&quot;face_id&quot;] = face_Data.index + 1
&gt;&gt;&gt; face_Data.melt(id_vars=[&quot;X&quot;, &quot;Y&quot;, &quot;face_id&quot;], value_name=&quot;ID&quot;)
      X  Y variable  ID
      X  Y  face_id variable  ID
0   0.0  1        1   Cell_0   1
1   0.1  1        2   Cell_0   2
2   0.2  1        3   Cell_0   1
3   0.0  2        4   Cell_0   3
4   0.1  2        5   Cell_0   1
5   0.2  2        6   Cell_0   2
6   0.0  1        1   Cell_1   2
7   0.1  1        2   Cell_1   3
8   0.2  1        3   Cell_1   3
9   0.0  2        4   Cell_1   1
10  0.1  2        5   Cell_1   4
11  0.2  2        6   Cell_1   2

But if we would do a group by ID (.drop(columns=["variable"]).groupby("ID").average()) we would count the row with face_id 6 two times (row 11 and 5 in the output above).
This happens for rows in face_Data where Cell_0 == Cell_1.
To remove these rows we do a group by face_id and ID followed by a first():

&gt;&gt;&gt; face_Data.melt(id_vars=[&quot;X&quot;, &quot;Y&quot;, &quot;face_id&quot;], value_name=&quot;ID&quot;)\
...    .drop(columns=[&quot;variable&quot;])\
...    .groupby([&quot;face_id&quot;, &quot;ID&quot;])\
...    .first()
              X  Y
face_id ID        
1       1   0.0  1
        2   0.0  1
2       2   0.1  1
        3   0.1  1
3       1   0.2  1
        3   0.2  1
4       1   0.0  2
        3   0.0  2
5       1   0.1  2
        4   0.1  2
6       2   0.2  2

Note that the point (X=0.2 Y=2) only appears once.

On this table we can now do a groupby("ID).average() to average all the X and Y values for the same ID:

&gt;&gt;&gt; face_Data.melt(id_vars=[&quot;X&quot;, &quot;Y&quot;, &quot;face_id&quot;], value_name=&quot;ID&quot;)\
...    .drop(columns=[&quot;variable&quot;])\
...    .groupby([&quot;face_id&quot;, &quot;ID&quot;])\
...    .first()\
...    .groupby(&quot;ID&quot;)\
...    .mean()
        X         Y
ID                 
1   0.075  1.500000
2   0.100  1.333333
3   0.100  1.333333
4   0.100  2.000000

答案2

得分: 0

感谢 @Runinho 提供的答案。

这对我有用。

face_Data['face_id'] = face_Data.index + 1
test = face_Data.melt(id_vars=["face_id", "X", "Y"], value_vars=["Cell_0", "Cell_1"], value_name="ID")
test = test.drop(columns=["variable"]).groupby("ID").mean()[1:]
cell_Data["X"] = test["X"]
cell_Data["Y"] = test["Y"]

英文:

Thanks @Runinho for the answer.

This worked for me.

face_Data[&#39;face_id&#39;] = face_Data.index + 1
test = face_Data.melt(id_vars=[&quot;face_id&quot;, &quot;X&quot;, &quot;Y&quot;], value_vars = [&#39;Cell_0&#39;, &#39;Cell_1&#39;], value_name = &#39;ID&#39;)
test = test.drop(columns = [&#39;variable&#39;]).groupby(&#39;ID&#39;).mean()[1:]
cell_Data[&#39;X&#39;] = test[&#39;X&#39;]
cell_Data[&#39;Y&#39;] = test[&#39;Y&#39;]

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

有可能加速这个 pandas 数据提取吗？

问题

答案1

答案2

如何将VTK截图导出为PDF或矢量化图像？

将字符串转换为浮点值

当连接两个数据框时，会添加一个额外的行。

Python, numpy性能不一致

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论