2023年7月27日 22:03:44go评论179阅读模式

英文:

Vectorize the assignment of a column in a pandas dataframe where a custom index has many rows and the column value is set using all rows in the index

问题

I cannot share the actual data so I have made up a similar situation using student courses.
在实际数据中，有6种可能的状态。在这里我减少到了3种。我大约有30万行数据，并且预计不久将增长到100倍。
I need to set a value in a column based upon the evaluation of another column across multiple rows that share an index.
我需要根据对共享索引的多行的另一列的评估来设置列中的一个值。
I can do this using iteration or a lambda function but they both take over 5 minutes on a Mac (it will be 10x faster at least when deployed to Linux) to run and when my data is increased, neither solutions will not be an option. I need to find out how to vectorize this.
我可以使用迭代或lambda函数来做到这一点，但它们在Mac上运行需要超过5分钟（在部署到Linux时至少会快10倍），而且当我的数据增加时，这两种解决方案都不是选项。我需要找出如何进行向量化。

Data snippet:
数据片段：

d = [{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
     {'name':'Student1', 'course':'Math', 'exam_status':'Fail'},
     {'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
     {'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
     {'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
     {'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
     {'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
     {'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
     {'name':'Student1', 'course':'Art', 'exam_status':'Fail'},
     {'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
     {'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
     {'name':'Student1', 'course':'Computer Science', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Chemistry', 'exam_status':'Fail'},
     {'name':'Student2', 'course':'Chemistry', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Chemistry', 'exam_status':'Absent'},
     {'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Art', 'exam_status':'Fail'},
     {'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},]

df = pd.DataFrame(d)
df.course_status = ''
# Create a tuple so the student and course can be tracked as one index:
df['name_course'] = df[['name', 'course']].apply(tuple, axis=1)
# Set the tuple as the index (hoping to get a bit of performance boost)
df = df.set_index(['name_course'], drop=True)

name_course	name	course	exam_status
('Student1', 'Math')	Student1	Math	Pass
('Student1', 'Math')	Student1	Math	Fail
('Student1', 'Math')	Student1	Math	Pass
('Student1', 'Chemistry')	Student1	Chemistry	Pass
('Student1', 'Chemistry')	Student1	Chemistry	Pass
('Student1', 'Chemistry')	Student1	Chemistry	Pass
('Student1', 'Art')	Student1	Art	Absent
('Student1', 'Art')	Student1	Art	Absent
('Student1', 'Art')	Student1	Art	Fail
('Student1', 'Computer Science')	Student1	Computer Science	Fail
('Student1', 'Computer Science')	Student1	Computer Science	Fail
('Student1', 'Computer Science')	Student1	Computer Science	Pass
('Student2', 'Math')	Student2	Math	Pass
('Student2', 'Math')	Student2	Math	Pass
('Student2', 'Math')	Student2	Math	Pass
('Student2', 'Chemistry')	Student2	Chemistry	Fail
('Student2', 'Chemistry')	Student2	Chemistry	Pass
('Student2', 'Chemistry')	Student2	Chemistry	Absent
('Student2', 'Art')	Student2	Art	Pass
('Student2', 'Art')	Student2	Art	Pass
('Student2', 'Art')	Student2	Art	Fail
('Student2', 'Computer Science')	Student2	Computer Science	Pass
('Student2', 'Computer Science')	Student2	Computer Science	Pass
('Student2', 'Computer Science')	Student2	Computer Science	Pass

Both of these coding techniques below work but basically iterate over the unique student/course tuple and with hundreds of millions of rows will be too slow:

下面的这两种编码技巧都能工作，但基本上是迭代唯一的学生/课程元组，对于数亿行来说将会太慢：

Approach 1: Iteration using a unique index which greatly cuts down on the loop count because there are about 50-150 rows for each index in the full data. However, I cannot figure out how to get rid of the multiple loc commands in the script. I cannot use an 'at' because there are multiple rows returned by loc. At any case, if I could do that, it would only make it as fast as the lambda approach.

方法1：使用唯一索引进行迭代，这大大减少了循环计数，因为完整数据中每个索引大约有50-150行。但是，我无法弄清楚如何在脚本中摆脱多个loc命令。我不能使用“at”因为loc返回了多行。无论如何，如果我能做到这一点，它只会使速度与lambda方法一样快。

for i in df.index.unique():
    status

<details>
<summary>英文:</summary>

I cannot share the actual data so I have made up a similar situation using student courses.
In the real data there are 6 possible statuses.  I have cut it down to 3 here.  I have approximately 300k rows and am expecting this to grow by 100x soon.
I need to set a value in a column based upon the evaluation of another column across multiple rows that share an index.
I can do this using iteration or a lambda function but they both take over 5 minutes on a Mac (it will be 10x faster at least when deployed to Linux) to run and when my data is increased, neither solutions will not be an option.  I need to find out how to vectorize this.

Data snippet:

d = [{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Math', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Absent'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},]

df = pd.DataFrame(d)
df.course_status = ''

Create a tuple so the student and course can be tracked as one index:

df['name_course'] = df[['name', 'course']].apply(tuple, axis=1)

Set the tuple as the index (hoping to get a bit of performance boost)

df = df.set_index(['name_course'],drop=True)


| name_course | name | course | exam_status |
|:--- |:--- |:--- |:--- |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Pass |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Fail |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Pass |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Absent |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Absent |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Pass |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Fail |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Pass |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Absent |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Pass |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Pass |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Fail |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass |
Both of these coding techniques below work but basically iterate over the unique student/course tuple and with hundreds of millions of rows will be too slow:
Approach 1: Iteration using a unique index which greatly cuts down on the loop count because there are about 50-150 rows for each index in the full data. However, I cannot figure out how to get rid of the multiple loc commands in the script.  I cannot use an &#39;at&#39; because there are multiple rows returned by loc.  At any case, if I could do that, it would only make it as fast as the lambda approach.

for i in df.index.unique():
status_list = list(df.loc[[i]]['exam_status'])
if 'Fail' in status_list: df.loc[[i], 'course_status'] = 'Fail'
elif 'Absent' in status_list: df.loc[[i], 'course_status'] = 'Absent'
elif 'Pass' or 'Skip' in status_list: df.loc[[i], 'course_status'] = 'Pass'
print(df.to_markdown())


print(df.to_markdown())
| name_course | name | course | exam_status | course_status |
|:--- |:--- |:--- |:--- |:--- |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Pass | Fail |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Fail | Fail |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Pass | Fail |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass | Pass |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass | Pass |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass | Pass |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Absent | Fail |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Absent | Fail |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Fail | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Fail | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Fail | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Pass | Fail |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass | Pass |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass | Pass |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass | Pass |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Fail | Fail |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Pass | Fail |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Absent | Fail |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Pass | Fail |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Pass | Fail |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Fail | Fail |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass | Pass |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass | Pass |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass | Pass |
Approach 2: Using a lambda instead.  This works faster due to the lack of multiple loc commands inside the function, but still does not use vectorization and is quite slow.

Set course_status based on rule statuses for that audit

def set_course_status(i):
status_list = list(df.loc[[i]]['exam_status'])
if 'Fail' in status_list: return 'Fail'
elif 'Absent' in status_list: return 'Absent'
elif 'Pass' in status_list: return 'Pass'
df['course_status'] = df.groupby(df.index).apply(lambda x : set_course_status(x.name))


The results are the same as above.
Can anyone please come up with a way to do this using vectorization or some really fast method?
</details>
# 答案1
**得分**: 1
以下是翻译好的代码部分：
```python
import pandas as pd
d = [{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Math', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Absent'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},]
df = pd.DataFrame(d)
df.course_status = ''
# 创建一个元组，以便学生和课程可以作为一个索引跟踪：
df['name_course'] = df[['name', 'course']].apply(tuple, axis=1)
# 将元组设置为索引（希望能获得一些性能提升）
df = df.set_index(['name_course'], drop=True)
# 状态的顺序
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# 将状态映射为数字
df["course_status"] = df.exam_status.map(status_map)
# 获取每个name_course的最小course_status
df["course_status"] = df.groupby([
"name_course",
]).course_status.transform("min").map(
# 将映射回原始状态名称
{v: k for k, v in status_map.items()}
)
print(df)

# 时间比较通过%%timeit
# 你的第一次尝试
%%timeit
for i in df.index.unique():
    status_list = list(df.loc[[i]]['exam_status'])
    if 'Fail' in status_list: df.loc[[i], 'course_status']  = 'Fail'
    elif 'Absent' in status_list: df.loc[[i], 'course_status']  = 'Absent'
    elif 'Pass' or 'Skip' in status_list: df.loc[[i], 'course_status']  = 'Pass'

# 你的第二次尝试
%%timeit
# 根据规则为该审计设置course_status
def set_course_status(i):
    status_list = list(df.loc[[i]]['exam_status'])
    if 'Fail' in status_list: return 'Fail'
    elif 'Absent' in status_list: return 'Absent'
    elif 'Pass' in status_list: return 'Pass'
df['course_status'] = df.groupby(df.index).apply(lambda x : set_course_status(x.name))

# 我的尝试
%%timeit
# 状态的顺序
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# 将状态映射为数字
df["course_status"] = df.exam_status.map(status_map)
# 获取每个name_course的最小course_status
df["course_status"] = df.groupby([
    "name_course",
]).course_status.transform("min").map(
    # 将映射回原始状态名称
    {v: k for k, v in status_map.items()}
)

# 在10,000倍的行上进行测试
# 240_000行而不是24行
big = pd.concat((df for _ in range(10_000)))

# 注意：我将`df`替换为`big`
%%timeit
# 状态的顺序
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# 将状态映射为数字
big["course_status"] = big.exam_status.map(status_map)
# 获取每个name_course的最小course_status
big["course_status"] = big.groupby([
    "name_course",
]).course_status.transform("min").map(
    # 将映射回原始状态名称
    {v: k for k, v in status_map.items()}
)

英文:

How's this?

import pandas as pd


d = [{&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Absent&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Absent&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Absent&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},]

df = pd.DataFrame(d)
df.course_status = &#39;&#39;
# Create a tuple so the student and course can be tracked as one index:
df[&#39;name_course&#39;] = df[[&#39;name&#39;, &#39;course&#39;]].apply(tuple, axis=1)
# Set the tuple as the index (hoping to get a bit of performance boost)
df = df.set_index([&#39;name_course&#39;],drop=True)

# ordering of statuses
status_map = {&quot;Fail&quot;: 0, &quot;Absent&quot;: 1, &quot;Pass&quot;: 2}
# map statuses to numeric
df[&quot;course_status&quot;] = df.exam_status.map(status_map)
# get the min course_status per name_course
df[&quot;course_status&quot;] = df.groupby([
    &quot;name_course&quot;,
]).course_status.transform(&quot;min&quot;).map(
    # map back to original status names
    {v: k for k, v in status_map.items()}
)

print(df)

                                  name            course exam_status course_status
name_course                                                                       
(Student1, Math)              Student1              Math        Pass          Fail
(Student1, Math)              Student1              Math        Fail          Fail
(Student1, Math)              Student1              Math        Pass          Fail
(Student1, Chemistry)         Student1         Chemistry        Pass          Pass
(Student1, Chemistry)         Student1         Chemistry        Pass          Pass
(Student1, Chemistry)         Student1         Chemistry        Pass          Pass
(Student1, Art)               Student1               Art      Absent          Fail
(Student1, Art)               Student1               Art      Absent          Fail
(Student1, Art)               Student1               Art        Fail          Fail
(Student1, Computer Science)  Student1  Computer Science        Fail          Fail
(Student1, Computer Science)  Student1  Computer Science        Fail          Fail
(Student1, Computer Science)  Student1  Computer Science        Pass          Fail
(Student2, Math)              Student2              Math        Pass          Pass
(Student2, Math)              Student2              Math        Pass          Pass
(Student2, Math)              Student2              Math        Pass          Pass
(Student2, Chemistry)         Student2         Chemistry        Fail          Fail
(Student2, Chemistry)         Student2         Chemistry        Pass          Fail
(Student2, Chemistry)         Student2         Chemistry      Absent          Fail
(Student2, Art)               Student2               Art        Pass          Fail
(Student2, Art)               Student2               Art        Pass          Fail
(Student2, Art)               Student2               Art        Fail          Fail
(Student2, Computer Science)  Student2  Computer Science        Pass          Pass
(Student2, Computer Science)  Student2  Computer Science        Pass          Pass
(Student2, Computer Science)  Student2  Computer Science        Pass          Pass

Time comparisons via %%timeit

# your first attempt
%%timeit
for i in df.index.unique():
    status_list = list(df.loc[[i]][&#39;exam_status&#39;])
    if &#39;Fail&#39; in status_list: df.loc[[i], &#39;course_status&#39;]  = &#39;Fail&#39;
    elif &#39;Absent&#39; in status_list: df.loc[[i], &#39;course_status&#39;]  = &#39;Absent&#39;
    elif &#39;Pass&#39; or &#39;Skip&#39; in status_list: df.loc[[i], &#39;course_status&#39;]  = &#39;Pass&#39;

8.62 ms &#177; 1.91 ms per loop (mean &#177; std. dev. of 7 runs, 100 loops each)

# your second attempt
%%timeit
# Set course_status based on rule statuses for that audit
def set_course_status(i):
    status_list = list(df.loc[[i]][&#39;exam_status&#39;])
    if &#39;Fail&#39; in status_list: return &#39;Fail&#39;
    elif &#39;Absent&#39; in status_list: return &#39;Absent&#39;
    elif &#39;Pass&#39; in status_list: return &#39;Pass&#39;
df[&#39;course_status&#39;] = df.groupby(df.index).apply(lambda x : set_course_status(x.name))

6.79 ms &#177; 561 &#181;s per loop (mean &#177; std. dev. of 7 runs, 100 loops each)

# my attempt
%%timeit
# ordering of statuses
status_map = {&quot;Fail&quot;: 0, &quot;Absent&quot;: 1, &quot;Pass&quot;: 2}
# map statuses to numeric
df[&quot;course_status&quot;] = df.exam_status.map(status_map)
# get the min course_status per name_course
df[&quot;course_status&quot;] = df.groupby([
    &quot;name_course&quot;,
]).course_status.transform(&quot;min&quot;).map(
    # map back to original status names
    {v: k for k, v in status_map.items()}
)

1.99 ms &#177; 191 &#181;s per loop (mean &#177; std. dev. of 7 runs, 1,000 loops each)

Testing on 10,000x the rows

# 240_000 rows instead of 24
big = pd.concat((df for _ in range(10_000)))

# NOTE: I swapped `df` for `big`
%%timeit
# ordering of statuses
status_map = {&quot;Fail&quot;: 0, &quot;Absent&quot;: 1, &quot;Pass&quot;: 2}
# map statuses to numeric
big[&quot;course_status&quot;] = big.exam_status.map(status_map)
# get the min course_status per name_course
big[&quot;course_status&quot;] = big.groupby([
    &quot;name_course&quot;,
]).course_status.transform(&quot;min&quot;).map(
    # map back to original status names
    {v: k for k, v in status_map.items()}
)

33.7 ms &#177; 1.89 ms per loop (mean &#177; std. dev. of 7 runs, 10 loops each)

答案2

得分: 1

def addCourseStatus(df):
    has = pd.DataFrame({'absent': df.exam_status == 'Absent', 'fail': df.exam_status == 'Fail', 'pas': df.exam_status == 'Pass'}).groupby(level=0).any()
    df.loc[has.fail, 'course_status'] = 'Fail'
    df.loc[~has.fail & has.absent, 'course_status'] = 'Absent'
    df.loc[~has.fail & ~has.absent & has.pas, 'course_status'] = 'Pass'

英文:


def addCourseStatus(df):
    has=pd.DataFrame({&#39;absent&#39;:df.exam_status==&#39;Absent&#39;, &#39;fail&#39;:df.exam_status==&#39;Fail&#39;, &#39;pas&#39;:df.exam_status==&#39;Pass&#39;}).groupby(level=0).any()
    df.loc[has.fail,&#39;course_status&#39;]=&#39;Fail&#39;
    df.loc[~has.fail &amp; has.absent, &#39;course_status&#39;]=&#39;Absent&#39;
    df.loc[~has.fail &amp; ~has.absent &amp; has.pas, &#39;course_status&#39;]=&#39;Pass&#39;

Probably not optimal. I don't like the creation of a temporary dataframe. But vectorized.

It is faster, but barely (some 30% faster) than your two solutions for your example. Because it is vectorized indeed, but that are a few of those vectorized operations (3 comparisons, then the creation of the dataframe, then 3 booleans/loc operations).

But the more rows there will be, the more the vectorization will show, and the less those 7 operations will matter.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Vectorize the assignment of a column in a pandas dataframe where a custom index has many rows and the column value is set using all rows in the index

问题

Create a tuple so the student and course can be tracked as one index:

Set the tuple as the index (hoping to get a bit of performance boost)

Set course_status based on rule statuses for that audit

Time comparisons via %%timeit

Testing on 10,000x the rows

答案2

为什么为方法描述符设置set函数不会使其成为数据描述符？

如何让Keras在GPU上运行？

Is there a way to execute something in the console of a website without opening the DevTools?

如何确保我的约束能够在数学上与 LpVariable 协同工作？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论