Vectorize the assignment of a column in a pandas dataframe where a custom index has many rows and the column value is set using all rows in the index

huangapple go评论58阅读模式
英文:

Vectorize the assignment of a column in a pandas dataframe where a custom index has many rows and the column value is set using all rows in the index

问题

I cannot share the actual data so I have made up a similar situation using student courses.
在实际数据中,有6种可能的状态。在这里我减少到了3种。我大约有30万行数据,并且预计不久将增长到100倍。
I need to set a value in a column based upon the evaluation of another column across multiple rows that share an index.
我需要根据对共享索引的多行的另一列的评估来设置列中的一个值。
I can do this using iteration or a lambda function but they both take over 5 minutes on a Mac (it will be 10x faster at least when deployed to Linux) to run and when my data is increased, neither solutions will not be an option. I need to find out how to vectorize this.
我可以使用迭代或lambda函数来做到这一点,但它们在Mac上运行需要超过5分钟(在部署到Linux时至少会快10倍),而且当我的数据增加时,这两种解决方案都不是选项。我需要找出如何进行向量化。

Data snippet:
数据片段:

d = [{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
     {'name':'Student1', 'course':'Math', 'exam_status':'Fail'},
     {'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
     {'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
     {'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
     {'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
     {'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
     {'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
     {'name':'Student1', 'course':'Art', 'exam_status':'Fail'},
     {'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
     {'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
     {'name':'Student1', 'course':'Computer Science', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Chemistry', 'exam_status':'Fail'},
     {'name':'Student2', 'course':'Chemistry', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Chemistry', 'exam_status':'Absent'},
     {'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Art', 'exam_status':'Fail'},
     {'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
     {'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},]

df = pd.DataFrame(d)
df.course_status = ''
# Create a tuple so the student and course can be tracked as one index:
df['name_course'] = df[['name', 'course']].apply(tuple, axis=1)
# Set the tuple as the index (hoping to get a bit of performance boost)
df = df.set_index(['name_course'], drop=True)
name_course name course exam_status
('Student1', 'Math') Student1 Math Pass
('Student1', 'Math') Student1 Math Fail
('Student1', 'Math') Student1 Math Pass
('Student1', 'Chemistry') Student1 Chemistry Pass
('Student1', 'Chemistry') Student1 Chemistry Pass
('Student1', 'Chemistry') Student1 Chemistry Pass
('Student1', 'Art') Student1 Art Absent
('Student1', 'Art') Student1 Art Absent
('Student1', 'Art') Student1 Art Fail
('Student1', 'Computer Science') Student1 Computer Science Fail
('Student1', 'Computer Science') Student1 Computer Science Fail
('Student1', 'Computer Science') Student1 Computer Science Pass
('Student2', 'Math') Student2 Math Pass
('Student2', 'Math') Student2 Math Pass
('Student2', 'Math') Student2 Math Pass
('Student2', 'Chemistry') Student2 Chemistry Fail
('Student2', 'Chemistry') Student2 Chemistry Pass
('Student2', 'Chemistry') Student2 Chemistry Absent
('Student2', 'Art') Student2 Art Pass
('Student2', 'Art') Student2 Art Pass
('Student2', 'Art') Student2 Art Fail
('Student2', 'Computer Science') Student2 Computer Science Pass
('Student2', 'Computer Science') Student2 Computer Science Pass
('Student2', 'Computer Science') Student2 Computer Science Pass

Both of these coding techniques below work but basically iterate over the unique student/course tuple and with hundreds of millions of rows will be too slow:

下面的这两种编码技巧都能工作,但基本上是迭代唯一的学生/课程元组,对于数亿行来说将会太慢:

Approach 1: Iteration using a unique index which greatly cuts down on the loop count because there are about 50-150 rows for each index in the full data. However, I cannot figure out how to get rid of the multiple loc commands in the script. I cannot use an 'at' because there are multiple rows returned by loc. At any case, if I could do that, it would only make it as fast as the lambda approach.

方法1:使用唯一索引进行迭代,这大大减少了循环计数,因为完整数据中每个索引大约有50-150行。但是,我无法弄清楚如何在脚本中摆脱多个loc命令。我不能使用“at”因为loc返回了多行。无论如何,如果我能做到这一点,它只会使速度与lambda方法一样快。

for i in df.index.unique():
    status

<details>
<summary>英文:</summary>

I cannot share the actual data so I have made up a similar situation using student courses.
In the real data there are 6 possible statuses.  I have cut it down to 3 here.  I have approximately 300k rows and am expecting this to grow by 100x soon.
I need to set a value in a column based upon the evaluation of another column across multiple rows that share an index.
I can do this using iteration or a lambda function but they both take over 5 minutes on a Mac (it will be 10x faster at least when deployed to Linux) to run and when my data is increased, neither solutions will not be an option.  I need to find out how to vectorize this.

Data snippet:

d = [{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Math', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Absent'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},]

df = pd.DataFrame(d)
df.course_status = ''

Create a tuple so the student and course can be tracked as one index:

df['name_course'] = df[['name', 'course']].apply(tuple, axis=1)

Set the tuple as the index (hoping to get a bit of performance boost)

df = df.set_index(['name_course'],drop=True)


| name_course | name | course | exam_status |
|:--- |:--- |:--- |:--- |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Pass |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Fail |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Pass |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Absent |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Absent |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Pass |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Fail |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Pass |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Absent |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Pass |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Pass |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Fail |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass |
Both of these coding techniques below work but basically iterate over the unique student/course tuple and with hundreds of millions of rows will be too slow:
Approach 1: Iteration using a unique index which greatly cuts down on the loop count because there are about 50-150 rows for each index in the full data. However, I cannot figure out how to get rid of the multiple loc commands in the script.  I cannot use an &#39;at&#39; because there are multiple rows returned by loc.  At any case, if I could do that, it would only make it as fast as the lambda approach.

for i in df.index.unique():
status_list = list(df.loc[[i]]['exam_status'])
if 'Fail' in status_list: df.loc[[i], 'course_status'] = 'Fail'
elif 'Absent' in status_list: df.loc[[i], 'course_status'] = 'Absent'
elif 'Pass' or 'Skip' in status_list: df.loc[[i], 'course_status'] = 'Pass'
print(df.to_markdown())


print(df.to_markdown())
| name_course | name | course | exam_status | course_status |
|:--- |:--- |:--- |:--- |:--- |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Pass | Fail |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Fail | Fail |
| (&#39;Student1&#39;, &#39;Math&#39;) | Student1 | Math | Pass | Fail |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass | Pass |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass | Pass |
| (&#39;Student1&#39;, &#39;Chemistry&#39;) | Student1 | Chemistry | Pass | Pass |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Absent | Fail |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Absent | Fail |
| (&#39;Student1&#39;, &#39;Art&#39;) | Student1 | Art | Fail | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Fail | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Fail | Fail |
| (&#39;Student1&#39;, &#39;Computer Science&#39;) | Student1 | Computer Science | Pass | Fail |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass | Pass |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass | Pass |
| (&#39;Student2&#39;, &#39;Math&#39;) | Student2 | Math | Pass | Pass |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Fail | Fail |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Pass | Fail |
| (&#39;Student2&#39;, &#39;Chemistry&#39;) | Student2 | Chemistry | Absent | Fail |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Pass | Fail |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Pass | Fail |
| (&#39;Student2&#39;, &#39;Art&#39;) | Student2 | Art | Fail | Fail |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass | Pass |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass | Pass |
| (&#39;Student2&#39;, &#39;Computer Science&#39;) | Student2 | Computer Science | Pass | Pass |
Approach 2: Using a lambda instead.  This works faster due to the lack of multiple loc commands inside the function, but still does not use vectorization and is quite slow.

Set course_status based on rule statuses for that audit

def set_course_status(i):
status_list = list(df.loc[[i]]['exam_status'])
if 'Fail' in status_list: return 'Fail'
elif 'Absent' in status_list: return 'Absent'
elif 'Pass' in status_list: return 'Pass'
df['course_status'] = df.groupby(df.index).apply(lambda x : set_course_status(x.name))


The results are the same as above.
Can anyone please come up with a way to do this using vectorization or some really fast method?
</details>
# 答案1
**得分**: 1
以下是翻译好的代码部分:
```python
import pandas as pd
d = [{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Math', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Absent'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},]
df = pd.DataFrame(d)
df.course_status = ''
# 创建一个元组,以便学生和课程可以作为一个索引跟踪:
df['name_course'] = df[['name', 'course']].apply(tuple, axis=1)
# 将元组设置为索引(希望能获得一些性能提升)
df = df.set_index(['name_course'], drop=True)
# 状态的顺序
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# 将状态映射为数字
df["course_status"] = df.exam_status.map(status_map)
# 获取每个name_course的最小course_status
df["course_status"] = df.groupby([
"name_course",
]).course_status.transform("min").map(
# 将映射回原始状态名称
{v: k for k, v in status_map.items()}
)
print(df)
# 时间比较通过%%timeit
# 你的第一次尝试
%%timeit
for i in df.index.unique():
    status_list = list(df.loc[[i]]['exam_status'])
    if 'Fail' in status_list: df.loc[[i], 'course_status']  = 'Fail'
    elif 'Absent' in status_list: df.loc[[i], 'course_status']  = 'Absent'
    elif 'Pass' or 'Skip' in status_list: df.loc[[i], 'course_status']  = 'Pass'
# 你的第二次尝试
%%timeit
# 根据规则为该审计设置course_status
def set_course_status(i):
    status_list = list(df.loc[[i]]['exam_status'])
    if 'Fail' in status_list: return 'Fail'
    elif 'Absent' in status_list: return 'Absent'
    elif 'Pass' in status_list: return 'Pass'
df['course_status'] = df.groupby(df.index).apply(lambda x : set_course_status(x.name))
# 我的尝试
%%timeit
# 状态的顺序
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# 将状态映射为数字
df["course_status"] = df.exam_status.map(status_map)
# 获取每个name_course的最小course_status
df["course_status"] = df.groupby([
    "name_course",
]).course_status.transform("min").map(
    # 将映射回原始状态名称
    {v: k for k, v in status_map.items()}
)
# 在10,000倍的行上进行测试
# 240_000行而不是24行
big = pd.concat((df for _ in range(10_000)))
# 注意:我将`df`替换为`big`
%%timeit
# 状态的顺序
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# 将状态映射为数字
big["course_status"] = big.exam_status.map(status_map)
# 获取每个name_course的最小course_status
big["course_status"] = big.groupby([
    "name_course",
]).course_status.transform("min").map(
    # 将映射回原始状态名称
    {v: k for k, v in status_map.items()}
)
英文:

How's this?

import pandas as pd


d = [{&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Absent&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Absent&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student1&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Math&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Chemistry&#39;, &#39;exam_status&#39;:&#39;Absent&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Art&#39;, &#39;exam_status&#39;:&#39;Fail&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},
     {&#39;name&#39;:&#39;Student2&#39;, &#39;course&#39;:&#39;Computer Science&#39;, &#39;exam_status&#39;:&#39;Pass&#39;},]

df = pd.DataFrame(d)
df.course_status = &#39;&#39;
# Create a tuple so the student and course can be tracked as one index:
df[&#39;name_course&#39;] = df[[&#39;name&#39;, &#39;course&#39;]].apply(tuple, axis=1)
# Set the tuple as the index (hoping to get a bit of performance boost)
df = df.set_index([&#39;name_course&#39;],drop=True)

# ordering of statuses
status_map = {&quot;Fail&quot;: 0, &quot;Absent&quot;: 1, &quot;Pass&quot;: 2}
# map statuses to numeric
df[&quot;course_status&quot;] = df.exam_status.map(status_map)
# get the min course_status per name_course
df[&quot;course_status&quot;] = df.groupby([
    &quot;name_course&quot;,
]).course_status.transform(&quot;min&quot;).map(
    # map back to original status names
    {v: k for k, v in status_map.items()}
)

print(df)
                                  name            course exam_status course_status
name_course                                                                       
(Student1, Math)              Student1              Math        Pass          Fail
(Student1, Math)              Student1              Math        Fail          Fail
(Student1, Math)              Student1              Math        Pass          Fail
(Student1, Chemistry)         Student1         Chemistry        Pass          Pass
(Student1, Chemistry)         Student1         Chemistry        Pass          Pass
(Student1, Chemistry)         Student1         Chemistry        Pass          Pass
(Student1, Art)               Student1               Art      Absent          Fail
(Student1, Art)               Student1               Art      Absent          Fail
(Student1, Art)               Student1               Art        Fail          Fail
(Student1, Computer Science)  Student1  Computer Science        Fail          Fail
(Student1, Computer Science)  Student1  Computer Science        Fail          Fail
(Student1, Computer Science)  Student1  Computer Science        Pass          Fail
(Student2, Math)              Student2              Math        Pass          Pass
(Student2, Math)              Student2              Math        Pass          Pass
(Student2, Math)              Student2              Math        Pass          Pass
(Student2, Chemistry)         Student2         Chemistry        Fail          Fail
(Student2, Chemistry)         Student2         Chemistry        Pass          Fail
(Student2, Chemistry)         Student2         Chemistry      Absent          Fail
(Student2, Art)               Student2               Art        Pass          Fail
(Student2, Art)               Student2               Art        Pass          Fail
(Student2, Art)               Student2               Art        Fail          Fail
(Student2, Computer Science)  Student2  Computer Science        Pass          Pass
(Student2, Computer Science)  Student2  Computer Science        Pass          Pass
(Student2, Computer Science)  Student2  Computer Science        Pass          Pass

Time comparisons via %%timeit

# your first attempt
%%timeit
for i in df.index.unique():
    status_list = list(df.loc[[i]][&#39;exam_status&#39;])
    if &#39;Fail&#39; in status_list: df.loc[[i], &#39;course_status&#39;]  = &#39;Fail&#39;
    elif &#39;Absent&#39; in status_list: df.loc[[i], &#39;course_status&#39;]  = &#39;Absent&#39;
    elif &#39;Pass&#39; or &#39;Skip&#39; in status_list: df.loc[[i], &#39;course_status&#39;]  = &#39;Pass&#39;
8.62 ms &#177; 1.91 ms per loop (mean &#177; std. dev. of 7 runs, 100 loops each)
# your second attempt
%%timeit
# Set course_status based on rule statuses for that audit
def set_course_status(i):
    status_list = list(df.loc[[i]][&#39;exam_status&#39;])
    if &#39;Fail&#39; in status_list: return &#39;Fail&#39;
    elif &#39;Absent&#39; in status_list: return &#39;Absent&#39;
    elif &#39;Pass&#39; in status_list: return &#39;Pass&#39;
df[&#39;course_status&#39;] = df.groupby(df.index).apply(lambda x : set_course_status(x.name))
6.79 ms &#177; 561 &#181;s per loop (mean &#177; std. dev. of 7 runs, 100 loops each)
# my attempt
%%timeit
# ordering of statuses
status_map = {&quot;Fail&quot;: 0, &quot;Absent&quot;: 1, &quot;Pass&quot;: 2}
# map statuses to numeric
df[&quot;course_status&quot;] = df.exam_status.map(status_map)
# get the min course_status per name_course
df[&quot;course_status&quot;] = df.groupby([
    &quot;name_course&quot;,
]).course_status.transform(&quot;min&quot;).map(
    # map back to original status names
    {v: k for k, v in status_map.items()}
)
1.99 ms &#177; 191 &#181;s per loop (mean &#177; std. dev. of 7 runs, 1,000 loops each)

Testing on 10,000x the rows

# 240_000 rows instead of 24
big = pd.concat((df for _ in range(10_000)))
# NOTE: I swapped `df` for `big`
%%timeit
# ordering of statuses
status_map = {&quot;Fail&quot;: 0, &quot;Absent&quot;: 1, &quot;Pass&quot;: 2}
# map statuses to numeric
big[&quot;course_status&quot;] = big.exam_status.map(status_map)
# get the min course_status per name_course
big[&quot;course_status&quot;] = big.groupby([
    &quot;name_course&quot;,
]).course_status.transform(&quot;min&quot;).map(
    # map back to original status names
    {v: k for k, v in status_map.items()}
)
33.7 ms &#177; 1.89 ms per loop (mean &#177; std. dev. of 7 runs, 10 loops each)

答案2

得分: 1

def addCourseStatus(df):
    has = pd.DataFrame({'absent': df.exam_status == 'Absent', 'fail': df.exam_status == 'Fail', 'pas': df.exam_status == 'Pass'}).groupby(level=0).any()
    df.loc[has.fail, 'course_status'] = 'Fail'
    df.loc[~has.fail & has.absent, 'course_status'] = 'Absent'
    df.loc[~has.fail & ~has.absent & has.pas, 'course_status'] = 'Pass'
英文:

def addCourseStatus(df):
    has=pd.DataFrame({&#39;absent&#39;:df.exam_status==&#39;Absent&#39;, &#39;fail&#39;:df.exam_status==&#39;Fail&#39;, &#39;pas&#39;:df.exam_status==&#39;Pass&#39;}).groupby(level=0).any()
    df.loc[has.fail,&#39;course_status&#39;]=&#39;Fail&#39;
    df.loc[~has.fail &amp; has.absent, &#39;course_status&#39;]=&#39;Absent&#39;
    df.loc[~has.fail &amp; ~has.absent &amp; has.pas, &#39;course_status&#39;]=&#39;Pass&#39;

Probably not optimal. I don't like the creation of a temporary dataframe. But vectorized.

It is faster, but barely (some 30% faster) than your two solutions for your example. Because it is vectorized indeed, but that are a few of those vectorized operations (3 comparisons, then the creation of the dataframe, then 3 booleans/loc operations).

But the more rows there will be, the more the vectorization will show, and the less those 7 operations will matter.

huangapple
  • 本文由 发表于 2023年7月27日 22:03:44
  • 转载请务必保留本文链接:https://go.coder-hub.com/76780552.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定