英文:
Vectorize the assignment of a column in a pandas dataframe where a custom index has many rows and the column value is set using all rows in the index
问题
I cannot share the actual data so I have made up a similar situation using student courses.
在实际数据中,有6种可能的状态。在这里我减少到了3种。我大约有30万行数据,并且预计不久将增长到100倍。
I need to set a value in a column based upon the evaluation of another column across multiple rows that share an index.
我需要根据对共享索引的多行的另一列的评估来设置列中的一个值。
I can do this using iteration or a lambda function but they both take over 5 minutes on a Mac (it will be 10x faster at least when deployed to Linux) to run and when my data is increased, neither solutions will not be an option. I need to find out how to vectorize this.
我可以使用迭代或lambda函数来做到这一点,但它们在Mac上运行需要超过5分钟(在部署到Linux时至少会快10倍),而且当我的数据增加时,这两种解决方案都不是选项。我需要找出如何进行向量化。
Data snippet:
数据片段:
d = [{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Math', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Absent'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},]
df = pd.DataFrame(d)
df.course_status = ''
# Create a tuple so the student and course can be tracked as one index:
df['name_course'] = df[['name', 'course']].apply(tuple, axis=1)
# Set the tuple as the index (hoping to get a bit of performance boost)
df = df.set_index(['name_course'], drop=True)
name_course | name | course | exam_status |
---|---|---|---|
('Student1', 'Math') | Student1 | Math | Pass |
('Student1', 'Math') | Student1 | Math | Fail |
('Student1', 'Math') | Student1 | Math | Pass |
('Student1', 'Chemistry') | Student1 | Chemistry | Pass |
('Student1', 'Chemistry') | Student1 | Chemistry | Pass |
('Student1', 'Chemistry') | Student1 | Chemistry | Pass |
('Student1', 'Art') | Student1 | Art | Absent |
('Student1', 'Art') | Student1 | Art | Absent |
('Student1', 'Art') | Student1 | Art | Fail |
('Student1', 'Computer Science') | Student1 | Computer Science | Fail |
('Student1', 'Computer Science') | Student1 | Computer Science | Fail |
('Student1', 'Computer Science') | Student1 | Computer Science | Pass |
('Student2', 'Math') | Student2 | Math | Pass |
('Student2', 'Math') | Student2 | Math | Pass |
('Student2', 'Math') | Student2 | Math | Pass |
('Student2', 'Chemistry') | Student2 | Chemistry | Fail |
('Student2', 'Chemistry') | Student2 | Chemistry | Pass |
('Student2', 'Chemistry') | Student2 | Chemistry | Absent |
('Student2', 'Art') | Student2 | Art | Pass |
('Student2', 'Art') | Student2 | Art | Pass |
('Student2', 'Art') | Student2 | Art | Fail |
('Student2', 'Computer Science') | Student2 | Computer Science | Pass |
('Student2', 'Computer Science') | Student2 | Computer Science | Pass |
('Student2', 'Computer Science') | Student2 | Computer Science | Pass |
Both of these coding techniques below work but basically iterate over the unique student/course tuple and with hundreds of millions of rows will be too slow:
下面的这两种编码技巧都能工作,但基本上是迭代唯一的学生/课程元组,对于数亿行来说将会太慢:
Approach 1: Iteration using a unique index which greatly cuts down on the loop count because there are about 50-150 rows for each index in the full data. However, I cannot figure out how to get rid of the multiple loc commands in the script. I cannot use an 'at' because there are multiple rows returned by loc. At any case, if I could do that, it would only make it as fast as the lambda approach.
方法1:使用唯一索引进行迭代,这大大减少了循环计数,因为完整数据中每个索引大约有50-150行。但是,我无法弄清楚如何在脚本中摆脱多个loc命令。我不能使用“at”因为loc返回了多行。无论如何,如果我能做到这一点,它只会使速度与lambda方法一样快。
for i in df.index.unique():
status
<details>
<summary>英文:</summary>
I cannot share the actual data so I have made up a similar situation using student courses.
In the real data there are 6 possible statuses. I have cut it down to 3 here. I have approximately 300k rows and am expecting this to grow by 100x soon.
I need to set a value in a column based upon the evaluation of another column across multiple rows that share an index.
I can do this using iteration or a lambda function but they both take over 5 minutes on a Mac (it will be 10x faster at least when deployed to Linux) to run and when my data is increased, neither solutions will not be an option. I need to find out how to vectorize this.
Data snippet:
d = [{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Math', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Absent'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},]
df = pd.DataFrame(d)
df.course_status = ''
Create a tuple so the student and course can be tracked as one index:
df['name_course'] = df[['name', 'course']].apply(tuple, axis=1)
Set the tuple as the index (hoping to get a bit of performance boost)
df = df.set_index(['name_course'],drop=True)
| name_course | name | course | exam_status |
|:--- |:--- |:--- |:--- |
| ('Student1', 'Math') | Student1 | Math | Pass |
| ('Student1', 'Math') | Student1 | Math | Fail |
| ('Student1', 'Math') | Student1 | Math | Pass |
| ('Student1', 'Chemistry') | Student1 | Chemistry | Pass |
| ('Student1', 'Chemistry') | Student1 | Chemistry | Pass |
| ('Student1', 'Chemistry') | Student1 | Chemistry | Pass |
| ('Student1', 'Art') | Student1 | Art | Absent |
| ('Student1', 'Art') | Student1 | Art | Absent |
| ('Student1', 'Art') | Student1 | Art | Fail |
| ('Student1', 'Computer Science') | Student1 | Computer Science | Fail |
| ('Student1', 'Computer Science') | Student1 | Computer Science | Fail |
| ('Student1', 'Computer Science') | Student1 | Computer Science | Pass |
| ('Student2', 'Math') | Student2 | Math | Pass |
| ('Student2', 'Math') | Student2 | Math | Pass |
| ('Student2', 'Math') | Student2 | Math | Pass |
| ('Student2', 'Chemistry') | Student2 | Chemistry | Fail |
| ('Student2', 'Chemistry') | Student2 | Chemistry | Pass |
| ('Student2', 'Chemistry') | Student2 | Chemistry | Absent |
| ('Student2', 'Art') | Student2 | Art | Pass |
| ('Student2', 'Art') | Student2 | Art | Pass |
| ('Student2', 'Art') | Student2 | Art | Fail |
| ('Student2', 'Computer Science') | Student2 | Computer Science | Pass |
| ('Student2', 'Computer Science') | Student2 | Computer Science | Pass |
| ('Student2', 'Computer Science') | Student2 | Computer Science | Pass |
Both of these coding techniques below work but basically iterate over the unique student/course tuple and with hundreds of millions of rows will be too slow:
Approach 1: Iteration using a unique index which greatly cuts down on the loop count because there are about 50-150 rows for each index in the full data. However, I cannot figure out how to get rid of the multiple loc commands in the script. I cannot use an 'at' because there are multiple rows returned by loc. At any case, if I could do that, it would only make it as fast as the lambda approach.
for i in df.index.unique():
status_list = list(df.loc[[i]]['exam_status'])
if 'Fail' in status_list: df.loc[[i], 'course_status'] = 'Fail'
elif 'Absent' in status_list: df.loc[[i], 'course_status'] = 'Absent'
elif 'Pass' or 'Skip' in status_list: df.loc[[i], 'course_status'] = 'Pass'
print(df.to_markdown())
print(df.to_markdown())
| name_course | name | course | exam_status | course_status |
|:--- |:--- |:--- |:--- |:--- |
| ('Student1', 'Math') | Student1 | Math | Pass | Fail |
| ('Student1', 'Math') | Student1 | Math | Fail | Fail |
| ('Student1', 'Math') | Student1 | Math | Pass | Fail |
| ('Student1', 'Chemistry') | Student1 | Chemistry | Pass | Pass |
| ('Student1', 'Chemistry') | Student1 | Chemistry | Pass | Pass |
| ('Student1', 'Chemistry') | Student1 | Chemistry | Pass | Pass |
| ('Student1', 'Art') | Student1 | Art | Absent | Fail |
| ('Student1', 'Art') | Student1 | Art | Absent | Fail |
| ('Student1', 'Art') | Student1 | Art | Fail | Fail |
| ('Student1', 'Computer Science') | Student1 | Computer Science | Fail | Fail |
| ('Student1', 'Computer Science') | Student1 | Computer Science | Fail | Fail |
| ('Student1', 'Computer Science') | Student1 | Computer Science | Pass | Fail |
| ('Student2', 'Math') | Student2 | Math | Pass | Pass |
| ('Student2', 'Math') | Student2 | Math | Pass | Pass |
| ('Student2', 'Math') | Student2 | Math | Pass | Pass |
| ('Student2', 'Chemistry') | Student2 | Chemistry | Fail | Fail |
| ('Student2', 'Chemistry') | Student2 | Chemistry | Pass | Fail |
| ('Student2', 'Chemistry') | Student2 | Chemistry | Absent | Fail |
| ('Student2', 'Art') | Student2 | Art | Pass | Fail |
| ('Student2', 'Art') | Student2 | Art | Pass | Fail |
| ('Student2', 'Art') | Student2 | Art | Fail | Fail |
| ('Student2', 'Computer Science') | Student2 | Computer Science | Pass | Pass |
| ('Student2', 'Computer Science') | Student2 | Computer Science | Pass | Pass |
| ('Student2', 'Computer Science') | Student2 | Computer Science | Pass | Pass |
Approach 2: Using a lambda instead. This works faster due to the lack of multiple loc commands inside the function, but still does not use vectorization and is quite slow.
Set course_status based on rule statuses for that audit
def set_course_status(i):
status_list = list(df.loc[[i]]['exam_status'])
if 'Fail' in status_list: return 'Fail'
elif 'Absent' in status_list: return 'Absent'
elif 'Pass' in status_list: return 'Pass'
df['course_status'] = df.groupby(df.index).apply(lambda x : set_course_status(x.name))
The results are the same as above.
Can anyone please come up with a way to do this using vectorization or some really fast method?
</details>
# 答案1
**得分**: 1
以下是翻译好的代码部分:
```python
import pandas as pd
d = [{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Math', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Absent'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},]
df = pd.DataFrame(d)
df.course_status = ''
# 创建一个元组,以便学生和课程可以作为一个索引跟踪:
df['name_course'] = df[['name', 'course']].apply(tuple, axis=1)
# 将元组设置为索引(希望能获得一些性能提升)
df = df.set_index(['name_course'], drop=True)
# 状态的顺序
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# 将状态映射为数字
df["course_status"] = df.exam_status.map(status_map)
# 获取每个name_course的最小course_status
df["course_status"] = df.groupby([
"name_course",
]).course_status.transform("min").map(
# 将映射回原始状态名称
{v: k for k, v in status_map.items()}
)
print(df)
# 时间比较通过%%timeit
# 你的第一次尝试
%%timeit
for i in df.index.unique():
status_list = list(df.loc[[i]]['exam_status'])
if 'Fail' in status_list: df.loc[[i], 'course_status'] = 'Fail'
elif 'Absent' in status_list: df.loc[[i], 'course_status'] = 'Absent'
elif 'Pass' or 'Skip' in status_list: df.loc[[i], 'course_status'] = 'Pass'
# 你的第二次尝试
%%timeit
# 根据规则为该审计设置course_status
def set_course_status(i):
status_list = list(df.loc[[i]]['exam_status'])
if 'Fail' in status_list: return 'Fail'
elif 'Absent' in status_list: return 'Absent'
elif 'Pass' in status_list: return 'Pass'
df['course_status'] = df.groupby(df.index).apply(lambda x : set_course_status(x.name))
# 我的尝试
%%timeit
# 状态的顺序
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# 将状态映射为数字
df["course_status"] = df.exam_status.map(status_map)
# 获取每个name_course的最小course_status
df["course_status"] = df.groupby([
"name_course",
]).course_status.transform("min").map(
# 将映射回原始状态名称
{v: k for k, v in status_map.items()}
)
# 在10,000倍的行上进行测试
# 240_000行而不是24行
big = pd.concat((df for _ in range(10_000)))
# 注意:我将`df`替换为`big`
%%timeit
# 状态的顺序
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# 将状态映射为数字
big["course_status"] = big.exam_status.map(status_map)
# 获取每个name_course的最小course_status
big["course_status"] = big.groupby([
"name_course",
]).course_status.transform("min").map(
# 将映射回原始状态名称
{v: k for k, v in status_map.items()}
)
英文:
How's this?
import pandas as pd
d = [{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Math', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Absent'},
{'name':'Student1', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Fail'},
{'name':'Student1', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Math', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Chemistry', 'exam_status':'Absent'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Art', 'exam_status':'Fail'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},
{'name':'Student2', 'course':'Computer Science', 'exam_status':'Pass'},]
df = pd.DataFrame(d)
df.course_status = ''
# Create a tuple so the student and course can be tracked as one index:
df['name_course'] = df[['name', 'course']].apply(tuple, axis=1)
# Set the tuple as the index (hoping to get a bit of performance boost)
df = df.set_index(['name_course'],drop=True)
# ordering of statuses
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# map statuses to numeric
df["course_status"] = df.exam_status.map(status_map)
# get the min course_status per name_course
df["course_status"] = df.groupby([
"name_course",
]).course_status.transform("min").map(
# map back to original status names
{v: k for k, v in status_map.items()}
)
print(df)
name course exam_status course_status
name_course
(Student1, Math) Student1 Math Pass Fail
(Student1, Math) Student1 Math Fail Fail
(Student1, Math) Student1 Math Pass Fail
(Student1, Chemistry) Student1 Chemistry Pass Pass
(Student1, Chemistry) Student1 Chemistry Pass Pass
(Student1, Chemistry) Student1 Chemistry Pass Pass
(Student1, Art) Student1 Art Absent Fail
(Student1, Art) Student1 Art Absent Fail
(Student1, Art) Student1 Art Fail Fail
(Student1, Computer Science) Student1 Computer Science Fail Fail
(Student1, Computer Science) Student1 Computer Science Fail Fail
(Student1, Computer Science) Student1 Computer Science Pass Fail
(Student2, Math) Student2 Math Pass Pass
(Student2, Math) Student2 Math Pass Pass
(Student2, Math) Student2 Math Pass Pass
(Student2, Chemistry) Student2 Chemistry Fail Fail
(Student2, Chemistry) Student2 Chemistry Pass Fail
(Student2, Chemistry) Student2 Chemistry Absent Fail
(Student2, Art) Student2 Art Pass Fail
(Student2, Art) Student2 Art Pass Fail
(Student2, Art) Student2 Art Fail Fail
(Student2, Computer Science) Student2 Computer Science Pass Pass
(Student2, Computer Science) Student2 Computer Science Pass Pass
(Student2, Computer Science) Student2 Computer Science Pass Pass
Time comparisons via %%timeit
# your first attempt
%%timeit
for i in df.index.unique():
status_list = list(df.loc[[i]]['exam_status'])
if 'Fail' in status_list: df.loc[[i], 'course_status'] = 'Fail'
elif 'Absent' in status_list: df.loc[[i], 'course_status'] = 'Absent'
elif 'Pass' or 'Skip' in status_list: df.loc[[i], 'course_status'] = 'Pass'
8.62 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# your second attempt
%%timeit
# Set course_status based on rule statuses for that audit
def set_course_status(i):
status_list = list(df.loc[[i]]['exam_status'])
if 'Fail' in status_list: return 'Fail'
elif 'Absent' in status_list: return 'Absent'
elif 'Pass' in status_list: return 'Pass'
df['course_status'] = df.groupby(df.index).apply(lambda x : set_course_status(x.name))
6.79 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# my attempt
%%timeit
# ordering of statuses
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# map statuses to numeric
df["course_status"] = df.exam_status.map(status_map)
# get the min course_status per name_course
df["course_status"] = df.groupby([
"name_course",
]).course_status.transform("min").map(
# map back to original status names
{v: k for k, v in status_map.items()}
)
1.99 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Testing on 10,000x the rows
# 240_000 rows instead of 24
big = pd.concat((df for _ in range(10_000)))
# NOTE: I swapped `df` for `big`
%%timeit
# ordering of statuses
status_map = {"Fail": 0, "Absent": 1, "Pass": 2}
# map statuses to numeric
big["course_status"] = big.exam_status.map(status_map)
# get the min course_status per name_course
big["course_status"] = big.groupby([
"name_course",
]).course_status.transform("min").map(
# map back to original status names
{v: k for k, v in status_map.items()}
)
33.7 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
答案2
得分: 1
def addCourseStatus(df):
has = pd.DataFrame({'absent': df.exam_status == 'Absent', 'fail': df.exam_status == 'Fail', 'pas': df.exam_status == 'Pass'}).groupby(level=0).any()
df.loc[has.fail, 'course_status'] = 'Fail'
df.loc[~has.fail & has.absent, 'course_status'] = 'Absent'
df.loc[~has.fail & ~has.absent & has.pas, 'course_status'] = 'Pass'
英文:
def addCourseStatus(df):
has=pd.DataFrame({'absent':df.exam_status=='Absent', 'fail':df.exam_status=='Fail', 'pas':df.exam_status=='Pass'}).groupby(level=0).any()
df.loc[has.fail,'course_status']='Fail'
df.loc[~has.fail & has.absent, 'course_status']='Absent'
df.loc[~has.fail & ~has.absent & has.pas, 'course_status']='Pass'
Probably not optimal. I don't like the creation of a temporary dataframe. But vectorized.
It is faster, but barely (some 30% faster) than your two solutions for your example. Because it is vectorized indeed, but that are a few of those vectorized operations (3 comparisons, then the creation of the dataframe, then 3 booleans/loc operations).
But the more rows there will be, the more the vectorization will show, and the less those 7 operations will matter.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论