June 15, 2023, 04:02:55

Filter a MultiIndex DataFrame based on a condition from another DataFrame

Question

I'd appreciate your help.

I have a multiindex dataframe like this:

df1 = {'Sex': {('002_S_0413', 0, 'DTI'): 'F',
  ('002_S_0413', 0, 'T1'): 'F',
  ('002_S_4213', 2, 'DTI'): 'F',
  ('002_S_4213', 2, 'T1'): 'F',
  ('002_S_4799', 0, 'DTI'): 'M',
  ('002_S_4799', 0, 'T1'): 'M',
  ('002_S_5178', 0, 'DTI'): 'M',
  ('002_S_5178', 0, 'T1'): 'M',
  ('002_S_5230', 2, 'DTI'): 'F',
  ('002_S_5230', 2, 'T1'): 'F'},
 'DIAGNOSIS': {('002_S_0413', 0, 'DTI'): 1.0,
  ('002_S_0413', 0, 'T1'): 1.0,
  ('002_S_4213', 2, 'DTI'): 1.0,
  ('002_S_4213', 2, 'T1'): 1.0,
  ('002_S_4799', 0, 'DTI'): 1.0,
  ('002_S_4799', 0, 'T1'): 1.0,
  ('002_S_5178', 0, 'DTI'): 1.0,
  ('002_S_5178', 0, 'T1'): 1.0,
  ('002_S_5230', 2, 'DTI'): 1.0,
  ('002_S_5230', 2, 'T1'): 1.0}}

and a second dataframe:

df2 = {'Subject ID': {0: '002_S_0413',
  1: '002_S_0413',
  2: '002_S_4213',
  3: '002_S_4213',
  4: '002_S_4799',
  5: '002_S_4799',
  6: '002_S_4799',
  7: '002_S_5178',
  8: '002_S_5178',
  9: '002_S_5230',
  10: '002_S_5230',
  11: '002_S_5230',
  12: '002_S_6007',
  13: '002_S_6007'},
 'Visit_NUM': {0: 0,
  1: 2,
  2: 0,
  3: 2,
  4: 0,
  5: 1,
  6: 2,
  7: 0,
  8: 2,
  9: 0,
  10: 1,
  11: 2,
  12: 0,
  13: 1}}
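
The two literals above are plain dicts, while the code further down treats df1 and df2 as DataFrames and refers to index levels named 'Subject ID', 'Visit_NUM' and 'Description'. A minimal sketch of that conversion, on a trimmed two-subject subset of the data. The level names are an assumption inferred from the code in the question, and `df1_dict` / `df2_dict` are hypothetical names used only here:

```python
import pandas as pd

# Trimmed subset of the dicts above (two subjects only), for illustration
df1_dict = {'Sex': {('002_S_0413', 0, 'DTI'): 'F', ('002_S_0413', 0, 'T1'): 'F',
                    ('002_S_4213', 2, 'DTI'): 'F', ('002_S_4213', 2, 'T1'): 'F'},
            'DIAGNOSIS': {('002_S_0413', 0, 'DTI'): 1.0, ('002_S_0413', 0, 'T1'): 1.0,
                          ('002_S_4213', 2, 'DTI'): 1.0, ('002_S_4213', 2, 'T1'): 1.0}}
df2_dict = {'Subject ID': {0: '002_S_0413', 1: '002_S_0413', 2: '002_S_4213', 3: '002_S_4213'},
            'Visit_NUM': {0: 0, 1: 2, 2: 0, 3: 2}}

# Tuple keys become a three-level MultiIndex; the level names are assumed
df1 = pd.DataFrame(df1_dict).rename_axis(['Subject ID', 'Visit_NUM', 'Description'])
df2 = pd.DataFrame(df2_dict)

print(df1.index.names)
print(df2.columns.tolist())
```

With the levels named this way, calls such as `df1.index.get_level_values('Subject ID')` resolve by name rather than by position.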

I want to filter df1:
for each Subject ID (level=0) present in both dataframes, if there is a Visit_NUM in df2 that is greater than or equal to 2 plus the Visit_NUM in df1 (level=1), keep that subject's rows in df1; otherwise delete them.

To clarify: for each Subject ID, if (Visit_NUM_in_df2) >= (2 + Visit_NUM_in_df1), keep that subject's rows in df1; if not, delete them.

This is what I've done:

df3 = pd.DataFrame(df1.reset_index([
    'Visit_NUM', 'Description']).groupby(
    level=0)['Visit_NUM'].transform(lambda x: x + 2)).reset_index(
).drop_duplicates(['Subject ID'])
t = df3.merge(df2.reset_index(), on=['Subject ID', 'Visit_NUM'])
t = t['Subject ID']
out = df1.loc[df1.index.get_level_values('Subject ID').isin(t)]

The result would be something like:

out = {'Sex': {('002_S_0413', 0, 'DTI'): 'F',
  ('002_S_0413', 0, 'T1'): 'F',
  ('002_S_4799', 0, 'DTI'): 'M',
  ('002_S_4799', 0, 'T1'): 'M',
  ('002_S_5178', 0, 'DTI'): 'M',
  ('002_S_5178', 0, 'T1'): 'M'},
 'DIAGNOSIS': {('002_S_0413', 0, 'DTI'): 1.0,
  ('002_S_0413', 0, 'T1'): 1.0,
  ('002_S_4799', 0, 'DTI'): 1.0,
  ('002_S_4799', 0, 'T1'): 1.0,
  ('002_S_5178', 0, 'DTI'): 1.0,
  ('002_S_5178', 0, 'T1'): 1.0}}

With this approach I only get the subjects that have a visit in df2 at exactly Visit_NUM + 2, not those that have a visit >= (Visit_NUM + 2).
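
The exact-match behaviour comes from merging on both 'Subject ID' and 'Visit_NUM': a merge is an equality join, so only rows where the df2 visit equals the shifted df1 visit survive. One possible way around it, sketched here rather than offered as the definitive fix, is to merge on 'Subject ID' alone and apply the inequality afterwards. The data is rebuilt compactly from the question, and `subjects`, `rows`, `cols`, `base`, `pairs` and `ok` are illustrative names:

```python
import pandas as pd

# Data from the question, rebuilt compactly: (subject, Visit_NUM in df1, Sex)
subjects = [('002_S_0413', 0, 'F'), ('002_S_4213', 2, 'F'), ('002_S_4799', 0, 'M'),
            ('002_S_5178', 0, 'M'), ('002_S_5230', 2, 'F')]
rows = [(s, v, d, sex, 1.0) for s, v, sex in subjects for d in ('DTI', 'T1')]
cols = ['Subject ID', 'Visit_NUM', 'Description', 'Sex', 'DIAGNOSIS']
df1 = pd.DataFrame(rows, columns=cols).set_index(cols[:3])

df2 = pd.DataFrame({
    'Subject ID': ['002_S_0413'] * 2 + ['002_S_4213'] * 2 + ['002_S_4799'] * 3
                  + ['002_S_5178'] * 2 + ['002_S_5230'] * 3 + ['002_S_6007'] * 2,
    'Visit_NUM': [0, 2, 0, 2, 0, 1, 2, 0, 2, 0, 1, 2, 0, 1],
})

# One row per (subject, df1 visit); the suffixes keep the two visit columns apart
base = df1.index.to_frame(index=False)[['Subject ID', 'Visit_NUM']].drop_duplicates()
pairs = base.merge(df2, on='Subject ID', suffixes=('_df1', '_df2'))

# Inequality applied after the join, instead of an exact match on Visit_NUM + 2
ok = pairs.loc[pairs['Visit_NUM_df2'] >= pairs['Visit_NUM_df1'] + 2, 'Subject ID']
out = df1.loc[df1.index.get_level_values('Subject ID').isin(ok)]
print(out.index.get_level_values('Subject ID').unique().tolist())
```

On the sample data this keeps 002_S_0413, 002_S_4799 and 002_S_5178, which matches the expected result shown in the question. The intermediate `pairs` frame has one row per (df1 visit, df2 visit) combination for each subject, so it can grow large when subjects have many visits.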

Also, I believe there is an easy way to do this, thanks!

Answer 1

Score: 0

Try this instead:

max_visit_num = df2.groupby('Subject ID')['Visit_NUM'].max()
idx = df1.index
# Keep rows whose subject has a df2 visit >= the row's own Visit_NUM (level 1) + 2
out = df1.loc[idx.get_level_values(0).map(max_visit_num) >= idx.get_level_values(1) + 2]

Answer 2

Score: 0

A friend has helped me with this answer:

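import numpy as np

# Unique subjects present in df1 (index level 'Subject ID')
list_subjects = np.unique(df1.reset_index()['Subject ID'])
tt = []
for name in list_subjects:
    # All visits recorded for this subject in df2
    visit = df2[df2['Subject ID'] == name]['Visit_NUM']
    # Sorted unique visits of this subject in df1, so [0] is the earliest
    visit_min = np.unique(df1.loc[name].reset_index()['Visit_NUM'])
    # Keep the subject if any df2 visit is at least 2 after that earliest visit
    if any(visit >= visit_min[0] + 2):
        tt.append(name)

out = df1.loc[df1.index.get_level_values('Subject ID').isin(tt)]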
