根据绝对差异筛选数据框。

huangapple go评论109阅读模式
英文:

Filter data frame based on absolute difference

问题

以下是你提供的代码的翻译部分:

  1. 我有以下数据框
  2. ```python
  3. import pandas as pd
  4. d1 = {'id': ["car", "car", "car", "plane", "plane", "car"], 'value': [1, 1.2, 5, 6, 1.3, 0.8]}
  5. df1 = pd.DataFrame(data=d1)
  6. df1
  1. id value
  2. 0 car 1.0
  3. 1 car 1.2
  4. 2 car 5.0
  5. 3 plane 6.0
  6. 4 plane 1.3
  7. 5 car 0.8

我想要过滤掉行,如果值的所有差异都小于1,那么我会得到以下数据框:

  1. d2 = {'id': ["car", "car", "car"], 'value': [1, 1.2, 0.8]}
  2. df2 = pd.DataFrame(data=d2)
  3. df2
  1. id value
  2. 0 car 1.0
  3. 1 car 1.2
  4. 5 car 0.8

  1. d3 = {'id': ["car", "plane", "plane"], 'value': [5, 6, 1.3]}
  2. df3 = pd.DataFrame(data=d3)
  3. df3
  1. 2 car 5.0
  2. 3 plane 6.0
  3. 4 plane 1.3

我尝试了以下函数来将所有值保存在一个临时列表中,但它没有正常工作:

  1. unique_list = []
  2. def unique_2(df):
  3. for id_1, value_1 in zip(df["id"], df["value"]):
  4. for id_2, value_2 in zip(df["id"], df["value"]):
  5. if id_1 == id_2:
  6. if abs(value_1-value_2) > 0.01:
  7. x = True
  8. unique_list.append(x)
  9. else:
  10. x = False
  11. unique_list.append(x)
  12. else:
  13. pass
  1. <details>
  2. <summary>英文:</summary>
  3. I have the following data frame:
  4. ```python
  5. import pandas as pd
  6. d1 = {&#39;id&#39;: [&quot;car&quot;, &quot;car&quot;, &quot;car&quot;, &quot;plane&quot;, &quot;plane&quot;, &quot;car&quot;], &#39;value&#39;: [1, 1.2, 5, 6, 1.3, 0.8]}
  7. df1 = pd.DataFrame(data=d1)
  8. df1
  1. id value
  2. 0 car 1.0
  3. 1 car 1.2
  4. 2 car 5.0
  5. 3 plane 6.0
  6. 4 plane 1.3
  7. 5 car 0.8

I want to filter rows out, if all differences for a value are smaller than 1, so I get the following data frames:

  1. d2 = {&#39;id&#39;: [&quot;car&quot;, &quot;car&quot;, &quot;car&quot;], &#39;value&#39;: [1, 1.2, 0.8]}
  2. df2 = pd.DataFrame(data=d2)
  3. df2
  1. id value
  2. 0 car 1.0
  3. 1 car 1.2
  4. 5 car 0.8

and

  1. d3 = {&#39;id&#39;: [&quot;car&quot;, &quot;plane&quot;, &quot;plane&quot;], &#39;value&#39;: [5, 6, 1.3]}
  2. df3 = pd.DataFrame(data=d3)
  3. df3
  1. 2 car 5.0
  2. 3 plane 6.0
  3. 4 plane 1.3

I tried the following function to save all values in a temporary list, but it did not work properly:

  1. unique_list = []
  2. def unique_2(df):
  3. for id_1, value_1 in zip(df[&quot;id&quot;], df[&quot;value&quot;]):
  4. for id_2, value_2 in zip(df[&quot;id&quot;], df[&quot;value&quot;]):
  5. if id_1 == id_2:
  6. if abs(value_1-value_2) &gt; 0.01:
  7. x = True
  8. unique_list.append(x)
  9. else:
  10. x = False
  11. unique_list.append(x)
  12. else:
  13. pass

答案1

得分: 2

以下是代码部分的中文翻译:

你可以使用自定义的 groupby 来拆分数据:

  1. grp = df1['value'].sort_values().diff().gt(1).cumsum()
  2. out = [g for _, g in df1.groupby(grp)]

请注意,不清楚你是否想要使用 <1 还是 ≤1 作为阈值。如果你想要 <1,请将 gt(1) 替换为 ge(1)

输出:

  1. [ id value
  2. 0 car 1.0
  3. 1 car 1.2
  4. 5 car 0.8,
  5. id value
  6. 2 car 5.0
  7. 3 plane 6.0,
  8. id value
  9. 4 plane 16.0]

中间的 grp

  1. 5 0
  2. 0 0
  3. 1 0
  4. 2 1
  5. 3 1
  6. 4 2
  7. Name: value, dtype: int64

将单独的行分组在一起

如果你有不同的解释,如果你想将单独的行(即没有其他行与其相隔不超过1)分组在一起,可以使用以下代码:

  1. grp = df1['value'].sort_values().diff().ge(1).cumsum()
  2. grp = grp.mask(df1.groupby(grp).transform('size').eq(1), 'alone')
  3. out = [g for _, g in df1.groupby(grp)]

请注意,我们只将相隔不超过1的行分组在一起。

输出:

  1. [ id value
  2. 0 car 1.0
  3. 1 car 1.2
  4. 5 car 0.8,
  5. id value
  6. 2 car 5.0
  7. 3 plane 6.0
  8. 4 plane 16.0]

中间的 grp

  1. 5 0
  2. 0 0
  3. 1 0
  4. 2 alone
  5. 3 alone
  6. 4 alone
  7. Name: value, dtype: object
按ID分组:
  1. grp = df1.sort_values(by='value').groupby('id', group_keys=False)['value'].apply(lambda g: g.diff().gt(1).cumsum())
  2. grp = grp.mask(df1.groupby(['id', grp]).transform('size').eq(1), 'alone')
  3. out = [g for _, g in df1.groupby(grp)]

输出:

  1. [ id value
  2. 0 car 1.0
  3. 1 car 1.2
  4. 5 car 0.8,
  5. id value
  6. 2 car 5.0
  7. 3 plane 6.0
  8. 4 plane 1.3]
英文:

You can use a custom groupby to split the data:

  1. grp = df1[&#39;value&#39;].sort_values().diff().gt(1).cumsum()
  2. out = [g for _, g in df1.groupby(grp)]

Note that it wasn't clear whether you want to use &lt;1 or ≤1 as threshold. If you want &lt;1 replace gt(1) by ge(1).

Output:

  1. [ id value
  2. 0 car 1.0
  3. 1 car 1.2
  4. 5 car 0.8,
  5. id value
  6. 2 car 5.0
  7. 3 plane 6.0,
  8. id value
  9. 4 plane 16.0]

Intermediate grp:

  1. 5 0
  2. 0 0
  3. 1 0
  4. 2 1
  5. 3 1
  6. 4 2
  7. Name: value, dtype: int64

grouping loners together

Assuming a different interpretation, if you want to groups loners (=rows that have no other row within 1) together, use:

  1. grp = df1[&#39;value&#39;].sort_values().diff().ge(1).cumsum()
  2. grp = grp.mask(df1.groupby(grp).transform(&#39;size&#39;).eq(1), &#39;alone&#39;)
  3. out = [g for _, g in df1.groupby(grp)]

Note that we're only grouping rows that are less than 1 apart

Output:

  1. [ id value
  2. 0 car 1.0
  3. 1 car 1.2
  4. 5 car 0.8,
  5. id value
  6. 2 car 5.0
  7. 3 plane 6.0
  8. 4 plane 16.0]

Intermediate grp:

  1. 5 0
  2. 0 0
  3. 1 0
  4. 2 alone
  5. 3 alone
  6. 4 alone
  7. Name: value, dtype: object
by ID:
  1. grp = df1.sort_values(by=&#39;value&#39;).groupby(&#39;id&#39;, group_keys=False)[&#39;value&#39;].apply(lambda g: g.diff().gt(1).cumsum())
  2. grp = grp.mask(df1.groupby([&#39;id&#39;, grp]).transform(&#39;size&#39;).eq(1), &#39;alone&#39;)
  3. out = [g for _, g in df1.groupby(grp)]

Output:

  1. [ id value
  2. 0 car 1.0
  3. 1 car 1.2
  4. 5 car 0.8,
  5. id value
  6. 2 car 5.0
  7. 3 plane 6.0
  8. 4 plane 1.3]

答案2

得分: 1

以下是代码部分的翻译:

  1. # 使用numpy广播获取列`value`的差异,获取绝对值,与1进行比较,将对角线上的值设置为False:
  2. a = df1['value'].to_numpy()
  3. m = np.abs(a - a[:, None]) < 1
  4. np.fill_diagonal(m, False)
  5. print (m)
  6. [[False True False False False True]
  7. [ True False False False False True]
  8. [False False False False False False]
  9. [False False False False False False]
  10. [False False False False False False]
  11. [ True True False False False False]]
  12. # 最后根据每行至少有一个`True`来过滤行:
  13. mask = np.any(m, axis=1)
  14. df11, df22 = df1[mask], df1[~mask]
  15. print (df11)
  16. id value
  17. 0 car 1.0
  18. 1 car 1.2
  19. 5 car 0.8
  20. print (df22)
  21. id value
  22. 2 car 5.0
  23. 3 plane 6.0
  24. 4 plane 16.0

希望这些翻译对您有所帮助。如果您有其他问题或需要进一步的翻译,请随时告诉我。

英文:

Het differencies with column value with numpy broadcasting, get absolute values, comapre less like 1 with set False to diagonal:

  1. a = df1[&#39;value&#39;].to_numpy()
  2. m = np.abs(a - a[:, None]) &lt; 1
  3. np.fill_diagonal(m, False)
  4. print (m)
  5. [[False True False False False True]
  6. [ True False False False False True]
  7. [False False False False False False]
  8. [False False False False False False]
  9. [False False False False False False]
  10. [ True True False False False False]]

Last fitler rows with at least one True per rows:

  1. mask = np.any(m, axis=1)
  2. df11, df22 = df1[mask], df1[~mask]
  3. print (df11)
  4. id value
  5. 0 car 1.0
  6. 1 car 1.2
  7. 5 car 0.8
  8. print (df22)
  9. id value
  10. 2 car 5.0
  11. 3 plane 6.0
  12. 4 plane 16.0

答案3

得分: 1

我无法提供解决方案,因为逻辑不清楚:

我想要过滤行,如果一个值的所有差异都小于1

  1. def debug(sr):
  2. print(f'[{sr.name}]')
  3. arr = sr.values
  4. val = np.abs(sr.values - sr.values[:, None])
  5. print(pd.DataFrame(val, sr.tolist(), sr.tolist()))
  6. print()
  7. return np.max(val)
  8. df1.groupby('id')['value'].transform(debug)

输出:

  1. [car]
  2. 1.0 1.2 5.0 0.8
  3. 1.0 0.0 0.2 4.0 0.2 # 一个差异 > 1
  4. 1.2 0.2 0.0 3.8 0.4 # 一个差异 > 1
  5. 5.0 4.0 3.8 0.0 4.2 # 3个差异 > 1
  6. 0.8 0.2 0.4 4.2 0.0 # 1个差异 > 1
  7. [plane]
  8. 6.0 16.0
  9. 6.0 0.0 10.0
  10. 16.0 10.0 0.0
  11. 0 4.2 # car组的差异 > 1
  12. 1 4.2
  13. 2 4.2
  14. 3 10.0 # plane组的差异 > 1
  15. 4 10.0
  16. 5 4.2
  17. Name: value, dtype: float64

如您所见,对于每种组合,至少有一个值的差异大于1。因此,对于给定的组,您无法将其分成两部分。您只能将整个组设置为两个列表中的一个:

  • 组1:所有绝对差异小于等于1
  • 组2:至少有一个差异大于1
英文:

I have no solution because the logic is unclear:

> I want to filter rows out, if all differences for a value are smaller than 1

  1. def debug(sr):
  2. print(f&#39;[{sr.name}]&#39;)
  3. arr = sr.values
  4. val = np.abs(sr.values - sr.values[:, None])
  5. print(pd.DataFrame(val, sr.tolist(), sr.tolist()))
  6. print()
  7. return np.max(val)
  8. df1.groupby(&#39;id&#39;)[&#39;value&#39;].transform(debug)

Output:

  1. [car]
  2. 1.0 1.2 5.0 0.8
  3. 1.0 0.0 0.2 4.0 0.2 # one difference &gt; 1
  4. 1.2 0.2 0.0 3.8 0.4 # one difference &gt; 1
  5. 5.0 4.0 3.8 0.0 4.2 # 3 differences &gt; 1
  6. 0.8 0.2 0.4 4.2 0.0 # 1 difference &gt; 1
  7. [plane]
  8. 6.0 16.0
  9. 6.0 0.0 10.0
  10. 16.0 10.0 0.0
  11. 0 4.2 # car group difference &gt; 1
  12. 1 4.2
  13. 2 4.2
  14. 3 10.0 # plane group difference &gt; 1
  15. 4 10.0
  16. 5 4.2
  17. Name: value, dtype: float64

As you can see, for each combination, there is at least one value whose difference is greater than 1. So for a given group, you can't split it into two parts. You can only set the whole group to one of the two lists:

  • group1: all absolute differences is lower or equals than 1
  • group2: at least one difference is greater than 1

huangapple
  • 本文由 发表于 2023年2月27日 18:37:58
  • 转载请务必保留本文链接:https://go.coder-hub.com/75579387.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定