英文:
Filter data frame based on absolute difference
问题
以下是你提供的代码的翻译部分:
我有以下数据框:
```python
import pandas as pd
d1 = {'id': ["car", "car", "car", "plane", "plane", "car"], 'value': [1, 1.2, 5, 6, 1.3, 0.8]}
df1 = pd.DataFrame(data=d1)
df1
id value
0 car 1.0
1 car 1.2
2 car 5.0
3 plane 6.0
4 plane 1.3
5 car 0.8
我想要过滤掉行,如果值的所有差异都小于1,那么我会得到以下数据框:
d2 = {'id': ["car", "car", "car"], 'value': [1, 1.2, 0.8]}
df2 = pd.DataFrame(data=d2)
df2
id value
0 car 1.0
1 car 1.2
5 car 0.8
和
d3 = {'id': ["car", "plane", "plane"], 'value': [5, 6, 1.3]}
df3 = pd.DataFrame(data=d3)
df3
2 car 5.0
3 plane 6.0
4 plane 1.3
我尝试了以下函数来将所有值保存在一个临时列表中,但它没有正常工作:
unique_list = []
def unique_2(df):
for id_1, value_1 in zip(df["id"], df["value"]):
for id_2, value_2 in zip(df["id"], df["value"]):
if id_1 == id_2:
if abs(value_1-value_2) > 0.01:
x = True
unique_list.append(x)
else:
x = False
unique_list.append(x)
else:
pass
<details>
<summary>英文:</summary>
I have the following data frame:
```python
import pandas as pd
d1 = {'id': ["car", "car", "car", "plane", "plane", "car"], 'value': [1, 1.2, 5, 6, 1.3, 0.8]}
df1 = pd.DataFrame(data=d1)
df1
id value
0 car 1.0
1 car 1.2
2 car 5.0
3 plane 6.0
4 plane 1.3
5 car 0.8
I want to filter rows out, if all differences for a value are smaller than 1, so I get the following data frames:
d2 = {'id': ["car", "car", "car"], 'value': [1, 1.2, 0.8]}
df2 = pd.DataFrame(data=d2)
df2
id value
0 car 1.0
1 car 1.2
5 car 0.8
and
d3 = {'id': ["car", "plane", "plane"], 'value': [5, 6, 1.3]}
df3 = pd.DataFrame(data=d3)
df3
2 car 5.0
3 plane 6.0
4 plane 1.3
I tried the following function to save all values in a temporary list, but it did not work properly:
unique_list = []
def unique_2(df):
for id_1, value_1 in zip(df["id"], df["value"]):
for id_2, value_2 in zip(df["id"], df["value"]):
if id_1 == id_2:
if abs(value_1-value_2) > 0.01:
x = True
unique_list.append(x)
else:
x = False
unique_list.append(x)
else:
pass
答案1
得分: 2
以下是代码部分的中文翻译:
你可以使用自定义的 groupby
来拆分数据:
grp = df1['value'].sort_values().diff().gt(1).cumsum()
out = [g for _, g in df1.groupby(grp)]
请注意,不清楚你是否想要使用 <1
还是 ≤1
作为阈值。如果你想要 <1
,请将 gt(1)
替换为 ge(1)
。
输出:
[ id value
0 car 1.0
1 car 1.2
5 car 0.8,
id value
2 car 5.0
3 plane 6.0,
id value
4 plane 16.0]
中间的 grp
:
5 0
0 0
1 0
2 1
3 1
4 2
Name: value, dtype: int64
将单独的行分组在一起
如果你有不同的解释,如果你想将单独的行(即没有其他行与其相隔不超过1)分组在一起,可以使用以下代码:
grp = df1['value'].sort_values().diff().ge(1).cumsum()
grp = grp.mask(df1.groupby(grp).transform('size').eq(1), 'alone')
out = [g for _, g in df1.groupby(grp)]
请注意,我们只将相隔不超过1的行分组在一起。
输出:
[ id value
0 car 1.0
1 car 1.2
5 car 0.8,
id value
2 car 5.0
3 plane 6.0
4 plane 16.0]
中间的 grp
:
5 0
0 0
1 0
2 alone
3 alone
4 alone
Name: value, dtype: object
按ID分组:
grp = df1.sort_values(by='value').groupby('id', group_keys=False)['value'].apply(lambda g: g.diff().gt(1).cumsum())
grp = grp.mask(df1.groupby(['id', grp]).transform('size').eq(1), 'alone')
out = [g for _, g in df1.groupby(grp)]
输出:
[ id value
0 car 1.0
1 car 1.2
5 car 0.8,
id value
2 car 5.0
3 plane 6.0
4 plane 1.3]
英文:
You can use a custom groupby
to split the data:
grp = df1['value'].sort_values().diff().gt(1).cumsum()
out = [g for _, g in df1.groupby(grp)]
Note that it wasn't clear whether you want to use <1
or ≤1
as threshold. If you want <1
replace gt(1)
by ge(1)
.
Output:
[ id value
0 car 1.0
1 car 1.2
5 car 0.8,
id value
2 car 5.0
3 plane 6.0,
id value
4 plane 16.0]
Intermediate grp
:
5 0
0 0
1 0
2 1
3 1
4 2
Name: value, dtype: int64
grouping loners together
Assuming a different interpretation, if you want to groups loners (=rows that have no other row within 1) together, use:
grp = df1['value'].sort_values().diff().ge(1).cumsum()
grp = grp.mask(df1.groupby(grp).transform('size').eq(1), 'alone')
out = [g for _, g in df1.groupby(grp)]
Note that we're only grouping rows that are less than 1 apart
Output:
[ id value
0 car 1.0
1 car 1.2
5 car 0.8,
id value
2 car 5.0
3 plane 6.0
4 plane 16.0]
Intermediate grp
:
5 0
0 0
1 0
2 alone
3 alone
4 alone
Name: value, dtype: object
by ID:
grp = df1.sort_values(by='value').groupby('id', group_keys=False)['value'].apply(lambda g: g.diff().gt(1).cumsum())
grp = grp.mask(df1.groupby(['id', grp]).transform('size').eq(1), 'alone')
out = [g for _, g in df1.groupby(grp)]
Output:
[ id value
0 car 1.0
1 car 1.2
5 car 0.8,
id value
2 car 5.0
3 plane 6.0
4 plane 1.3]
答案2
得分: 1
以下是代码部分的翻译:
# 使用numpy广播获取列`value`的差异,获取绝对值,与1进行比较,将对角线上的值设置为False:
a = df1['value'].to_numpy()
m = np.abs(a - a[:, None]) < 1
np.fill_diagonal(m, False)
print (m)
[[False True False False False True]
[ True False False False False True]
[False False False False False False]
[False False False False False False]
[False False False False False False]
[ True True False False False False]]
# 最后根据每行至少有一个`True`来过滤行:
mask = np.any(m, axis=1)
df11, df22 = df1[mask], df1[~mask]
print (df11)
id value
0 car 1.0
1 car 1.2
5 car 0.8
print (df22)
id value
2 car 5.0
3 plane 6.0
4 plane 16.0
希望这些翻译对您有所帮助。如果您有其他问题或需要进一步的翻译,请随时告诉我。
英文:
Het differencies with column value
with numpy broadcasting, get absolute values, comapre less like 1
with set False
to diagonal:
a = df1['value'].to_numpy()
m = np.abs(a - a[:, None]) < 1
np.fill_diagonal(m, False)
print (m)
[[False True False False False True]
[ True False False False False True]
[False False False False False False]
[False False False False False False]
[False False False False False False]
[ True True False False False False]]
Last fitler rows with at least one True
per rows:
mask = np.any(m, axis=1)
df11, df22 = df1[mask], df1[~mask]
print (df11)
id value
0 car 1.0
1 car 1.2
5 car 0.8
print (df22)
id value
2 car 5.0
3 plane 6.0
4 plane 16.0
答案3
得分: 1
我无法提供解决方案,因为逻辑不清楚:
我想要过滤行,如果一个值的所有差异都小于1
def debug(sr):
print(f'[{sr.name}]')
arr = sr.values
val = np.abs(sr.values - sr.values[:, None])
print(pd.DataFrame(val, sr.tolist(), sr.tolist()))
print()
return np.max(val)
df1.groupby('id')['value'].transform(debug)
输出:
[car]
1.0 1.2 5.0 0.8
1.0 0.0 0.2 4.0 0.2 # 一个差异 > 1
1.2 0.2 0.0 3.8 0.4 # 一个差异 > 1
5.0 4.0 3.8 0.0 4.2 # 3个差异 > 1
0.8 0.2 0.4 4.2 0.0 # 1个差异 > 1
[plane]
6.0 16.0
6.0 0.0 10.0
16.0 10.0 0.0
0 4.2 # car组的差异 > 1
1 4.2
2 4.2
3 10.0 # plane组的差异 > 1
4 10.0
5 4.2
Name: value, dtype: float64
如您所见,对于每种组合,至少有一个值的差异大于1。因此,对于给定的组,您无法将其分成两部分。您只能将整个组设置为两个列表中的一个:
- 组1:所有绝对差异小于等于1
- 组2:至少有一个差异大于1
英文:
I have no solution because the logic is unclear:
> I want to filter rows out, if all differences for a value are smaller than 1
def debug(sr):
print(f'[{sr.name}]')
arr = sr.values
val = np.abs(sr.values - sr.values[:, None])
print(pd.DataFrame(val, sr.tolist(), sr.tolist()))
print()
return np.max(val)
df1.groupby('id')['value'].transform(debug)
Output:
[car]
1.0 1.2 5.0 0.8
1.0 0.0 0.2 4.0 0.2 # one difference > 1
1.2 0.2 0.0 3.8 0.4 # one difference > 1
5.0 4.0 3.8 0.0 4.2 # 3 differences > 1
0.8 0.2 0.4 4.2 0.0 # 1 difference > 1
[plane]
6.0 16.0
6.0 0.0 10.0
16.0 10.0 0.0
0 4.2 # car group difference > 1
1 4.2
2 4.2
3 10.0 # plane group difference > 1
4 10.0
5 4.2
Name: value, dtype: float64
As you can see, for each combination, there is at least one value whose difference is greater than 1. So for a given group, you can't split it into two parts. You can only set the whole group to one of the two lists:
- group1: all absolute differences is lower or equals than 1
- group2: at least one difference is greater than 1
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论