英文:
Comparing Two Dataframes to See if One Sits Inside the Range of the Other's columns
问题
我有两个如下的数据框:
```python
import pandas as pd
import itertools
df_test = {
"col1": [1, 5, 6, 5],
"col2": [4, 5, 6, 7],
"col3": [5, 9, 10, 11],
"col4": [11, 10, 10, 4],
}
df_train = {
"col1": [3, 6, 7, 4, 4],
"col2": [5, 10, 5, 5, 5],
"col3": [6, 8, 9, 10, 6],
"col4": [11, 10, 8, 12, 6],
}
df_train = pd.DataFrame(data=df_train)
df_test = pd.DataFrame(data=df_test)
我已经找到了df_train的最小值和最大值,想要查看df_test是否在这些值范围内,我想遍历每一行,并根据训练集的每列给出的最小值和最大值不断地更改。我不断地比较df_test行中的每个值,看它是否在最小值和最大值内。如果是,则将True插入到一个列表中,否则将插入False。
# 寻找最小最大值
df_train_minMax = df_train.agg([min, max])
# 确定测试数据的列数和行数
columns = len(df_test.columns)
rows = len(df_test.index)
df_train_minMax看起来像:
col1 col2 col3 col4
min 3 5 6 6
max 7 10 10 12
我创建了两个列表 - list接收df_test每行的True或False值。list2接收False值的索引位置。然后我创建了一个循环来遍历df_test,尽管它变得很难阅读。我确信有一种更好更简单的方法来做到这一点。
list = []
list2 = []
def inRange(min, max):
x = 0
# i = index, j = columns
for i, j in itertools.product(range(rows), range(columns)):
if df_test.iloc[i, j] >= min and df_test.iloc[i, j] <= max:
x = x + 1
else:
x = x + 0
list2.append((i, j))
if x == columns:
list.insert(i, "True")
else:
list.insert(i, "False")
# 调用函数需要根据df_train_minMax中每列的最小/最大值不断更新数值
inRange(5, 10)
print(list)
print(list2)
预期输出:
list = [False, True, True, False]
list2 = [(0, 0), (0, 1), (0, 3), (3, 2), (3, 3)] 或类似
调用函数应该根据df_train_minMax中的列不断更新最小值和最大值,这是我正在努力编码的部分。
英文:
I have 2 dataframes as follows:
import pandas as pd
import itertools
df_test = {
"col1": [1, 5, 6, 5],
"col2": [4, 5, 6, 7],
"col3": [5, 9, 10, 11],
"col4": [11, 10, 10, 4],
}
df_train = {
"col1": [3, 6, 7, 4, 4],
"col2": [5, 10, 5, 5, 5],
"col3": [6, 8, 9, 10, 6],
"col4": [11, 10, 8, 12, 6],
}
df_train = pd.DataFrame(data=df_train)
df_test = pd.DataFrame(data=df_test)
I have already found the min and max values of df_train and want to see if df_test sits within these values, I want to iterate through each row, continually changing the min and max values as given for each column of the training set. I continually compare each value in the df_test row to see if it fits inside the min and max. If yes, then i will insert True to a list, else I'll write False.
# find minMax
df_train_minMax = df_train.agg([min, max])
# identify no of columns and rows in test data
columns = len(df_test.columns)
rows = len(df_test.index)
df_train_minMax looks like:
col1 col2 col3 col4
min 3 5 6 6
max 7 10 10 12
I have made 2 lists - list receives True or False values per row of df_test. list2 receives the index location of the False values. And then I've made a loop to iterate through df_test, although its becoming very complicated to read. I am certain there is a better and simpler way of doing this.
list = []
list2 = []
def inRange(min, max):
x = 0
# i = index, j = columns
for i, j in itertools.product(range(rows), range(columns)):
if df_test.iloc[i, j] >= min and df_test.iloc[i, j] <= max:
x = x + 1
else:
x = x + 0
list2.append((i, j))
# print(x)
if x == columns:
list.insert(i, "True")
else:
list.insert(i, "False")
# calling function would need the values to continually change per min/max of each column
inRange(5, 10)
print(list)
print(list2)
Expected:
list = [False, True, True, False]
list2 = [(0, 0), (0, 1), (0, 3), (3, 2), (3, 3)] or similar
calling function should continually update the min and max values in accordance with the columns in df_train_minMax, which is what I'm struggling to code.
答案1
得分: 1
>>> df_mask = df_test.mask(lambda x: (x>=5) & (x<=10))
>>> df_mask.values.tolist()
[[1.0, 4.0, nan, 11.0], [nan, nan, nan, nan], [nan, nan, nan, nan], [nan, nan, 11.0, 4.0]]
在这一步中,遍历每个元素,如果只包含“True”值,表示整行都符合条件(在最小值和最大值之间,即5和10),则插入“True”。
>>> min_num, max_num = 5, 10
>>> df_mask = df_test.mask(lambda x: (x>=min_num) & (x<=max_num))
>>> df_mask
col1 col2 col3 col4
0 1.0 4.0 NaN 11.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN 11.0 4.0
>>> df_mask = df_mask.isnull()
>>> df_mask
col1 col2 col3 col4
0 False False True False
1 True True True True
2 True True True True
3 True True False False
>>> df_mask = df_mask.values.tolist()
>>> df_mask
[[False, False, True, False], [True, True, True, True], [True, True, True, True], [True, True, False, False]]
# ...
>>> list2 = ["True" if (len(set(i)) == 1 and list(set(i))[0] == True) else "False" for i in df_mask]
>>> list2
['False', 'True', 'True', 'False']
<br>
# 编辑:
添加了inRange函数并修改了参数
```python
def inRange(min_num, max_num, df):
# 只有True值符合条件 min < x > max
# (可选)使用列表值创建列
return df.mask(lambda x: (x>=min_num) & (x<=max_num)).isnull()
for col in df_train.tolist():
min_num = min(col)
max_num = max(col)
df1 = inRange(min_num, max_num, df_test)
print([["True" if (len(set(i)) == 1 and list(set(i))[0] == True) else "False" for i in df1.tolist()])
英文:
Replace values of each row using mask
and setting as conditional numbers less or equal and greater or equal than x that is the value in a dataframe you iterate
>>> df_mask = df_test.mask(lambda x: (x>=5) & (x<=10))
>>> df_mask.values.tolist()
[[1.0, 4.0, nan, 11.0], [nan, nan, nan, nan], [nan, nan, nan, nan], [nan, nan, 11.0, 4.0]]
At this point pass through each element and if just contains True
values means that you insert "True"
because that entire row match for your condition between min and max numbers (5, 10)
>>> min_num, max_num = 5, 10
>>> df_mask = df_test.mask(lambda x: (x>=min_num) & (x<=max_num))
>>> df_mask
col1 col2 col3 col4
0 1.0 4.0 NaN 11.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN 11.0 4.0
>>> df_mask = df_mask.isnull()
>>> df_mask
col1 col2 col3 col4
0 False False True False
1 True True True True
2 True True True True
3 True True False False
>>> df_mask = df_mask.values.tolist()
>>> df_mask
[[False, False, True, False], [True, True, True, True], [True, True, True, True], [True, True, False, False]]
# ...
>>> list2 = ["True" if (len(set(i)) == 1 and list(set(i))[0] == True) else "False" for i in df_mask]
>>> list2
['False', 'True', 'True', 'False']
<br>
EDIT:
added inRange function and modified args
def inRange(min_num, max_num, df):
# only True values match condition min < x > max
# (optional) create colum using list values
return df.mask(lambda x: (x>=min_num) & (x<=max_num)).isnull()
for col in df_train.tolist():
min_num = min(col)
max_num = max(col)
df1 = inRange(min_num, max_num, df_test)
print([["True" if (len(set(i)) == 1 and list(set(i))[0] == True) else "False" for i in df1.tolist()])
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论