Comparing Two Dataframes to See if One Sits Inside the Range of the Other's columns

Question

I have two dataframes as follows:

```python
import pandas as pd
import itertools

df_test = {
    "col1": [1, 5, 6, 5],
    "col2": [4, 5, 6, 7],
    "col3": [5, 9, 10, 11],
    "col4": [11, 10, 10, 4],
}
df_train = {
    "col1": [3, 6, 7, 4, 4],
    "col2": [5, 10, 5, 5, 5],
    "col3": [6, 8, 9, 10, 6],
    "col4": [11, 10, 8, 12, 6],
}
df_train = pd.DataFrame(data=df_train)
df_test = pd.DataFrame(data=df_test)
```

I have already found the min and max values of each column of df_train and want to check whether df_test sits within them. I want to iterate through each row of df_test, changing the min and max to those of the training-set column being compared, and test whether each value in the row fits inside that range. If every value does, I insert True into a list; otherwise I insert False.

```python
# find minMax
df_train_minMax = df_train.agg([min, max])

# identify no of columns and rows in test data
columns = len(df_test.columns)
rows = len(df_test.index)
```

`df_train_minMax` looks like:

```
     col1  col2  col3  col4
min     3     5     6     6
max     7    10    10    12
```
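As a minimal sketch (not part of the original post), the bounds of each column can be read back from such a frame by label. It assumes the string names `"min"` and `"max"` are passed to `agg`, which produces the same two-row frame; `bounds`, `lo` and `hi` are illustrative names.

```python
import pandas as pd

df_train = pd.DataFrame({
    "col1": [3, 6, 7, 4, 4],
    "col2": [5, 10, 5, 5, 5],
    "col3": [6, 8, 9, 10, 6],
    "col4": [11, 10, 8, 12, 6],
})

# Two-row frame indexed by "min" and "max", one column per training column
bounds = df_train.agg(["min", "max"])

# Look up the lower and upper bound of each column by label
for col in bounds.columns:
    lo, hi = bounds.loc["min", col], bounds.loc["max", col]
    print(col, lo, hi)
```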

I have made two lists: `list` receives a True or False value per row of df_test, and `list2` receives the index locations of the False values. I then wrote a loop to iterate through df_test, although it's becoming very complicated to read. I am certain there is a better and simpler way of doing this.

```python
list = []
list2 = []

def inRange(min, max):
    x = 0
    # i = index, j = columns
    for i, j in itertools.product(range(rows), range(columns)):
        if df_test.iloc[i, j] >= min and df_test.iloc[i, j] <= max:
            x = x + 1
        else:
            x = x + 0
            list2.append((i, j))
        # print(x)
    if x == columns:
        list.insert(i, "True")
    else:
        list.insert(i, "False")

# calling function would need the values to continually change per min/max of each column
inRange(5, 10)

print(list)
print(list2)
```

Expected:

```
list = [False, True, True, False]
list2 = [(0, 0), (0, 1), (0, 3), (3, 2), (3, 3)]  # or similar
```

The function call should continually update the min and max values in accordance with the columns in `df_train_minMax`, which is the part I'm struggling to code.

Answer 1

Score: 1

Use `mask` to replace every value of `df_test` that lies between the minimum and maximum (inclusive) with NaN, so that only the out-of-range values remain:

```python
>>> df_mask = df_test.mask(lambda x: (x >= 5) & (x <= 10))
>>> df_mask.values.tolist()
[[1.0, 4.0, nan, 11.0], [nan, nan, nan, nan], [nan, nan, nan, nan], [nan, nan, 11.0, 4.0]]
```

Next, convert the result to booleans with `isnull()` and go through each row. A row that contains only True values matches the condition for the whole row (between the min and max numbers, 5 and 10), so "True" is inserted for it; otherwise "False" is inserted.

```python
>>> min_num, max_num = 5, 10
>>> df_mask = df_test.mask(lambda x: (x >= min_num) & (x <= max_num))
>>> df_mask
   col1  col2  col3  col4
0   1.0   4.0   NaN  11.0
1   NaN   NaN   NaN   NaN
2   NaN   NaN   NaN   NaN
3   NaN   NaN  11.0   4.0
>>> df_mask = df_mask.isnull()
>>> df_mask
    col1   col2   col3   col4
0  False  False   True  False
1   True   True   True   True
2   True   True   True   True
3   True   True  False  False
>>> df_mask = df_mask.values.tolist()
>>> df_mask
[[False, False, True, False], [True, True, True, True], [True, True, True, True], [True, True, False, False]]
# ...
>>> list2 = ["True" if all(i) else "False" for i in df_mask]
>>> list2
['False', 'True', 'True', 'False']
```
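The same row-level result can also be sketched without `mask`, by comparing directly and reducing each row with `all(axis=1)`. This is an alternative illustration rather than the answer's own code: `in_range`, `row_ok` and `bad_cells` are made-up names, NumPy is an added dependency, and the fixed range 5 to 10 is the one used in the example above.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({
    "col1": [1, 5, 6, 5],
    "col2": [4, 5, 6, 7],
    "col3": [5, 9, 10, 11],
    "col4": [11, 10, 10, 4],
})

min_num, max_num = 5, 10

# Boolean frame: True where min_num <= value <= max_num
in_range = (df_test >= min_num) & (df_test <= max_num)

# One flag per row: True only if every column of that row is in range
row_ok = in_range.all(axis=1).tolist()

# (row, column) positions of the values that fall outside the range
bad_cells = [tuple(pos) for pos in np.argwhere(~in_range.to_numpy()).tolist()]

print(row_ok)     # [False, True, True, False]
print(bad_cells)  # [(0, 0), (0, 1), (0, 3), (3, 2), (3, 3)]
```

Here the flags are real booleans rather than the strings "True" and "False", which makes them usable directly as a row filter, for example `df_test[in_range.all(axis=1)]`.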

EDIT:

Added an `inRange` function and modified its arguments:

```python
def inRange(min_num, max_num, df):
    # True marks values that satisfy min_num <= x <= max_num
    # (optional) create a column from the resulting list values
    return df.mask(lambda x: (x >= min_num) & (x <= max_num)).isnull()

for col in df_train.columns:
    min_num = df_train[col].min()
    max_num = df_train[col].max()
    # Note: all of df_test is checked against this one training column's range
    df1 = inRange(min_num, max_num, df_test)
    print(["True" if all(row) else "False" for row in df1.values.tolist()])
```
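For the per-column bounds the question asks about, one possible sketch compares `df_test` with the column-wise minimum and maximum of `df_train` in a single step. It is not the answer's code: it assumes both frames share the same column names, relies on pandas aligning a Series with the frame's column labels in `ge` and `le`, and uses illustrative names (`lower`, `upper`, `in_range`, `row_ok`, `bad_cells`).

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({
    "col1": [1, 5, 6, 5],
    "col2": [4, 5, 6, 7],
    "col3": [5, 9, 10, 11],
    "col4": [11, 10, 10, 4],
})
df_train = pd.DataFrame({
    "col1": [3, 6, 7, 4, 4],
    "col2": [5, 10, 5, 5, 5],
    "col3": [6, 8, 9, 10, 6],
    "col4": [11, 10, 8, 12, 6],
})

# Per-column bounds as Series indexed by column name
lower = df_train.min()
upper = df_train.max()

# axis="columns" matches each Series entry to the df_test column of the same name
in_range = df_test.ge(lower, axis="columns") & df_test.le(upper, axis="columns")

# Row-level flags and the (row, column) positions of out-of-range values
row_ok = in_range.all(axis=1).tolist()
bad_cells = [tuple(pos) for pos in np.argwhere(~in_range.to_numpy()).tolist()]

print(row_ok)     # [False, True, True, False]
print(bad_cells)  # [(0, 0), (0, 1), (0, 2), (3, 2), (3, 3)]
```

With these bounds the cell (0, 2) is flagged rather than (0, 3), because `col3` starts at 6 in the training data while `col4` allows values up to 12.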

huangapple
  • Posted on January 9, 2023 at 03:21:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75050640.html