May 15, 2023, 12:04:18

Compare two strings for a partial match

Question

I want to compare col1 and col2 and flag the rows where the two values partially match.
I wrote the code below, but the first row shows 'Mismatch' instead of 'Match'.

import pandas as pd
import re

data_in = {'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'],
           'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']}
df_input = pd.DataFrame(data_in)

# Expected result
data_out = {'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'],
            'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403'],
            'col3': ['Match', 'Match', 'Mismatch']}
df_output = pd.DataFrame(data_out)


def compare_func(row):
    # Searches for the whole of col1 inside col2
    pattern = re.compile('.*' + re.escape(row['col1']) + '.*')
    if re.search(pattern, row['col2']):
        return 'Match'
    else:
        return 'Mismatch'

df_input['col3'] = df_input.apply(compare_func, axis=1)
data_output = df_input

Can you please fix the code so that rows are flagged as a match even when the values in the two columns match only partially?

Answer 1

Score: 2

If you want to compare the values row by row, you can use str.extract to pull the trailing digits out of each column and compare them:

import numpy as np

c1 = df_input['col1'].str.extract(r'(\d+)$', expand=False)
c2 = df_input['col2'].str.extract(r'(\d+)$', expand=False)
df_input['col3'] = np.where(c1 == c2, 'Match', 'Mismatch')

Output:

>>> df_input
               col1            col2      col3
0    BANQ1049576495  KNKX1049576495     Match
1  HLCUSEL221162979    SEL221162979     Match
2        SEL1469779  KROL1020107403  Mismatch
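
One case the sample data does not cover is a value with no trailing digits at all, where str.extract yields a missing value. The following is a minimal sketch of one way to make that case explicit. The fourth row ('NODIGITS', 'ALSONONE') and the choice to label it 'Mismatch' are assumptions added for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

# The last row is a hypothetical addition with no trailing digits.
df = pd.DataFrame({
    'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779', 'NODIGITS'],
    'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403', 'ALSONONE'],
})

# Capture the run of digits at the end of each value (missing if there is none).
c1 = df['col1'].str.extract(r'(\d+)$', expand=False)
c2 = df['col2'].str.extract(r'(\d+)$', expand=False)

# Require both sides to have digits before comparing them.
same = c1.notna() & c2.notna() & (c1 == c2)
df['col3'] = np.where(same.astype(bool).to_numpy(), 'Match', 'Mismatch')

print(df)
```

Whether two values without digits should count as a match depends on the data, so the explicit notna() checks are only one possible policy.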

Answer 2

Score: 1

IIUC, you can use thefuzz:

# I'm using `df` instead of `df_input`
# stackoverflow.com/a/71899589/15239951
# stackoverflow.com/a/69169135/15239951

# pip install thefuzz
# pip install python-Levenshtein
from thefuzz import fuzz

R = 70  # feel free to adjust the ratio

df["col3 (fw)"] = ["Match" if fuzz.ratio(c1, c2) >= R else "Mismatch"
                   for (c1, c2) in zip(df["col1"], df["col2"])]

Or, following your regex approach, match on the sequence of digits:

import re

def match_str(s, regex=r"(?<=[A-Z])[0-9]+"):
    # Return the first run of digits that follows an uppercase letter, or None
    m = re.search(regex, s)
    return m.group() if m else None

df["col3 (re)"] = ["Match" if match_str(s1) == match_str(s2) else "Mismatch"
                   for s1, s2 in zip(df["col1"], df["col2"])]

Output:

print(df)

               col1            col2 col3 (fw) col3 (re)
0    BANQ1049576495  KNKX1049576495     Match     Match
1  HLCUSEL221162979    SEL221162979     Match     Match
2        SEL1469779  KROL1020107403  Mismatch  Mismatch
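
If installing thefuzz is not an option, a similar similarity-threshold check can be sketched with difflib from the standard library. This swaps in a different scorer, so the scores are not guaranteed to equal those of fuzz.ratio. The helper name fuzzy_label and the 0.70 threshold (chosen to mirror R = 70) are assumptions for illustration.

```python
from difflib import SequenceMatcher

R = 0.70  # assumed threshold on a 0-1 scale, mirroring R = 70

def fuzzy_label(a, b, threshold=R):
    """Label a pair by its difflib similarity ratio (hypothetical helper)."""
    score = SequenceMatcher(None, a, b).ratio()
    return "Match" if score >= threshold else "Mismatch"

col1 = ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779']
col2 = ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']

labels = [fuzzy_label(a, b) for a, b in zip(col1, col2)]
print(labels)
```

A fuzzy threshold tolerates small differences anywhere in the string, so it can also accept pairs whose digits differ slightly. The regex approach is stricter if the numeric part must be identical.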

Answer 3

Score: 0

Your question does not make clear how a "partial match" is defined. Judging from your examples, two strings "partially match" if they share a common suffix that is at least as long as the numeric part of each string. My answer is based on this observation.

You can use the following code:

import re

# str1 and str2 are the two strings to compare.
# Assumes each string contains exactly one run of digits.
def is_partially_match(str1, str2):
    letters1, nums1, _ = re.split(r'(\d+)', str1)
    letters2, nums2, _ = re.split(r'(\d+)', str2)
    return 'Match' if nums1 == nums2 else 'Mismatch'
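
The suffix-based definition described in this answer can also be checked literally, rather than by comparing only the digits. The sketch below is one possible reading of that definition. The helper names (common_suffix_len, trailing_digits, is_partial_match) are hypothetical, and treating strings without trailing digits as a mismatch is an assumption.

```python
import re

def common_suffix_len(a, b):
    """Length of the longest suffix shared by a and b."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n

def trailing_digits(s):
    """Run of digits at the end of s, or '' if there is none."""
    m = re.search(r'\d+$', s)
    return m.group() if m else ''

def is_partial_match(a, b):
    d1, d2 = trailing_digits(a), trailing_digits(b)
    if not d1 or not d2:
        return 'Mismatch'  # assumption: no digits means no match
    # The shared suffix must cover the full numeric part of both strings.
    needed = max(len(d1), len(d2))
    return 'Match' if common_suffix_len(a, b) >= needed else 'Mismatch'

pairs = [('BANQ1049576495', 'KNKX1049576495'),
         ('HLCUSEL221162979', 'SEL221162979'),
         ('SEL1469779', 'KROL1020107403')]
results = [is_partial_match(a, b) for a, b in pairs]
print(results)
```

Unlike re.split with tuple unpacking, this version does not raise when a string contains several digit runs or none.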

Answer 4

Score: 0

It looks as though you're trying to match on the numeric part, in which case:

import re

data_in = {
    'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'],
    'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']
}

FIND = re.compile(r'\d+')

for a, b in zip(data_in['col1'], data_in['col2']):
    m = 'Match' if FIND.findall(a) == FIND.findall(b) else 'Mismatch'
    data_in.setdefault('col3', []).append(m)

print(data_in)

Output:

{'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'], 'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403'], 'col3': ['Match', 'Match', 'Mismatch']}
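
Because findall returns every run of digits, the comparison behaves differently from a trailing-digits check on inputs with several digit runs or none. The sketch below illustrates this. Every pair except the first is a hypothetical input made up for illustration, and the helper name label is an assumption.

```python
import re

FIND = re.compile(r'\d+')

def label(a, b):
    # Compare the full lists of digit runs found in each string.
    return 'Match' if FIND.findall(a) == FIND.findall(b) else 'Mismatch'

pairs = [
    ('BANQ1049576495', 'KNKX1049576495'),  # from the question
    ('AB12CD34', 'XY12ZZ34'),              # same runs in the same order
    ('AB12CD34', 'XY34'),                  # only the last run is shared
    ('ABC', 'XYZ'),                        # no digits on either side
]
results = [label(a, b) for a, b in pairs]
print(results)
```

Note that two strings without any digits both produce an empty list, so they compare as equal. If that is not wanted, an extra check for an empty result would be needed.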
