Compare two strings partially

Question


I want to compare col1 and col2 and flag the rows where the two values partially match.
I wrote the code below, but the first row shows 'Mismatch' instead of 'Match'.

import pandas as pd
import re

data_in = {'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'],
           'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']}
df_input = pd.DataFrame(data_in)

# Expected result
data_out = {'col1': ['BANQ1049576495','HLCUSEL221162979','SEL1469779'],
            'col2': ['KNKX1049576495','SEL221162979','KROL1020107403'],
            'col3':['Match','Match','Mismatch']}
df_ouput = pd.DataFrame(data_out)


def compare_func(row):
    # Searches for the whole col1 value inside the col2 value
    pattern = re.compile('.*' + re.escape(row['col1']) + '.*')
    if re.search(pattern, row['col2']):
        return 'Match'
    else:
        return 'Mismatch'

df_input['col3'] = df_input.apply(compare_func, axis=1)
data_ouput = df_input

Can you please fix the code so that rows whose values in the two columns only partially match are still marked as a match?

Answer 1

Score: 2

If you want to compare values row by row, you can use str.extract:

import numpy as np

# Capture the run of digits at the end of each value
c1 = df_input['col1'].str.extract(r'(\d+)$', expand=False)
c2 = df_input['col2'].str.extract(r'(\d+)$', expand=False)
df_input['col3'] = np.where(c1 == c2, 'Match', 'Mismatch')

Output:

>>> df_input
               col1            col2      col3
0    BANQ1049576495  KNKX1049576495     Match
1  HLCUSEL221162979    SEL221162979     Match
2        SEL1469779  KROL1020107403  Mismatch
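
For anyone who wants to see what the `(\d+)$` pattern captures without pandas, here is a minimal plain-Python sketch. The helper name `trailing_digits` and the `labels` list are made up for illustration and are not part of the answer above. The sketch assumes a value with no trailing digits should count as a mismatch, which mirrors how a NaN returned by `str.extract` never compares equal.

```python
import re

col1 = ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779']
col2 = ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']

def trailing_digits(s):
    """Return the run of digits at the end of s, or None if there is none."""
    m = re.search(r'(\d+)$', s)
    return m.group(1) if m else None

labels = []
for a, b in zip(col1, col2):
    da, db = trailing_digits(a), trailing_digits(b)
    # Assumption of this sketch: a missing digit suffix is a mismatch
    same = da is not None and da == db
    labels.append('Match' if same else 'Mismatch')

print(labels)
```

Only the digits at the very end of each value are compared, so any letters in front of them are ignored.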

Answer 2

Score: 1


IIUC, you can use thefuzz:

# I'm using `df` instead of `df_input`
# stackoverflow.com/a/71899589/15239951
# stackoverflow.com/a/69169135/15239951

# pip install thefuzz
# pip install python-Levenshtein
from thefuzz import fuzz

R = 70  # feel free to adjust the ratio

df["col3 (fw)"] = ["Match" if fuzz.ratio(c1, c2) >= R else "Mismatch"
                   for (c1, c2) in zip(df["col1"], df["col2"])]

Or, following your regex approach, match on the sequence of digits:

import re

def match_str(s, regex=r"(?<=[A-Z])[0-9]+"):
    # First run of digits that directly follows an uppercase letter, or None
    m = re.search(regex, s)
    return m.group() if m else None

df["col3 (re)"] = ["Match" if match_str(s1) == match_str(s2) else "Mismatch"
                   for s1, s2 in zip(df["col1"], df["col2"])]

Output:

print(df)

               col1            col2 col3 (fw) col3 (re)
0    BANQ1049576495  KNKX1049576495     Match     Match
1  HLCUSEL221162979    SEL221162979     Match     Match
2        SEL1469779  KROL1020107403  Mismatch  Mismatch
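
If installing thefuzz is not an option, the same threshold idea can be tried with the standard library. This is only a rough sketch: `difflib.SequenceMatcher` is a stand-in, not the metric thefuzz uses, so scores may differ from `fuzz.ratio`. The names `similarity`, `THRESHOLD` and `fuzzy_labels` are hypothetical, and the 0.70 cutoff is simply the answer's 70 rescaled to a 0..1 range.

```python
from difflib import SequenceMatcher

col1 = ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779']
col2 = ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']

THRESHOLD = 0.70  # hypothetical cutoff on a 0..1 scale; tune for your data

def similarity(a, b):
    """Similarity in [0, 1] from difflib (not the same metric as fuzz.ratio)."""
    return SequenceMatcher(None, a, b).ratio()

fuzzy_labels = ['Match' if similarity(a, b) >= THRESHOLD else 'Mismatch'
                for a, b in zip(col1, col2)]

print(fuzzy_labels)
```

A fuzzy score looks at the whole string, letters included, so the threshold has to be checked against real data before relying on it.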

Answer 3

Score: 0


Your question does not make clear how "partially match" is defined. From the examples you provided, it seems that two strings are "partially matched" if they share a common suffix whose length is no less than the length of the numeric part of the string. My answer is based on this observation.

You can use the following code:

import re

# str1 and str2 are the two strings to compare
def is_partially_match(str1, str2):
    # re.split with a capturing group yields [letters, digits, remainder];
    # the unpacking assumes each string contains exactly one run of digits
    letters1, nums1, _ = re.split(r'(\d+)', str1)
    letters2, nums2, _ = re.split(r'(\d+)', str2)
    return 'Match' if nums1 == nums2 else 'Mismatch'
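
The suffix rule described in this answer can also be written out literally. The sketch below is one possible reading of it, not the answer's own code: the helper names `common_suffix_len`, `digit_suffix` and `is_partial_match` are invented here, and requiring the shared suffix to cover the longer of the two digit suffixes is an assumption.

```python
import re

def common_suffix_len(a, b):
    """Count how many characters a and b share, reading from the end."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n

def digit_suffix(s):
    """Return the digits at the end of s, or '' if s does not end in a digit."""
    m = re.search(r'\d+$', s)
    return m.group() if m else ''

def is_partial_match(a, b):
    da, db = digit_suffix(a), digit_suffix(b)
    if not da or not db:
        return False  # assumption: no numeric part means no match
    # The shared suffix must cover the whole numeric part of both strings
    return common_suffix_len(a, b) >= max(len(da), len(db))

pairs = [('BANQ1049576495', 'KNKX1049576495'),
         ('HLCUSEL221162979', 'SEL221162979'),
         ('SEL1469779', 'KROL1020107403')]
print(['Match' if is_partial_match(a, b) else 'Mismatch' for a, b in pairs])
```

Unlike the three-way unpacking, this version does not raise an error when a string has no digits or more than one run of digits.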

Answer 4

Score: 0

It looks as though you're trying to match on the numeric part, in which case:

import re

data_in = {
    'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'],
    'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']
}

FIND = re.compile(r'\d+')

for a, b in zip(data_in['col1'], data_in['col2']):
    m = 'Match' if FIND.findall(a) == FIND.findall(b) else 'Mismatch'
    data_in.setdefault('col3', []).append(m)

print(data_in)

Output:

{'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'], 'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403'], 'col3': ['Match', 'Match', 'Mismatch']}
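
One thing worth knowing about the `findall` comparison is that it compares every run of digits, in order, as a list. The short sketch below shows how that plays out; all strings except the first pair are made-up examples, and `digit_runs` is a hypothetical helper name.

```python
import re

FIND = re.compile(r'\d+')

def digit_runs(s):
    """Return every run of digits in s, in order of appearance."""
    return FIND.findall(s)

examples = [
    ('BANQ1049576495', 'KNKX1049576495'),  # one identical run
    ('AB12CD34', 'XY12ZZ34'),              # two runs, both identical
    ('AB12CD34', 'XY34'),                  # different number of runs
    ('ABC', 'XYZ'),                        # no digits on either side
]

results = ['Match' if digit_runs(a) == digit_runs(b) else 'Mismatch'
           for a, b in examples]
print(results)
```

Note the last pair: two strings without any digits both produce an empty list, so they are reported as a match. Add an explicit check if that is not what you want.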