Compare two strings for a partial match
Question
I want to compare col1 and col2 and flag the rows where the two values partially match.
I wrote the code below, but the first row shows 'Mismatch' instead of 'Match'.
import pandas as pd
import re
data_in = {'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'],
           'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']}
df_input = pd.DataFrame(data_in)
# Expected output
data_out = {'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'],
            'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403'],
            'col3': ['Match', 'Match', 'Mismatch']}
df_output = pd.DataFrame(data_out)
def compare_func(row):
    # Matches only when the whole of col1 appears inside col2
    pattern = re.compile('.*' + re.escape(row['col1']) + '.*')
    if re.search(pattern, row['col2']):
        return 'Match'
    else:
        return 'Mismatch'
df_input['col3'] = df_input.apply(compare_func, axis=1)
data_output = df_input
Can you please fix the code so that rows whose values in the two columns only partially match are also marked as 'Match'?
Answer 1
Score: 2
If you want to compare values row by row, you can use str.extract to pull out the trailing digits of each value and compare them:
import numpy as np
c1 = df_input['col1'].str.extract(r'(\d+)$', expand=False)
c2 = df_input['col2'].str.extract(r'(\d+)$', expand=False)
df_input['col3'] = np.where(c1 == c2, 'Match', 'Mismatch')
Output:
>>> df_input
               col1            col2      col3
0    BANQ1049576495  KNKX1049576495     Match
1  HLCUSEL221162979    SEL221162979     Match
2        SEL1469779  KROL1020107403  Mismatch
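One detail of the str.extract approach that may be worth checking on real data: a value with no trailing digits is extracted as NaN, and NaN never compares equal to NaN. The sketch below reuses the question's sample data and adds one hypothetical row ('ABC' / 'XYZ', not from the original post) to illustrate this under those assumptions.

```python
import numpy as np
import pandas as pd

# Sample data from the question plus one hypothetical row with no digits.
df = pd.DataFrame({
    "col1": ["BANQ1049576495", "HLCUSEL221162979", "SEL1469779", "ABC"],
    "col2": ["KNKX1049576495", "SEL221162979", "KROL1020107403", "XYZ"],
})

# Extract the run of digits at the end of each value (NaN when there is none).
c1 = df["col1"].str.extract(r"(\d+)$", expand=False)
c2 = df["col2"].str.extract(r"(\d+)$", expand=False)

# NaN == NaN is False, so rows without trailing digits end up as "Mismatch".
df["col3"] = np.where(c1 == c2, "Match", "Mismatch")
print(df)
```

If two values without digits should instead count as a match, that case would need to be handled explicitly before the comparison.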
Answer 2
Score: 1
IIUC, you can use thefuzz:
# I'm using `df` instead of `df_input`
# stackoverflow.com/a/71899589/15239951
# stackoverflow.com/a/69169135/15239951
# pip install thefuzz
# pip install python-Levenshtein
from thefuzz import fuzz
R = 70  # feel free to adjust the ratio
df["col3 (fw)"] = ["Match" if fuzz.ratio(c1, c2) >= R else "Mismatch"
                   for (c1, c2) in zip(df["col1"], df["col2"])]
Or, following your regex approach, match on the sequence of digits:
import re
def match_str(s, regex=r"(?<=[A-Z])[0-9]+"):
    # Return the first run of digits preceded by an uppercase letter, or None
    m = re.search(regex, s)
    return m.group() if m else None
df["col3 (re)"] = ["Match" if match_str(s1) == match_str(s2) else "Mismatch"
                   for s1, s2 in zip(df["col1"], df["col2"])]
Output:
print(df)
               col1            col2 col3 (fw) col3 (re)
0    BANQ1049576495  KNKX1049576495     Match     Match
1  HLCUSEL221162979    SEL221162979     Match     Match
2        SEL1469779  KROL1020107403  Mismatch  Mismatch
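As written, the regex variant labels a pair as 'Match' when neither value contains digits, because None == None is true. The following is a minimal sketch of a stricter variant; the `label` helper and the rule that two digit-free values are a mismatch are assumptions added here, not part of the original answer.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "col1": ["BANQ1049576495", "HLCUSEL221162979", "SEL1469779"],
    "col2": ["KNKX1049576495", "SEL221162979", "KROL1020107403"],
})

# First run of digits that directly follows an uppercase letter.
DIGITS_AFTER_LETTER = re.compile(r"(?<=[A-Z])[0-9]+")


def match_str(s):
    """Return the matched digit run, or None when there is none."""
    m = DIGITS_AFTER_LETTER.search(s)
    return m.group() if m else None


def label(s1, s2):
    """Hypothetical helper: compare the digit runs of two strings."""
    k1, k2 = match_str(s1), match_str(s2)
    # Assumption: values without any digit run never count as a match.
    return "Match" if k1 is not None and k1 == k2 else "Mismatch"


df["col3 (re)"] = [label(a, b) for a, b in zip(df["col1"], df["col2"])]
print(df)
```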
Answer 3
Score: 0
Your question does not make clear how a "partial match" is defined. Judging from your examples, two strings partially match if they share a common suffix and that suffix is at least as long as the numeric part of the string. My answer is based on this observation.
You can use the following code:
import re
# str1 and str2 are the two strings to compare
def is_partially_match(str1, str2):
    # Splitting on a capturing group yields [letters, digits, remainder]
    letters1, nums1, _ = re.split(r'(\d+)', str1)
    letters2, nums2, _ = re.split(r'(\d+)', str2)
    return 'Match' if nums1 == nums2 else 'Mismatch'
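The function above compares a single pair of strings. A possible way to apply it to the DataFrame from the question is sketched below; the zip-based loop is one option among several, and the three-way unpacking assumes each string contains exactly one run of digits (a string with none, or with several, would raise a ValueError).

```python
import re

import pandas as pd


def is_partially_match(str1, str2):
    # re.split with a capturing group returns [letters, digits, remainder];
    # unpacking into three names assumes exactly one digit run per string.
    _, nums1, _ = re.split(r"(\d+)", str1)
    _, nums2, _ = re.split(r"(\d+)", str2)
    return "Match" if nums1 == nums2 else "Mismatch"


df_input = pd.DataFrame({
    "col1": ["BANQ1049576495", "HLCUSEL221162979", "SEL1469779"],
    "col2": ["KNKX1049576495", "SEL221162979", "KROL1020107403"],
})

# Compare the two columns pairwise, one row at a time.
df_input["col3"] = [
    is_partially_match(a, b)
    for a, b in zip(df_input["col1"], df_input["col2"])
]
print(df_input)
```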
Answer 4
Score: 0
It looks as though you're trying to match on the numeric part, in which case:
import re
data_in = {
    'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'],
    'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403']
}
FIND = re.compile(r'\d+')
for a, b in zip(data_in['col1'], data_in['col2']):
    # Compare the lists of all digit runs found in each value
    m = 'Match' if FIND.findall(a) == FIND.findall(b) else 'Mismatch'
    data_in.setdefault('col3', []).append(m)
print(data_in)
Output:
{'col1': ['BANQ1049576495', 'HLCUSEL221162979', 'SEL1469779'], 'col2': ['KNKX1049576495', 'SEL221162979', 'KROL1020107403'], 'col3': ['Match', 'Match', 'Mismatch']}
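The loop above works on the plain dictionary. Since the question uses a DataFrame, the same findall comparison could be written against df_input roughly as follows; this is a sketch of one possible adaptation, not the answer author's code.

```python
import re

import pandas as pd

FIND = re.compile(r"\d+")

df_input = pd.DataFrame({
    "col1": ["BANQ1049576495", "HLCUSEL221162979", "SEL1469779"],
    "col2": ["KNKX1049576495", "SEL221162979", "KROL1020107403"],
})

# findall returns every run of digits, so the two lists must be identical
# (same runs, same order) for a row to be labelled "Match".
df_input["col3"] = [
    "Match" if FIND.findall(a) == FIND.findall(b) else "Mismatch"
    for a, b in zip(df_input["col1"], df_input["col2"])
]
print(df_input)
```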