Get closest match for a column in a DataFrame
Question
I have a DataFrame that contains different call types, with the values below:
CallType
0 IN
1 OUT
2 a_in
3 asms
4 INCOMING
5 OUTGOING
6 A2P_SMSIN
7 ain
8 aout
I want to map these values so that the output is:
CallType
0 IN
1 OUT
2 IN
3 SMS
4 IN
5 OUT
6 SMS
7 IN
8 OUT
I am trying to use difflib.get_close_matches, but it returns no match for most rows. Below is my code:
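import difflib
import pandas as pd

# Canonical call types that every raw value should be mapped to
CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']

def test1():
    final_file_data = pd.DataFrame({
        'CallType': ['IN', 'OUT', 'a_in',
                     'asms', 'INCOMING', 'OUTGOING', 'A2P_SMSIN',
                     'ain', 'aout']})
    print(final_file_data)
    # Look up the single closest canonical type for each raw value
    final_file_data['CallType'] = final_file_data['CallType'].apply(
        lambda x: difflib.get_close_matches(x, CALL_TYPE, n=1))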
The output I get is below; it has results only for IN and OUT:
CallType
0 [IN]
1 [OUT]
2 []
3 []
4 []
5 []
6 []
7 []
8 []
I am not sure where I am going wrong.
Answer 1
Score: 1
There are two causes: get_close_matches is case-sensitive, and its cutoff for the similarity score (0.6 by default) is too strict for these values. You can convert the string x to uppercase with upper() and make the cutoff less stringent. This is what I did:
final_file_data['CallType'] = final_file_data['CallType'].apply(lambda x: difflib.get_close_matches(x.upper(), CALL_TYPE, n=1, cutoff=0))
final_file_data is now:
CallType
0 [IN]
1 [OUT]
2 [IN]
3 [SMS]
4 [IN]
5 [OUT]
6 [SMS]
7 [IN]
8 [OUT]
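Note that each cell above still holds a one-element list, whereas the desired output in the question has plain strings. A minimal sketch of one way to unwrap the result, assuming a helper named closest_call_type (a hypothetical name, not part of difflib or pandas) that returns the first match or a default:

```python
import difflib

import pandas as pd

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']


def closest_call_type(value, default=None):
    """Return the best-matching entry of CALL_TYPE as a string.

    Hypothetical helper for illustration; `default` is returned
    when get_close_matches finds nothing.
    """
    matches = difflib.get_close_matches(value.upper(), CALL_TYPE, n=1, cutoff=0)
    return matches[0] if matches else default


final_file_data = pd.DataFrame({
    'CallType': ['IN', 'OUT', 'a_in', 'asms', 'INCOMING',
                 'OUTGOING', 'A2P_SMSIN', 'ain', 'aout']})
# Replace each raw value with its closest canonical call type (a plain string)
final_file_data['CallType'] = final_file_data['CallType'].map(closest_call_type)
print(final_file_data)
```

Keep in mind that cutoff=0 assigns some label to every input, even one unrelated to any call type, so a moderate cutoff combined with the default may be safer on real data.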
You can read more about get_close_matches and its cutoff argument in the difflib documentation.
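To see why the original call returned empty lists, it can help to print the similarity scores directly. The sketch below uses difflib.SequenceMatcher.ratio(), which is the score get_close_matches compares against cutoff; the similarity function is a hypothetical helper added only for illustration:

```python
import difflib

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']


def similarity(candidate, word):
    """Score a candidate against a word with SequenceMatcher.ratio().

    Hypothetical helper; get_close_matches keeps a candidate only
    if this score is at least `cutoff` (0.6 by default).
    """
    return difflib.SequenceMatcher(None, candidate, word).ratio()


for word in ['a_in', 'A_IN', 'INCOMING', 'OUTGOING']:
    # One score per canonical call type, rounded for readability
    scores = {c: round(similarity(c, word), 3) for c in CALL_TYPE}
    print(word, scores)
```

For example, lowercase 'a_in' shares no characters with 'IN' and scores 0, while 'INCOMING' scores 0.4 against 'IN', which is below the default cutoff of 0.6. That is why both uppercasing and a lower cutoff are needed.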