Get the closest match for a column in a DataFrame

Question

I have a DataFrame that contains different call types, with values like these:

    CallType
0         IN
1        OUT
2       a_in
3       asms
4   INCOMING
5   OUTGOING
6  A2P_SMSIN
7        ain
8       aout

I want to map these values so that the output is:

    CallType
0       IN
1       OUT
2       IN
3       SMS
4       IN
5       OUT
6       SMS
7       IN
8       OUT

I am trying to use difflib.get_close_matches, but it returns no result for most rows. Below is my code:

import difflib

import pandas as pd

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']

def test1():
    final_file_data = pd.DataFrame({
        'CallType': ['IN', 'OUT', 'a_in',
                     'asms', 'INCOMING', 'OUTGOING', 'A2P_SMSIN',
                     'ain', 'aout']})

    print(final_file_data)
    # Look up the single closest CALL_TYPE entry for each value
    final_file_data['CallType'] = final_file_data['CallType'].apply(
        lambda x: difflib.get_close_matches(x, CALL_TYPE, n=1))
    print(final_file_data)

The output I get is below; only IN and OUT produce a match:

 CallType
0     [IN]
1    [OUT]
2       []
3       []
4       []
5       []
6       []
7       []
8       []

I am not sure where I am going wrong.

Answer 1

Score: 1

There are two causes: get_close_matches is case-sensitive, and it drops any candidate whose similarity score falls below cutoff (0.6 by default). You can convert x to upper case with upper() and lower the cutoff to make it less strict. This is what I did:

final_file_data['CallType'] = final_file_data['CallType'].apply(lambda x: difflib.get_close_matches(x.upper(), CALL_TYPE, n=1, cutoff=0))

final_file_data is now:

  CallType
0     [IN]
1    [OUT]
2     [IN]
3    [SMS]
4     [IN]
5    [OUT]
6    [SMS]
7     [IN]
8    [OUT]
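
Note that each cell above is still a one-element list rather than a plain string like in the desired output. Here is a minimal sketch of unwrapping it, assuming a hypothetical helper named closest_call_type (my own name, not part of difflib) with an optional cutoff and a default for values that match nothing:

```python
import difflib

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']


def closest_call_type(value, cutoff=0.0, default=None):
    """Return the best-matching CALL_TYPE entry as a plain string.

    Hypothetical helper: upper-cases the input, asks difflib for the
    single best candidate, and unwraps the one-element list.
    """
    matches = difflib.get_close_matches(value.upper(), CALL_TYPE,
                                        n=1, cutoff=cutoff)
    # With cutoff > 0 the list can be empty, so fall back to `default`.
    return matches[0] if matches else default


raw_values = ['IN', 'OUT', 'a_in', 'asms', 'INCOMING', 'OUTGOING',
              'A2P_SMSIN', 'ain', 'aout']
mapped = [closest_call_type(v) for v in raw_values]
print(mapped)
```

On the DataFrame this would be final_file_data['CallType'].apply(closest_call_type). One caveat: with cutoff=0 every input gets some label, even an unrelated string, so a small positive cutoff plus a default may be safer on real data.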

You can read more about get_close_matches and its cutoff argument in the Python difflib documentation.
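
To see why both changes are needed, it can help to print the raw similarity scores. The sketch below uses difflib.SequenceMatcher, which get_close_matches relies on internally; best_score is an illustrative helper name of mine, not a library function.

```python
from difflib import SequenceMatcher

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']


def best_score(value):
    """Return (best_candidate, ratio) for `value`; illustrative helper."""
    # Same argument order as get_close_matches: candidate first, input second.
    scores = {c: SequenceMatcher(None, c, value).ratio() for c in CALL_TYPE}
    best = max(scores, key=scores.get)
    return best, round(scores[best], 3)


for raw in ['a_in', 'A_IN', 'INCOMING', 'OUTGOING', 'A2P_SMSIN']:
    print(raw, best_score(raw))
```

Lower-case 'a_in' shares no characters with any candidate, so its score is 0.0. After upper-casing, 'A_IN' scores about 0.667 against 'IN' and passes the default cutoff of 0.6, but 'INCOMING' (0.4), 'OUTGOING' (0.545) and 'A2P_SMSIN' (0.5) still fall below it, which is why the cutoff has to be lowered as well.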

huangapple
  • Posted on 2023-03-23 12:03:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75819169.html