Get Closest match for a column in data frame

Question

I have a DataFrame that contains different call types, with values like the following:

      CallType
  0         IN
  1        OUT
  2       a_in
  3       asms
  4   INCOMING
  5   OUTGOING
  6  A2P_SMSIN
  7        ain
  8       aout

I want to map these values so that the output is:

     CallType
  0        IN
  1       OUT
  2        IN
  3       SMS
  4        IN
  5       OUT
  6       SMS
  7        IN
  8       OUT

I am trying to use difflib.get_close_matches, but it returns no match for most rows. Below is my code:

  import difflib
  import pandas as pd

  CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']

  def test1():
      final_file_data = pd.DataFrame({
          'CallType': ['IN', 'OUT', 'a_in',
                       'asms', 'INCOMING', 'OUTGOING', 'A2P_SMSIN',
                       'ain', 'aout']})
      print(final_file_data)
      # Look up the single closest CALL_TYPE entry for each value
      final_file_data['CallType'] = final_file_data['CallType'].apply(
          lambda x: difflib.get_close_matches(x, CALL_TYPE, n=1))

The output I get is below; only IN and OUT have results:

     CallType
  0      [IN]
  1     [OUT]
  2        []
  3        []
  4        []
  5        []
  6        []
  7        []
  8        []

I am not sure where I am going wrong.

Answer 1

Score: 1

This happens for two reasons: get_close_matches is case-sensitive, and it discards any candidate whose similarity score falls below the cutoff. You can convert the string x with upper() and make the cutoff less stringent. This is what I did:

  final_file_data['CallType'] = final_file_data['CallType'].apply(
      lambda x: difflib.get_close_matches(x.upper(), CALL_TYPE, n=1, cutoff=0))

final_file_data is now:

     CallType
  0      [IN]
  1     [OUT]
  2      [IN]
  3     [SMS]
  4      [IN]
  5     [OUT]
  6     [SMS]
  7      [IN]
  8     [OUT]
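
Note that each cell now holds a one-element list rather than a plain string. If strings are wanted, as in the question's desired output, one option is to take the first element of the match. A minimal sketch, assuming a hypothetical helper named `closest_call_type` and that returning the original value when nothing matches is acceptable:

```python
import difflib

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']


def closest_call_type(value, choices=CALL_TYPE):
    """Return the closest entry of `choices` as a plain string (hypothetical helper)."""
    matches = difflib.get_close_matches(value.upper(), choices, n=1, cutoff=0)
    # get_close_matches returns a list; fall back to the input if it is empty
    # (for example, when `choices` itself is empty).
    return matches[0] if matches else value


raw = ['IN', 'OUT', 'a_in', 'asms', 'INCOMING',
       'OUTGOING', 'A2P_SMSIN', 'ain', 'aout']
mapped = [closest_call_type(v) for v in raw]
print(mapped)
```

With the DataFrame from the question, the same helper could be passed to apply, for example `final_file_data['CallType'].apply(closest_call_type)`.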

You can read more about get_close_matches and its cutoff argument in the difflib documentation.
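
To see why the original call returned empty lists, it can help to look at the similarity ratios themselves and compare them with the default cutoff of 0.6. The following is a rough sketch that uses difflib.SequenceMatcher directly; `score` is a hypothetical helper name, not part of difflib:

```python
import difflib

CALL_TYPE = ['IN', 'OUT', 'SMS', 'VOICE', 'SMT']


def score(word, candidate):
    """Similarity ratio in [0, 1] between a candidate and a word (hypothetical helper)."""
    # The candidate is passed as the first sequence and the word as the second.
    return difflib.SequenceMatcher(None, candidate, word).ratio()


for word in ['a_in', 'A_IN', 'INCOMING']:
    # Pick the candidate with the highest ratio for this word.
    best = max(CALL_TYPE, key=lambda c: score(word, c))
    print(word, best, round(score(word, best), 3))
```

Lowercase `a_in` shares no characters with the uppercase candidates, so its ratio is 0. Upper-casing lifts `A_IN` above 0.6 against `IN`, but `INCOMING` only reaches 0.4 against `IN`, which is why the cutoff also has to be lowered.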

  • Posted by huangapple on 2023-03-23 12:03:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75819169.html