Keyword to label mapping for list column in pandas

Question

I have a DataFrame named difference with a column named surface_wordings, whose values look like this:

['paths-modified-/clusters/{cluster_id}/hosts/{host_id}/instructions-operations-modified-POST-parameters-modified-body-reply-schema-properties-modified-step_type-enum-deleted']

I have a keyword-to-label mapping defined. I want to find the keywords in this column and put the corresponding labels in a new column named labels. The mapping is as follows:

keyword_label_mappings = {
    'POST-parameters-modified': 'POST Parameters Modified',
    'PUT-parameters-modified': 'PUT Parameters Modified',
}

I am not sure how to achieve this. Any suggestions or ideas would be greatly appreciated.

Answer 1

Score: 1

I slightly extended your keyword_label_mappings dict so that it produces output for your second sample:

keyword_label_mappings = {
    'POST-parameters-modified': 'POST Parameters Modified',
    'PUT-parameters-modified': 'PUT Parameters Modified',
    'POST-responses-modified': 'POST Responses Modified',
    'DELETE-summary-from': 'DELETE Summary Changed',
    'POST-responses-deleted': 'POST Responses Deleted',
    'POST-parameters-added': 'POST Parameters Added',
    'POST-parameters-deleted': 'POST Parameters Deleted',
    'GET-summary-to': 'GET Summary To',  # added for demo
    'GET-summary-from': 'GET Summary From',  # added for demo
}

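Use explode to get one string per row, str.extractall to extract the keys of your dict, then map to replace them with their labels:

import re

# Alternation of the escaped dict keys, wrapped in a single capture group
pattern = fr"({'|'.join(re.escape(k) for k in keyword_label_mappings)})"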

difference['labels'] = (
    difference['surface_wordings'].explode().str.extractall(pattern)[0]
                                  .map(keyword_label_mappings).droplevel('match')
                                  .groupby(level=0).agg(list)
)

Output:

>>> difference
                                                                      surface_wordings                              labels
63657  [paths-modified-/pets-operations-modified-GET-summary-from, paths-modified-/...  [GET Summary From, GET Summary To]
63658  [info-version-from, info-version-to, paths-modified-/pets-operations-modifie...  [GET Summary From, GET Summary To]
63659  [paths-modified-/pets-operations-modified-GET-summary-from, paths-modified-/...  [GET Summary From, GET Summary To]
63661  [info-title-from, info-title-to, info-license-deleted, info-version-from, in...  [GET Summary From, GET Summary To]
63662  [openAPI-from, openAPI-to, paths-added, paths-deleted, endpoints-added, endp...                                 NaN
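
For comparison, here is a minimal plain-Python sketch of the same keyword lookup, using re.findall per row instead of str.extractall. The helper name labels_for and the sample rows are assumptions made up for illustration (the rows only imitate the style of the values shown above). Sorting the keys longest first is an extra precaution, so that a key that is a prefix of another key cannot shadow it in the alternation.

```python
import re

# Subset of the mapping from the answer above, shortened for the sketch.
keyword_label_mappings = {
    "POST-parameters-modified": "POST Parameters Modified",
    "PUT-parameters-modified": "PUT Parameters Modified",
    "GET-summary-to": "GET Summary To",
    "GET-summary-from": "GET Summary From",
}

# Longest keys first, so a shorter key never wins over a longer one
# that starts with the same text.
keys = sorted(keyword_label_mappings, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))


def labels_for(wordings):
    """Return the labels of all keywords found in a list of wording strings.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    labels = []
    for wording in wordings:
        for keyword in pattern.findall(wording):
            labels.append(keyword_label_mappings[keyword])
    return labels


# Invented sample rows in the same style as the data shown above.
rows = [
    ["paths-modified-/clusters/{cluster_id}/hosts/{host_id}/instructions-"
     "operations-modified-POST-parameters-modified-body-reply-schema-"
     "properties-modified-step_type-enum-deleted"],
    ["paths-modified-/pets-operations-modified-GET-summary-from",
     "paths-modified-/pets-operations-modified-GET-summary-to"],
    ["openAPI-from", "openAPI-to", "paths-added"],
]

for row in rows:
    print(labels_for(row))
```

Assuming each cell of the column holds a list of strings, this could be applied with difference['labels'] = difference['surface_wordings'].apply(labels_for). Unlike the extractall version, rows without any match then get an empty list rather than NaN.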

Answer 2

Score: 0

Let's go over your algorithm. Suppose we have one row, and the value of the surface_wordings column is:

[paths-modified-/clusters/{cluster_id}/hosts/{host_id}/instructions-operations-modified-POST-parameters-modified-body-reply-schema-properties-modified-step_type-enum-deleted] 

After the line for keyword_list in difference['surface_wordings']: runs, keyword_list holds the value above.

Next, for keyword in keyword_list.split(', '): yields that same string once, because the value contains no ", ", so keyword is identical to keyword_list.

Then keyword = keyword.split('-')[0] gives the word [paths, because I assume the whole value is a single string. That word does not match any key in the keyword_label_mappings dictionary, so you end up with [] at the end.
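
The walk-through above can be reproduced with a few print statements. This is only a sketch: it assumes, as the answer does, that the cell holds one string with the brackets included, and the dictionary lookup inside the loop is a guess at what the original code did, since only the three quoted lines are known.

```python
# Assumed cell value: a single string (brackets included), not a Python list.
keyword_list = (
    "[paths-modified-/clusters/{cluster_id}/hosts/{host_id}/instructions-"
    "operations-modified-POST-parameters-modified-body-reply-schema-"
    "properties-modified-step_type-enum-deleted]"
)

keyword_label_mappings = {
    "POST-parameters-modified": "POST Parameters Modified",
    "PUT-parameters-modified": "PUT Parameters Modified",
}

labels = []
for keyword in keyword_list.split(", "):
    # No ", " in the string, so split returns the whole string unchanged.
    print(keyword == keyword_list)

    # Only the text before the first "-" survives.
    first_part = keyword.split("-")[0]
    print(first_part)

    # Assumed lookup step: "[paths" is not a key, so nothing is appended.
    if first_part in keyword_label_mappings:
        labels.append(keyword_label_mappings[first_part])

print(labels)
```

Seeing the intermediate values this way makes it clear that the keys can never match: they contain hyphens themselves, so splitting on "-" breaks them apart before the lookup.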

First of all, try debugging your code: use a debugger if you have one, and otherwise add print statements to see the results of the splits. You should also look over all the values and try to find a pattern in the data, so that you can come up with a better way of splitting out the key you want to look up in the dictionary.

If it still does not work, it would help to post the output of difference.head() so we can get a better idea of your df.

huangapple
  • Posted on May 22, 2023, 19:02:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76305499.html