Keyword to label mapping for list column in pandas
Question
I have a column named `surface-wordings` in my df named `difference`, which has values like this:

    ['paths-modified-/clusters/{cluster_id}/hosts/{host_id}/instructions-operations-modified-POST-parameters-modified-body-reply-schema-properties-modified-step_type-enum-deleted']

I have a keyword-to-label mapping defined, and I want to extract the keywords from this column and assign the matching labels to a new column named `labels`. The mapping is as follows:

    keyword_label_mappings = {
        'POST-parameters-modified': 'POST Parameters Modified',
        'PUT-parameters-modified': 'PUT Parameters Modified',
    }

I am not sure how I could achieve this; any suggestions or ideas would be greatly appreciated.
Answer 1
Score: 1
I slightly modified your `keyword_label_mappings` dict to get an output with your second sample:

```python
keyword_label_mappings = {
    'POST-parameters-modified': 'POST Parameters Modified',
    'PUT-parameters-modified': 'PUT Parameters Modified',
    'POST-responses-modified': 'POST Responses Modified',
    'DELETE-summary-from': 'DELETE Summary Changed',
    'POST-responses-deleted': 'POST Responses Deleted',
    'POST-parameters-added': 'POST Parameters Added',
    'POST-parameters-deleted': 'POST Parameters Deleted',
    'GET-summary-to': 'GET Summary To',  # added for demo
    'GET-summary-from': 'GET Summary From',  # added for demo
}
```

Use `str.extractall` to extract the keys of your dict, then `map` to replace them with the corresponding values:

```python
import re

# Alternation of all escaped keys, wrapped in a single capture group
pattern = fr"({'|'.join(re.escape(k) for k in keyword_label_mappings)})"
difference['labels'] = (
    # One row per list element, then every keyword match within each element
    difference['surface_wordings'].explode().str.extractall(pattern)[0]
    # Keyword -> label, then collect the labels back per original row
    .map(keyword_label_mappings).droplevel('match')
    .groupby(level=0).agg(list)
)
```

Output:

```
>>> difference
                                                                      surface_wordings                              labels
63657  [paths-modified-/pets-operations-modified-GET-summary-from, paths-modified-/...  [GET Summary From, GET Summary To]
63658  [info-version-from, info-version-to, paths-modified-/pets-operations-modifie...  [GET Summary From, GET Summary To]
63659  [paths-modified-/pets-operations-modified-GET-summary-from, paths-modified-/...  [GET Summary From, GET Summary To]
63661  [info-title-from, info-title-to, info-license-deleted, info-version-from, in...  [GET Summary From, GET Summary To]
63662  [openAPI-from, openAPI-to, paths-added, paths-deleted, endpoints-added, endp...                                 NaN
```
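To try this approach end to end, here is a minimal, self-contained sketch. The two sample rows are made up for illustration (they only imitate the string format from the question), and the mapping is cut down to the two keys from the question, so treat it as a toy setup rather than the asker's real data.

```python
import re

import pandas as pd

# Hypothetical sample data: each cell is a real Python list of strings.
difference = pd.DataFrame({
    "surface_wordings": [
        [
            "paths-modified-/pets-operations-modified-POST-parameters-modified-body",
            "paths-modified-/pets-operations-modified-PUT-parameters-modified-body",
        ],
        ["info-version-from", "info-version-to"],  # contains no known keyword
    ]
})

keyword_label_mappings = {
    "POST-parameters-modified": "POST Parameters Modified",
    "PUT-parameters-modified": "PUT Parameters Modified",
}

# One capture group holding an alternation of the escaped keys.
pattern = "(" + "|".join(re.escape(k) for k in keyword_label_mappings) + ")"

labels = (
    difference["surface_wordings"]
    .explode()                      # one row per list element
    .str.extractall(pattern)[0]     # every keyword match, indexed by (row, match)
    .map(keyword_label_mappings)    # keyword -> label
    .droplevel("match")             # keep only the original row index
    .groupby(level=0)
    .agg(list)                      # gather the labels of each row into a list
)

# Assignment aligns on the index, so rows without any match become NaN.
difference["labels"] = labels
print(difference["labels"].tolist())
```

In this toy frame the first row receives both labels and the second row is left as NaN, which mirrors the last row of the output shown above.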
Answer 2
Score: 0
Let's go over your algorithm. Suppose we have one row where the value of the column `surface-wordings` is:

    [paths-modified-/clusters/{cluster_id}/hosts/{host_id}/instructions-operations-modified-POST-parameters-modified-body-reply-schema-properties-modified-step_type-enum-deleted]

After the line `for keyword_list in difference['surface_wordings']:`, `keyword_list` holds that value.

Next, the line `for keyword in keyword_list.split(', '):` yields the same string, because the value contains no ", ", so `keyword` is the same as `keyword_list`.

The line `keyword = keyword.split('-')[0]` then returns `[paths`, since I suppose the entire value you wrote is a string. This new word definitely does not match any key in your `keyword_label_mappings` dictionary, so the result is `[]` in the end.
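The walkthrough can be reproduced in a few lines. This is only a sketch of the steps described here, not the asker's actual code (which is not shown), and it assumes the cell holds the text of a list, brackets included, rather than a real Python list.

```python
# Assumption: the cell is a plain string that merely looks like a list.
keyword_list = (
    "[paths-modified-/clusters/{cluster_id}/hosts/{host_id}/instructions-"
    "operations-modified-POST-parameters-modified-body-reply-schema-"
    "properties-modified-step_type-enum-deleted]"
)

keyword_label_mappings = {
    "POST-parameters-modified": "POST Parameters Modified",
    "PUT-parameters-modified": "PUT Parameters Modified",
}

labels = []
for keyword in keyword_list.split(", "):
    # There is no ", " in the string, so the loop runs once with the whole string.
    first_token = keyword.split("-")[0]
    print(first_token)  # [paths
    # "[paths" is not a key of the mapping, so nothing is ever appended.
    if first_token in keyword_label_mappings:
        labels.append(keyword_label_mappings[first_token])

print(labels)  # []
```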
First of all, try debugging your code: use a debugger if you have one, and otherwise use `print` statements to see the values produced by the splits and so on. You also need to check all the values and look for a pattern in the data, so that you can find a better way of splitting and of getting the key you want to look up in the dictionary.
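One pattern that is visible in the sample value is that each keyword appears as a contiguous substring, so a substring test can replace the splitting altogether. The sketch below is one possible way to do that, not necessarily what this answer had in mind; `labels_for` is a hypothetical helper name, the second sample row is made up, and the code assumes every cell is a real list of strings.

```python
import pandas as pd

keyword_label_mappings = {
    "POST-parameters-modified": "POST Parameters Modified",
    "PUT-parameters-modified": "PUT Parameters Modified",
}


def labels_for(wordings):
    """Return the label of every keyword that occurs in any string of `wordings`."""
    return [
        label
        for keyword, label in keyword_label_mappings.items()
        if any(keyword in wording for wording in wordings)
    ]


# Hypothetical frame: the first row reuses the value from the question.
difference = pd.DataFrame({
    "surface_wordings": [
        [
            "paths-modified-/clusters/{cluster_id}/hosts/{host_id}/instructions-"
            "operations-modified-POST-parameters-modified-body-reply-schema-"
            "properties-modified-step_type-enum-deleted"
        ],
        ["info-version-from", "info-version-to"],
    ]
})

difference["labels"] = difference["surface_wordings"].apply(labels_for)
print(difference["labels"].tolist())
```

Unlike an index-aligned assignment, this version gives rows without any keyword an empty list instead of NaN, which may or may not be what you want downstream.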
It would also help to provide `difference.head()` to give a better idea of your df, if this still does not work.