List of lists of dictionaries with the same keys but different values
Question
First off, let me preface this by asking you not to downvote the post. I've tried other posts with a 'minimal reproducible example' but it didn't work. The log is too complex. So far no one's been able to help.
I need to collect certain key/value pairs from logs from the antivirus. I’ve been able to collect all key/value pairs I need except for one, the action taken by the antivirus.
Everything revolves around the ‘indicators’ key, which contains a list of dicts with each containing a certain piece of information about the virus/malware found. Each of these dicts starts with an id number (1,2,3,4…) which varies depending on the virus/malware. While these dicts have the exact same structure (and same key names), their values differ. Take a gander at the excerpt below:
"indicators": [
    {
        "id": 1,
        "type": "detection_name",
        "field": "malName",
        "value": "HKTL_CAIN",
        "relatedEntities": ["C888A5B2"],
        "filterIds": ["a665ee2c"],
        "provenance": ["Alert"]
    },
    {
        "id": 2,
        "type": "file_sha1",
        "field": "fileHash",
        "value": "",
        "relatedEntities": ["C888A5B2"],
        "filterIds": ["a665ee2c"],
        "provenance": ["Alert"]
    },
    {
        "id": 3,
        "type": "filename",
        "field": "fileName",
        "value": "D:\\ECIH",
        "relatedEntities": ["C888A5B2"],
        "filterIds": ["a665ee2c"],
        "provenance": ["Alert"]
    },
    {
        "id": 4,
        "type": "fullpath",
        "field": "fullPath",
        "value": "D:\\ECIH",
        "relatedEntities": ["C888A5B2"],
        "filterIds": ["a665ee2c"],
        "provenance": ["Alert"]
    },
    {
        "id": 5,
        "type": "text",
        "field": "actResult",
        "value": "File cleaned",
        "relatedEntities": ["C888A5B2"],
        "filterIds": ["a665ee2c"],
        "provenance": ["Alert"]
    },
    {
        "id": 6,
        "type": "text",
        "field": "scanType",
        "value": "Scheduled Scan",
        "relatedEntities": ["C888A5B2"],
        "filterIds": ["a665ee2c"],
        "provenance": ["Alert"]
    }
]
Note that id 1 relates to the malware name, id 2 is the hash value, id 3 is the file name, etc. What I'm interested in is id 5, which contains the action taken by the antivirus. There is a long list of possible actions, but for the sake of example, the actions are 'File cleaned' and 'File quarantined'. The action is always found in the key 'value', but the problem is that 'value' appears everywhere. I noticed that the 'value' I need (server action) is always paired with the 'actResult' value in the 'field' key, and 'field' also appears everywhere.
{
    "id": 5,
    "type": "text",
    "field": "actResult",
    "value": "File cleaned",
    "relatedEntities": ["C888A5B2"],
    "filterIds": ["a665ee2c"],
    "provenance": ["Alert"]
}
Another issue is that these logs aren't always the same length, so some have id1, id2,..id5 whereas others might have 11 ids. Nothing is consistent. Regardless, what I need to do is to capture the values I need and put them into a dataframe, but given that the 'value' key appears everywhere, the script is faulty.
In the sample I provide, there are 10 records total, so I'll have 10 IDs. But since the info on the virus changes based on the virus type, so do the 'indicators'. Hence, where a record is missing, I replace it with a ' - '. But since the 'value' key appears everywhere, I end up with an uneven number of ID/Action.
Please refer to the log here (https://codeshare.io/j0yX1A). The script is below:
actions = ['File cleaned', 'File deleted', 'File quarantined']
actions_list = []
action_list = []
id = [id['id'] for id in logs]
print(id)
for log in logs:
    for indicator in log['indicators']:
        if indicator['value'] in actions:
            action_list.append(indicator['value'])
        else:
            action_list.append('-')
print(action_list)
As you can see, the current script picks up all 'value' keys rather than just the ones whose values are in the actions list.
Expected Output
If there is no key/value, replace it with a ' - '.
['WB-13273-20230604-00000', 'WB-13273-20230603-00000', 'WB-13273-20230601-00001', 'WB-13273-20230601-00000', 'WB-13273-20230529-00000', 'WB-13273-20230526-00000', 'WB-13273-20230523-00001', 'WB-13273-20230523-00000', 'WB-13273-20230510-00002', 'WB-13273-20230510-00003']
[' - ',' - ','File cleaned', 'File cleaned', ' - ','File cleaned', ' - ','File quarantined', 'File quarantined', 'File quarantined']
Same number of IDs and actions.
So how can I collect the action value from the "field" key and replace the missing records with a ' - ' while ignoring the other 'value' keys that aren't needed?
Answer 1
Score: 1
It looks like you might be looking for the else clause of the for loop to account for the cases where a log has no "actResult" indicator. I call this the "no break" clause, as the else is what happens in the event that the for loop did not do a break.
Given your data:
import json

with open("log.json", "r") as file_in:
    log_data = json.load(file_in)

action_list = []
for log_entry in log_data:
    for indicator in log_entry["indicators"]:
        # Only the indicator whose field is "actResult" holds the action
        if indicator.get("field") == "actResult":
            action_list.append((log_entry["id"], indicator["value"]))
            break
    else:
        # Runs only if the inner loop finished without a break,
        # i.e. this log entry has no "actResult" indicator
        action_list.append((log_entry["id"], "--"))

for action in action_list:
    print(action)
This gives you back your list of 10:
('WB-13273-20230601-00001', '--')
('WB-13273-20230601-00000', '--')
('WB-13273-20230529-00000', '--')
('WB-13273-20230526-00000', 'File cleaned')
('WB-13273-20230523-00001', '--')
('WB-13273-20230523-00000', '--')
('WB-13273-20230510-00002', 'File quarantined')
('WB-13273-20230510-00003', 'File quarantined')
('WB-13273-20230510-00001', 'File quarantined')
('WB-13273-20230510-00000', 'File quarantined')
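If the for/else reads as too implicit, the same "first match or placeholder" idea can be sketched with next() and a default value. This is only an alternative spelling, not something the question requires. The two-entry log_data and the helper name first_action below are made up for illustration and are not taken from the real log file:

```python
# Hypothetical sample data shaped like the question's logs (not the real log file).
log_data = [
    {"id": "example_id_0", "indicators": [
        {"id": 1, "field": "malName", "value": "HKTL_CAIN"},
        {"id": 5, "field": "actResult", "value": "File cleaned"},
    ]},
    {"id": "example_id_1", "indicators": [
        {"id": 1, "field": "malName", "value": "HKTL_CAIN"},
    ]},
]


def first_action(log_entry, default="--"):
    """Return the value of the first actResult indicator, or default if there is none."""
    matches = (
        ind["value"]
        for ind in log_entry["indicators"]
        if ind.get("field") == "actResult"
    )
    # next() stops at the first match, much like the break in the for/else version
    return next(matches, default)


action_list = [(entry["id"], first_action(entry)) for entry in log_data]
print(action_list)
# [('example_id_0', 'File cleaned'), ('example_id_1', '--')]
```

Either way there is exactly one entry per log, which is what keeps the IDs and actions the same length.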
Answer 2
Score: -1
Since you're still a relatively new user and you've shown some amount of effort at asking your (somewhat incomplete) question, I'll give you the benefit of the doubt. I think what you're looking for is:
actions_to_track = ['File cleaned', 'File deleted', 'File quarantined']

def get_actresult(log):
    for ind in log['indicators']:
        if ind.get('field') == 'actResult':  # an action indicator
            if ind.get('value') in actions_to_track:
                return ind.get('value')
    return '-'

log_ids_to_actresult = {log['id']: get_actresult(log) for log in logs}
# => {'example_id_0': 'File cleaned'}
Running against logs from https://codeshare.io/j0yX1A, this produces:
{
    'WB-13273-20230510-00000': 'File quarantined',
    'WB-13273-20230510-00001': 'File quarantined',
    'WB-13273-20230510-00002': 'File quarantined',
    'WB-13273-20230510-00003': 'File quarantined',
    'WB-13273-20230523-00000': '-',
    'WB-13273-20230523-00001': '-',
    'WB-13273-20230526-00000': 'File cleaned',
    'WB-13273-20230529-00000': '-',
    'WB-13273-20230601-00000': '-',
    'WB-13273-20230601-00001': '-',
}
However, I'm not sure because your question is unclear and missing key information. My answer is based on the following guesses I've had to make that you should've clarified in your question:
- You've got a list of log dictionaries, each with "id" and "indicators" keys
- There are a variable number of indicators per log. They are all dicts of the same format
- An "action indicator" will have a "field" of "actResult"
- For each log, you want to extract the "value" from an action indicator if it exists and if the value is one of an explicit list of actions you care about, otherwise use "-"
- The actResult values you care about are ['File cleaned', 'File deleted', 'File quarantined']
A complete but minimal-ish example of the logs and indicator data is:
logs = [
    {
        "id": "example_id_0",
        "indicators": [
            {
                "id": 1,
                "type": "detection_name",
                "field": "malName",
                "value": "HKTL_CAIN",
            },
            {
                "id": 5,
                "type": "text",
                "field": "actResult",
                "value": "File cleaned",
            },
        ],
    },
]
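To check the guesses above against all three cases (a tracked action, no action indicator at all, and an action outside the tracked list), a slightly larger sample could look like the sketch below. The ids example_id_1 and example_id_2, the value 'Some other action', and the placeholder parameter are made up for illustration. The two parallel lists mirror the shape of the expected output in the question:

```python
actions_to_track = ['File cleaned', 'File deleted', 'File quarantined']


def get_actresult(log, placeholder=' - '):
    """Return the tracked action for one log, or the placeholder if there is none."""
    for ind in log['indicators']:
        if ind.get('field') == 'actResult' and ind.get('value') in actions_to_track:
            return ind['value']
    return placeholder


# Hypothetical logs covering the three cases; not taken from the real log file.
logs = [
    {"id": "example_id_0", "indicators": [  # tracked action present
        {"id": 1, "field": "malName", "value": "HKTL_CAIN"},
        {"id": 5, "field": "actResult", "value": "File cleaned"},
    ]},
    {"id": "example_id_1", "indicators": [  # no actResult indicator at all
        {"id": 1, "field": "malName", "value": "HKTL_CAIN"},
    ]},
    {"id": "example_id_2", "indicators": [  # actResult present, but not a tracked action
        {"id": 5, "field": "actResult", "value": "Some other action"},
    ]},
]

# One id and one action per log, so both lists always have the same length
ids = [log["id"] for log in logs]
actions = [get_actresult(log) for log in logs]

print(ids)      # ['example_id_0', 'example_id_1', 'example_id_2']
print(actions)  # ['File cleaned', ' - ', ' - ']
```

Because the function returns exactly one value per log instead of appending once per indicator, the mismatch in list lengths described in the question cannot occur.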
It was good that you included your attempt code. It would've helped if you included annotations explaining what you were trying to do (even if you don't know how) and omitted extraneous information (e.g. about antiviruses or something).
Don't simply dismiss all the users giving you critique as "pedantic". People want to answer your question, but they can only do so if you include relevant information and exclude irrelevant information. Please take the above as an example of how you can ask a better question next time.