Machine learning for monitoring servers

Question

I'm looking at PyBrain to take server monitoring alarms and determine the root cause of a problem. I'm happy to train it with supervised learning and to curate the training data sets. The data is structured something like this:

 * Server Type **A** #1
  * Alarm type 1
  * Alarm type 2
 * Server Type **A** #2
  * Alarm type 1
  * Alarm type 2
 * Server Type **B** #1
  * Alarm type **99**
  * Alarm type 2

So there are n servers, with x alarms that can be UP or DOWN. Both n and x are variable.

If Server A1 has alarm 1 & 2 as DOWN, then we can say that service a is down on that server and is the cause of the problem.

If alarm 1 is down on all servers, then we can say that service a is the cause.

There can potentially be multiple options for the cause, so straight classification doesn't seem appropriate.

I would also like to tie other data sources into the net later, such as scripts that simply ping some external service.

All the appropriate alarms may not be triggered at once, due to serial service checks, so it can start with one server down and then another server down 5 minutes later.

I'm trying to do some basic stuff at first:

    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.datasets import SupervisedDataSet
    from pybrain.supervised.trainers import BackpropTrainer

    INPUTS = 2
    OUTPUTS = 1

    # Build the network: 2 input, 3 hidden, 1 output neurons
    net = buildNetwork(INPUTS, 3, OUTPUTS)

    # Build the dataset: 2 inputs and 1 output per sample
    ds = SupervisedDataSet(INPUTS, OUTPUTS)

    # Add one sample: an iterable of inputs and an iterable of outputs
    ds.addSample((0, 0), (0,))

    # Train the network with the dataset
    trainer = BackpropTrainer(net, ds)

    # Train for 10 epochs
    for _ in range(10):
        trainer.train()

    # Keep training until the error stops improving
    trainer.trainUntilConvergence()

    # Run an input through the network
    result = net.activate([2, 1])

But I'm having a hard time mapping a variable number of alarms to a fixed number of inputs. For example, if we add an alarm to a server, or add a server, the whole net needs to be rebuilt. If that has to be done, I can do it, but I want to know if there's a better way.

Another option I'm considering is a different net for each type of server, but I don't see how I could draw an environment-wide conclusion that way, since each net would only evaluate a single host instead of all hosts at once.

Which type of algorithm should I use and how do I map the dataset to draw environment-wide conclusions as a whole with variable inputs?

I'm very open to any algorithm that will work. Go would be even better than Python.

Answer 1

Score: 5

This is actually a challenging problem.

Representation of labels

It's difficult to represent your target labels for learning. As you pointed out,

> If Server A1 has alarm 1 & 2 as DOWN, then we can say that service a is down on that server and is the cause of the problem.
> If alarm 1 is down on all servers, then we can say that service a is the cause.
> There can potentially be multiple options for the cause ...

I guess you need to list all possible causes up front; otherwise we cannot expect an ML algorithm to generalize. To keep it simple, let's say you have only two possible causes of the problem:

1. Service problem 
2. Server problem  

Site-wise binary classifier

Suppose that in your first ML model the two causes above are the only ones. Then you are building a site-wise binary classifier. Logistic regression is probably a good way to get started, since it is easy to interpret.
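To make the logistic regression suggestion concrete, here is a minimal sketch in plain Python. The two features (fraction of servers with a given alarm DOWN), the training snapshots, and the learning rate are all made up for illustration; in practice you would more likely use an existing library than hand-rolled gradient descent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict_proba(weights, bias, x):
    """Probability that the snapshot is a service problem (label 1)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical site-wide snapshots:
# [fraction of servers with alarm 1 DOWN, fraction with alarm 2 DOWN]
# Label 1 = service problem, 0 = server problem (assumed labelling).
X = [[0.9, 0.8], [1.0, 0.9], [0.8, 1.0], [0.1, 0.1], [0.2, 0.0], [0.0, 0.2]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(X, y)
print(predict_proba(weights, bias, [0.95, 0.9]))  # alarm DOWN almost everywhere
print(predict_proba(weights, bias, [0.05, 0.1]))  # alarm DOWN on very few servers
```

The learned weights can be read directly: a large positive weight on a feature means that feature pushes the prediction towards "service problem", which is the interpretability benefit mentioned above.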

Finding out which server or which service is the problem can be your second step. To solve the second step, based on your example:

  • If it is a service problem, I think some decision rules can be derived manually so that the service name can be pinpointed. The idea is that you should see a significant number of servers triggering the same alarm, right? Also see the advanced readings at the end for more options.
  • If it is a server problem, you can construct a second binary classifier (an individual server-side classifier), which runs on each server using only features coming from that server and answers the question: "Do I have a problem?"
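A hand-written decision rule for the service case could look roughly like the sketch below. The snapshot layout, the 80% threshold, and the `min_servers` guard are assumptions for illustration; mapping a flagged alarm back to a service name would be a lookup table you maintain yourself.

```python
from collections import Counter


def suspect_alarms(alarm_states, threshold=0.8, min_servers=2):
    """Return alarm types that are DOWN on at least `threshold` of the
    servers reporting them, ignoring alarms seen on too few servers."""
    down = Counter()
    reporting = Counter()
    for alarms in alarm_states.values():
        for alarm, state in alarms.items():
            reporting[alarm] += 1
            if state == "DOWN":
                down[alarm] += 1
    return sorted(
        alarm
        for alarm, count in reporting.items()
        if count >= min_servers and down[alarm] / count >= threshold
    )


# Hypothetical snapshot: server name -> {alarm name: state}
snapshot = {
    "A1": {"alarm1": "DOWN", "alarm2": "UP"},
    "A2": {"alarm1": "DOWN", "alarm2": "UP"},
    "B1": {"alarm99": "UP", "alarm2": "DOWN"},
}

print(suspect_alarms(snapshot))
```

Because the rule only counts servers that actually report an alarm, adding a server or a new alarm type does not require changing the rule itself.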

Features for the site-wise binary classifier

I assume the alarms are the best source of features. For the site-wise classifier, summary statistics computed over all servers are likely to work better than raw alarm states. For example:

  • the percentage of servers that report alarm A as DOWN
  • the average length of time alarm B has been DOWN, across all servers where it is DOWN
  • among all servers where alarm B is DOWN, the percentage that also have alarm A DOWN
  • ...
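The three example statistics above can be computed from a snapshot of any size, which is what gives the classifier a fixed number of inputs. A minimal sketch, assuming a hypothetical snapshot format where each server maps alarm names to the number of seconds the alarm has been DOWN (0 meaning UP):

```python
def site_features(snapshot, alarm_a, alarm_b):
    """Build a fixed-length feature vector from a variable-size snapshot.

    snapshot maps server name -> {alarm name: seconds DOWN, 0 if UP}.
    Servers that do not have an alarm at all are treated as not DOWN.
    """
    n = len(snapshot)
    a_down = {s for s, alarms in snapshot.items() if alarms.get(alarm_a, 0) > 0}
    b_down = [s for s, alarms in snapshot.items() if alarms.get(alarm_b, 0) > 0]

    # Percentage of servers that report alarm A as DOWN
    pct_a_down = len(a_down) / n if n else 0.0
    # Average DOWN duration of alarm B across servers where it is DOWN
    avg_b_duration = (
        sum(snapshot[s][alarm_b] for s in b_down) / len(b_down) if b_down else 0.0
    )
    # Among servers with alarm B DOWN, the share that also have alarm A DOWN
    pct_a_given_b = (
        sum(1 for s in b_down if s in a_down) / len(b_down) if b_down else 0.0
    )
    return [pct_a_down, avg_b_duration, pct_a_given_b]


# Hypothetical snapshot with three servers of two types
snapshot = {
    "A1": {"alarm1": 300, "alarm2": 120},
    "A2": {"alarm1": 0, "alarm2": 60},
    "B1": {"alarm99": 0, "alarm2": 0},
}

features = site_features(snapshot, "alarm1", "alarm2")
print(features)
```

Adding a server changes the values in the vector but not its length, so the model does not need to be rebuilt.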

Features for the server-side binary classifier

For the server-side classifier, you should use all alarm signals explicitly as features. At training time, however, you should pool the data from all of the servers. The labels are just "has-problem" or "has-no-problem". The training data will look like:

  alarm A on, alarm B on, alarm C on, ..., alarm Z on, has-problem
  YES,        YES,        NO,         ..., YES,        YES
  NO,         YES,        NO,         ..., NO,         NO
  ?,          NO,         YES,        ..., NO,         NO

Note that I used "?" to mark alarms whose state is missing (unknown). This can be used to describe the situation below:

> All the appropriate alarms may not be triggered at once, due to serial service checks, so it can start with one server down and then another server down 5 minutes later.
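One way to build such pooled training rows is to fix an alarm vocabulary once and look every observation up against it, leaving unknown states as `None`. This is only a sketch; the `observations` format and the alarm names are hypothetical.

```python
def build_rows(observations, alarm_vocabulary):
    """Turn per-server observations into fixed-width training rows.

    observations: list of (alarms, has_problem) pairs, where alarms maps
    alarm name -> True (on) / False (off). Alarms missing from the dict
    are unknown and become None in the row.
    """
    rows = []
    for alarms, has_problem in observations:
        row = [alarms.get(name) for name in alarm_vocabulary]
        rows.append((row, has_problem))
    return rows


# The three example rows from the table, pooled from different servers
vocabulary = ["A", "B", "C", "Z"]
observations = [
    ({"A": True, "B": True, "C": False, "Z": True}, True),
    ({"A": False, "B": True, "C": False, "Z": False}, False),
    ({"B": False, "C": True, "Z": False}, False),  # alarm A not checked yet
]

rows = build_rows(observations, vocabulary)
for row, label in rows:
    print(row, label)
```

The `None` entries still have to be turned into numbers before training; how to do that is a separate encoding decision.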

Some advanced readings

This problem is related to a few topics, e.g., [alarm correlation], [event correlation], [fault diagnosis].
[alarm correlation]: https://www.macs.hw.ac.uk/~dwcorne/RSR/eventcorr.pdf
[event correlation]: http://en.wikipedia.org/wiki/Event_correlation
[fault diagnosis]: http://ac.els-cdn.com/S0167642304000772/1-s2.0-S0167642304000772-main.pdf?_tid=ded0f458-4096-11e4-b2b6-00000aab0f6c&acdnat=1411197925_6be612d96c7c019ae583dd953df878be

Answer 2

Score: 3

There are a number of options for variable inputs, but two relatively simple ones are:

  1. Inputs that are not present are coded as 0.5, while inputs that are present are coded as either 0 or 1.
  2. Alternatively, you could split each input into two: one for "present" vs. "not present", the other for "active" vs. "silent". The network then has to use the interaction between the two to learn that the second column only matters if the first one is 1, and not if the first one is 0. With enough training cases it can probably do this.

The methods can be combined, of course.
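Both encodings are easy to sketch. Here each alarm state is assumed to be `True` (active), `False` (silent), or `None` (not present); the function names are made up for illustration.

```python
def encode_midpoint(states):
    """Option 1: not present -> 0.5, present -> 0.0 (silent) or 1.0 (active)."""
    return [0.5 if s is None else float(s) for s in states]


def encode_two_columns(states):
    """Option 2: two inputs per alarm, (present, active).

    A missing alarm becomes (0, 0); a present alarm becomes (1, 0) or (1, 1).
    """
    encoded = []
    for s in states:
        encoded.extend([0.0, 0.0] if s is None else [1.0, float(s)])
    return encoded


states = [None, False, True, False]
print(encode_midpoint(states))
print(encode_two_columns(states))
```

Option 2 doubles the number of network inputs, but it keeps "silent" and "not present" distinguishable instead of placing the missing value between the two real states.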

huangapple
  • Posted on September 11, 2014, 00:20:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/25770429.html