Cheapest way to classify HTTP POST objects
Question
I can use SciPy to classify text on my machine, but I need to categorize string objects from HTTP POST requests at, or near, real time. What algorithms should I research if my goals are high concurrency, near-real-time output, and a small memory footprint? I figured I could get by with a Support Vector Machine (SVM) implementation in Go, but is that the best algorithm for my use case?
Answer 1
Score: 1
Yes, SVM (with a linear kernel) should be a good starting point. You can use scikit-learn (it wraps liblinear, I believe) to train your model. After the model is learned, it is simply a list of feature:weight pairs for each category you want to classify into. Something like this (suppose you have only 3 classes):
class1[feature1] = weight11
class1[feature2] = weight12
...
class1[featurek] = weight1k ------- for class 1
... different <feature, weight> ------ for class 2
... different <feature, weight> ------ for class 3, etc.
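For concreteness, here is a minimal sketch of that model layout in Go (the language the asker mentioned); the feature names and weight values are made up for illustration:

```go
// model maps each class to its feature -> weight table, mirroring the
// per-class lists above. In practice you would export these weights
// from the trained liblinear/scikit-learn model; the numbers here are
// placeholders.
var model = map[string]map[string]float64{
	"class1": {"feature1": 0.8, "feature2": -0.3, "feature3": 1.2, "feature5": 0.4},
	"class2": {"feature1": -0.1, "feature3": 0.6, "feature5": 0.9},
	"class3": {"feature2": 0.5, "feature3": -0.7},
}
```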
At prediction time, you don't need scikit-learn at all; you can use whatever language you are using on the server backend to do a linear computation. Suppose a specific POST request contains the features (feature3, feature5); what you need to do is this:
linear_score[class1] = 0
linear_score[class1] += lookup weight of feature3 in class1
linear_score[class1] += lookup weight of feature5 in class1
linear_score[class2] = 0
linear_score[class2] += lookup weight of feature3 in class2
linear_score[class2] += lookup weight of feature5 in class2
..... same thing for class3
Pick whichever of class1, class2, or class3 has the highest linear_score.
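Sketched in Go, that whole prediction step is a pair of nested loops over the model map from the previous sketch; a feature absent from a class simply contributes zero:

```go
// predict implements the loop above: score every class against the
// request's active features and return the argmax.
func predict(model map[string]map[string]float64, features []string) string {
	bestClass := ""
	bestScore := 0.0
	first := true
	for class, weights := range model {
		score := 0.0
		for _, f := range features {
			score += weights[f] // a missing feature reads as weight 0
		}
		if first || score > bestScore {
			bestClass, bestScore, first = class, score, false
		}
	}
	return bestClass
}
```

Calling predict(model, []string{"feature3", "feature5"}) then returns whichever of class1, class2, or class3 scores highest.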
One step further: if you have some way to define the feature weights (e.g., using the tf-idf scores of tokens), then your prediction could become:
linear_score[class1] += class1[feature3] x feature_weight[feature3]
so on and so forth.
Note that feature_weight[featurek] is usually different for each request.
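A minimal sketch of that weighted variant, assuming featureWeight holds the request's tf-idf scores keyed by token:

```go
// predictWeighted extends predict with per-request feature weights
// (e.g., tf-idf scores), computing
// linear_score[class] += class[feature] x feature_weight[feature].
func predictWeighted(model map[string]map[string]float64, featureWeight map[string]float64) string {
	bestClass := ""
	bestScore := 0.0
	first := true
	for class, weights := range model {
		score := 0.0
		for f, w := range featureWeight {
			score += weights[f] * w // 0 if the class lacks this feature
		}
		if first || score > bestScore {
			bestClass, bestScore, first = class, score, false
		}
	}
	return bestClass
}
```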
Since for each request the total number of active features must be much smaller than the total number of considered features (think 50 tokens or features vs. your entire vocabulary of 1 million tokens), the prediction should be very fast. I can imagine that once your model is ready, the prediction could be implemented on top of a key-value store (e.g., Redis).
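As one illustration of that idea (a sketch, not a definitive implementation), here is how the per-class lookup might read weights from Redis using the third-party go-redis client; the weights:<class> hash layout is a hypothetical naming choice:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// scoreClass sums the weights of the active features for one class,
// reading each weight from a Redis hash keyed "weights:<class>".
func scoreClass(ctx context.Context, rdb *redis.Client, class string, features []string) (float64, error) {
	total := 0.0
	for _, f := range features {
		w, err := rdb.HGet(ctx, "weights:"+class, f).Float64()
		if err == redis.Nil {
			continue // this feature has no weight for this class
		} else if err != nil {
			return 0, err
		}
		total += w
	}
	return total, nil
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	score, err := scoreClass(ctx, rdb, "class1", []string{"feature3", "feature5"})
	if err != nil {
		panic(err)
	}
	fmt.Println("class1 score:", score)
}
```

Looking features up one at a time costs a network round trip each; batching all of a request's lookups per class (e.g., with a single HMGET) would keep latency lower under load.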