Question

I can use SciPy to classify text on my machine, but I need to categorize string objects from HTTP POST requests in real time or near real time. What algorithms should I research if my goals are high concurrency, near-real-time output, and a small memory footprint? I figured I could get by with an SVM (Support Vector Machine) implementation in Go, but is that the best algorithm for my use case?

Answer 1

Score: 1

Yes, an SVM with a linear kernel should be a good starting point. You can train the model with scikit-learn (its LinearSVC wraps liblinear). Once trained, the model is simply a list of feature:weight pairs for each category you want to classify into. Suppose you have only 3 classes; the model looks something like this:

class1[feature1] = weight11
class1[feature2] = weight12
...
class1[featurek] = weight1k    ------- for class 1

... different <feature, weight> ------ for class 2
... different <feature, weight> ------ for class 3, etc.

At prediction time you don't need scikit-learn at all; you can do the linear computation in whatever language your server backend uses. Suppose a specific POST request contains the features (feature3, feature5). What you need to do is this:

linear_score[class1] = 0
linear_score[class1] += lookup weight of feature3 in class1
linear_score[class1] += lookup weight of feature5 in class1

linear_score[class2] = 0
linear_score[class2] += lookup weight of feature3 in class2
linear_score[class2] += lookup weight of feature5 in class2

..... same thing for class3
pick class1, or class2 or class3 whichever has the highest linear_score
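
The lookup-and-sum procedure above could be sketched roughly as follows. This is a minimal illustration in Python (the same logic ports directly to Go or any other backend language). The class names, feature names, and weights in `MODEL` are made up for the example, and per-class bias terms are left out, as in the pseudocode.

```python
# Hypothetical trained model: one {feature: weight} table per class.
# All names and numbers here are illustrative, not from a real model.
MODEL = {
    "class1": {"feature3": 0.8, "feature5": -0.2},
    "class2": {"feature3": -0.5, "feature5": 0.3},
    "class3": {"feature1": 0.4},
}


def linear_scores(model, active_features):
    """Sum, per class, the weights of the features present in the request."""
    scores = {}
    for label, weights in model.items():
        # A feature the class has no weight for contributes 0.
        scores[label] = sum(weights.get(f, 0.0) for f in active_features)
    return scores


def predict(model, active_features):
    """Return the class with the highest linear score."""
    scores = linear_scores(model, active_features)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # A request whose active features are feature3 and feature5.
    print(predict(MODEL, ["feature3", "feature5"]))
```

Each class costs one dictionary lookup per active feature, so the work scales with the size of the request rather than with the size of the vocabulary.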

One step further: if you have a way to assign each feature a weight per request (e.g., the tf-idf score of each token), then your prediction becomes:

linear_score[class1] += class1[feature3] x feature_weight[feature3]
and so on for the remaining features and classes.
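
As a rough sketch of the weighted variant, the snippet below multiplies each class weight by a per-request feature weight. The weighting here is an assumption for illustration: a raw term count times a value from a hypothetical `IDF` table. In practice the request-side weighting would need to match whatever was applied at training time, which may include smoothing or normalization. The labels, tokens, and numbers are all made up.

```python
from collections import Counter

# Hypothetical per-class weights and IDF table; in practice both would be
# exported from training. All values are made up for illustration.
CLASS_WEIGHTS = {
    "spam": {"free": 1.2, "offer": 0.9, "meeting": -0.7},
    "work": {"free": -0.4, "offer": 0.1, "meeting": 1.5},
}
IDF = {"free": 2.0, "offer": 1.5, "meeting": 1.0}


def feature_weights(tokens, idf):
    """Per-request feature weights: term count times IDF (a simple tf-idf)."""
    counts = Counter(tokens)
    # Tokens outside the known vocabulary are dropped.
    return {tok: n * idf[tok] for tok, n in counts.items() if tok in idf}


def predict(tokens, class_weights, idf):
    """Score each class as sum(class weight x request feature weight)."""
    fw = feature_weights(tokens, idf)
    scores = {
        label: sum(weights.get(f, 0.0) * v for f, v in fw.items())
        for label, weights in class_weights.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    label, scores = predict("free free offer today".split(), CLASS_WEIGHTS, IDF)
    print(label, scores)
```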

Note that feature_weight[feature k] is usually different for each request. Because the number of active features in a request is typically far smaller than the total number of features considered (think 50 tokens versus an entire vocabulary of 1 million tokens), prediction should be very fast. I can imagine that once your model is ready, the prediction step could be implemented on top of a key-value store (e.g., redis).
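
One possible key-value layout, sketched under assumptions: each key is a feature, and its value holds the weights that feature carries in every class, so a request needs one lookup per active feature. A plain Python dict stands in for the store here; no real Redis client calls are shown, and `STORE`, `CLASSES`, and the numbers are hypothetical.

```python
# Stand-in for a key-value store: a plain dict plays the role of Redis.
# Key = feature, value = that feature's weight in each class (made-up numbers).
STORE = {
    "feature3": {"class1": 0.8, "class2": -0.5},
    "feature5": {"class1": -0.2, "class2": 0.3, "class3": 0.1},
}
CLASSES = ["class1", "class2", "class3"]


def predict_from_store(store, classes, request_features):
    """Score classes using one store lookup per active feature.

    request_features maps each active feature to its per-request weight
    (use 1.0 for plain presence/absence features).
    """
    # Start every class at 0 so classes untouched by the request still compete.
    scores = {label: 0.0 for label in classes}
    for feature, value in request_features.items():
        # Features missing from the store are skipped.
        for label, weight in store.get(feature, {}).items():
            scores[label] += weight * value
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(predict_from_store(STORE, CLASSES, {"feature3": 1.0, "feature5": 1.0}))
```

Keying by feature rather than by class means the number of lookups depends only on how many features are active in the request, not on the number of classes or the vocabulary size.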
