2022年10月31日 19:00:47go评论107阅读模式

英文:

Scheduler-worker cluster without port forwarding

问题

你好，Satckoverflow！

TLDR：我想重新创建https://github.com/KorayGocmen/scheduler-worker-grpc，但不需要在worker上进行端口转发。

我正在尝试构建一个竞赛编程评测服务器，作为我在学校教授编程给孩子们的项目。

由于评测需要大量计算资源，我希望有多个工作节点。调度器将接收提交的代码，并将其分配给工作节点。为了方便工作节点的部署（因为它们经常会变动），我希望工作节点能够订阅调度器，从而成为一个工作节点并接收任务。

工作节点可能不在与调度器相同的网络上，而且工作节点位于一个虚拟机中（也许以后会迁移到Docker上，但目前存在一些问题）。

调度器应该能够了解工作节点的资源使用情况，向工作节点发送不同类型的任务，并接收结果的流。

我目前考虑使用gRPC来满足工作节点和调度器之间的通信需求。

我可以创建多个调度器服务方法，例如：

注册工作节点，接收任务流
任务结果流，不接收任何内容
定期流式传输工作节点状态，不接收任何内容

然而，我更喜欢以下方式，但不确定是否可行：

调度器的gRPC API：
- 注册一个工作节点（使工作节点的gRPC API对调度器可用）
工作节点的gRPC API：
- 启动一个任务（返回任务状态的流）
- 取消一个任务???
- 获取资源使用情况

如果连接丢失，工作节点应该自动取消注册。

所以我的问题是...如果工作节点位于NAT后面而无需进行端口转发，是否可以创建一个可以注册到调度器以供以后使用的gRPC工作节点API？

额外的可能不必要的信息：

更糟糕的是，我有多种完全不同类型的任务（流式交互式控制台，针对预先准备的测试用例执行代码）。我可能会为不同的任务创建不同的工作节点。

有时，任务涉及到在本地文件系统上有大文件（最多500MB），这些文件通常保存在调度器附近，因此我希望将任务发送到已经从调度器下载了特定文件的工作节点。否则，在其中一个工作节点上下载大文件。在工作节点上同时拥有所有文件将占用超过20GB的空间，因此我希望避免这种情况。

一个工作节点可以同时运行多个任务（最多16个）。

我正在使用Go语言编写该系统。

英文:

Hello Satckoverflow!

TLDR I would like to recreate https://github.com/KorayGocmen/scheduler-worker-grpc without port forwarding on the worker.

I am trying to build a competitive programming judge server for evaluation of submissions as a project for my school where I teach programming to kids.

Because the evaluation is computationally heavy I would like to have multiple worker nodes.
The scheduler would receive submissions and hand them out to the worker nodes. For ease of worker deployment ( as it will be often changing ) I would like the worker to be able to subscribe to the scheduler and thus become a worker and receive jobs.

The workers may not be on the same network as the scheduler + the worker resides in a VM ( maybe later will be ported to docker but currently there are issues with it ).

The scheduler should be able to know resource usage of the worker, send different types of jobs to the worker and receive a stream of results.

I am currently thinking of using grpc to address my requirements of communication between workers and the scheduler.

I could create multiple scheduler service methods like:

register worker, receive a stream of jobs
stream job results, receive nothing
stream worker state periodically, receive nothing

However I would prefer the following but idk whether it is possible:

The scheduler GRPC api:
- register a worker ( making the worker GRPC api available to the scheduler )
The worker GRPC api:
- start a job ( returns stream of job status )
- cancel a job ???
- get resource usage

The worker should unregister automatically if the connection is lost.

So my question is... is it possible to create a grpc worker api that can be registered to the scheduler for later use if the worker is behind a NAT without port forwarding?

Additional possibly unnecessary information:

Making matters worse I have multiple radically different types of jobs ( streaming an interactive console, executing code against prepared testcases ). I may just create different workers for different jobs.

Sometimes the jobs involve having large files on the local filesystem ( up to 500 MB ) that are usually kept near the scheduler therefore I would like to send the job to a worker which already has the specific files downloaded from the scheduler. Otherwise download the large files on one of the workers. Having all files at the same time on the worker would take more than 20 GB therefore I would like to avoid it.

A worker can run multiple jobs ( up to 16 ) at the same time.

I am writing the system in go.

答案1

得分: 0

只要工作人员发起连接，您就不必担心NAT。gRPC支持双向流式传输。这意味着您可以只使用调度程序上的一个服务器来实现所有要求；调度程序无需连接回工作人员。

根据您的描述，您的服务可能如下所示：

syntax = "proto3";
import "google/protobuf/empty.proto";
service Scheduler {
    rpc GetJobs(GetJobsRequest) returns (stream GetJobsResponse) {}
    rpc ReportWorkerStatus(stream ReportWorkerStatusRequest) returns (google.protobuf.Empty) {}
    rpc ReportJobStatus(stream JobStatus) returns (stream JobAction) {}
}
enum JobType {
    JOB_TYPE_UNSPECIFIED = 0;
    JOB_TYPE_CONSOLE = 1;
    JOB_TYPE_EXEC = 2;
}
message GetJobsRequest {
    // 该工作人员愿意接受的作业类型列表。
    repeated JobType types = 1;
}
message GetJobsResponse {
    string jobId = 0;
    JobType type = 1;
    string fileName = 2;
    bytes fileContent = 3;
    // 等等。
}
message ReportWorkerStatusRequest {
    float cpuLoad = 0;
    uint64 availableDiskSpace = 1;
    uint64 availableMemory = 2;
    // 等等。
    // 文件名或文件哈希列表，或其他您需要精确报告文件存在的信息。
    repeated string haveFiles = 2;
}

其中很多内容都是个人偏好的问题（例如，您可以使用oneof而不是枚举），但希望您清楚地知道，从客户端到服务器的单个连接足以满足您的要求。

维护可用工作人员集合非常简单：

func (s *Server) GetJobs(req *pb.GetJobRequest, stream pb.Scheduler_GetJobsServer) error {
    ctx := stream.Context()
    s.scheduler.AddWorker(req)
    defer s.scheduler.RemoveWorker(req)
    for {
        job, err := s.scheduler.GetJob(ctx, req)
        switch {
        case ctx.Err() != nil: // 客户端断开连接
            return nil
        case err != nil:
            return err
        }
        if err := stream.Send(job); err != nil {
            return err
        }
    }
}

基础教程中包含了各种类型的流式传输示例，包括Go中的服务器和客户端实现。

至于注册，通常只需创建某种凭据，工作人员在与服务器通信时将使用该凭据。这可以是随机生成的令牌（服务器可以使用该令牌加载关联的元数据），或者是用户名/密码组合，或者是TLS客户端证书等。具体细节将取决于您的基础架构和设置工作人员时所需的工作流程。

英文:

As long as only the workers initiate the connections you don't have to worry about NAT. gRPC supports streaming in either direction (or both). This means that all of your requirements can be implemented using just one server on the scheduler; there is no need for the scheduler to connect back to the workers.

Given your description your service could look something like this:

syntax = &quot;proto3&quot;;
import &quot;google/protobuf/empty.proto&quot;;
service Scheduler {
    rpc GetJobs(GetJobsRequest) returns (stream GetJobsResponse) {}
    rpc ReportWorkerStatus(stream ReportWorkerStatusRequest) returns (google.protobuf.Empty) {}
    rpc ReportJobStatus(stream JobStatus) returns (stream JobAction) {}
}
enum JobType {
    JOB_TYPE_UNSPECIFIED = 0;
    JOB_TYPE_CONSOLE = 1;
    JOB_TYPE_EXEC = 2;
}
message GetJobsRequest {
    // List of job types this worker is willing to accept.
    repeated JobType types = 1;
}
message GetJobsResponse {
    string jobId = 0;
    JobType type = 1;
    string fileName = 2;
    bytes fileContent = 3;
    // etc.
}
message ReportWorkerStatusRequest {
    float cpuLoad = 0;
    uint64 availableDiskSpace = 1;
    uint64 availableMemory = 2;
    // etc.
    // List of filenames or file hashes, or whatever else you need to precisely
    // report the presence of files.
    repeated string haveFiles = 2;
}

Much of this is a matter of preference (you can use oneof instead of enums, for instance), but hopefully it's clear that a single connection from client to server is sufficient for your requirements.

Maintaining the set of available workers is quite simple:

func (s *Server) GetJobs(req *pb.GetJobRequest, stream pb.Scheduler_GetJobsServer) error {
ctx := stream.Context()
s.scheduler.AddWorker(req)
defer s.scheduler.RemoveWorker(req)
for {
job, err := s.scheduler.GetJob(ctx, req)
switch {
case ctx.Err() != nil: // client disconnected
return nil
case err != nil:
return err
}
if err := stream.Send(job); err != nil {
return err
}
}
}

The Basics tutorial includes examples for all types of streaming, including server and client implementations in Go.

As for registration, that usually just means creating some sort of credential that a worker will use when communicating with the server. This might be a randomly generated token (which the server can use to load associated metadata), or a username/password combination, or a TLS client certificate, or similar. Details will depend on your infrastructure and desired workflow when setting up workers.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

无需端口转发的调度器-工作节点集群

问题

答案1

设置环境 – $GOPATH

Docker构建: “go: 在当前目录或任何父目录中找不到go.mod文件”

gosimple/oauth2和Google OAuth2

How to generate and store a uuid type 0x04 in mongodb using golang?

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。