How to approach designing a near real-time data stream filtering solution to serve large data to a web client?


Question


I have to deliver a large CSV file containing user statistics events to a web client, and along the way I need to process and filter the data to compute some metrics. This should be near real time, since the data source will always have new data.

I have investigated a bit, but I'm not sure how to approach this problem. I have heard of data streaming, or push-based data streaming, but I'm not sure how I could use, for example, Kafka to do the stream processing or filtering in the broker before handing the results to the consumer.

As a first step, I'm planning to split the file into chunks using a spliterator and then send these chunks into partitions, but this is the part I'm confused about: how and where does the filtering happen?
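
To illustrate that first step, here is a rough sketch of the producing side I have in mind (the topic name `user-events`, the file name, and the CSV layout are placeholders, not my real data):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.stream.Stream;

public class CsvToKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Files.lines is backed by a spliterator and reads lazily,
        // so the whole CSV never has to fit in memory.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Stream<String> lines = Files.lines(Paths.get("user-events.csv"))) {
            lines.skip(1)  // skip the CSV header line
                 .forEach(line ->
                         // No key is set here, so Kafka spreads records over partitions
                         // on its own -- which is exactly why one user's events end up
                         // scattered across several partitions.
                         producer.send(new ProducerRecord<>("user-events", line)));
        }
    }
}
```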

Let me explain the confusing part as well as I can: I read the file in chunks to avoid an out-of-memory exception, but to apply a filter, since the data is unsorted, I think I need the whole data set, which could lead to a memory exception again. So I'm not sure whether I need to apply the filter to each chunk in the partitions and then merge the results, in which case I think I need to apply the same filter again to the merged results. Is this the idea behind data stream processing when using Kafka in this case?

To make it more concrete, let's say this is user activity data and I need to find the average length of user sessions. In this case, the user sessions are scattered across chunks in several partitions. Do I need to compute the average in each chunk of each partition and then calculate again on top of that? Or, if I need to filter followed users, how can I accumulate the results in that case?

Answer 1

Score: 1


Filtering does not happen at the level of the broker. If you plan to use Kafka Streams, for example, you need to create a separate application that performs the filtering and aggregation logic. You can read your file and send it to Kafka line by line. Your application will then read the data from the topic and perform the filtering. If you need to calculate the average session length per user, you should set the user's identifier as the key, so that all records with the same id go to the same partition. In this case you can run several instances of the application, and each of them will read from its own partitions and calculate the statistics.
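
As a rough sketch of what such an application could look like (the topic names, the CSV layout, and the column positions below are assumptions for illustration, not something prescribed by Kafka Streams):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class AverageSessionLengthApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "avg-session-length");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        KTable<String, String> avgPerUser = builder
                .stream("user-events", Consumed.with(Serdes.String(), Serdes.String()))
                // Filtering happens here, in the application -- not in the broker.
                .filter((key, line) -> line != null && !line.isEmpty())
                // Re-key every record by user id so that one user's events are grouped together.
                // Assumption: column 0 is the user id, column 2 is the session length in seconds.
                .map((key, line) -> {
                    String[] cols = line.split(",");
                    return KeyValue.pair(cols[0], Long.parseLong(cols[2]));
                })
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                // Keep a running "sum,count" accumulator per user.
                .aggregate(
                        () -> "0,0",
                        (userId, sessionLength, agg) -> {
                            String[] parts = agg.split(",");
                            long sum = Long.parseLong(parts[0]) + sessionLength;
                            long count = Long.parseLong(parts[1]) + 1;
                            return sum + "," + count;
                        },
                        Materialized.with(Serdes.String(), Serdes.String()))
                // Turn the running sum/count into the current average per user.
                .mapValues(agg -> {
                    String[] parts = agg.split(",");
                    return String.valueOf(
                            (double) Long.parseLong(parts[0]) / Long.parseLong(parts[1]));
                });

        avgPerUser.toStream()
                  .to("avg-session-length-per-user", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Note that the aggregate only keeps a running sum and count per user, so the whole file never needs to be in memory; every new record simply updates the current average.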

The problem is that your task is really batch processing rather than streaming, so it will be hard to tell where the end of the file is and when you should stop processing. In streaming you typically calculate statistics over time windows.
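
If you do want windowed results, a sketch of the same grouping over 5-minute tumbling windows might look like the following. It assumes the records are already keyed by user id when they are produced, and uses `TimeWindows.ofSizeWithNoGrace`, which exists in Kafka Streams 3.x (older versions have `TimeWindows.of`):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;

public class WindowedStats {
    // Counts each user's events per 5-minute tumbling window. Because results are
    // emitted per window, the application never needs to know where the "file" ends.
    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("user-events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
               .count()
               // Flatten the windowed key into "userId@windowStart" for the output topic.
               .toStream((windowedUserId, count) ->
                       windowedUserId.key() + "@" + windowedUserId.window().start())
               .to("user-event-counts-per-window",
                       Produced.with(Serdes.String(), Serdes.Long()));
        return builder.build();
    }
}
```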

Another possibility is to implement your logic in KSQL.

Hope this gives you an idea of how to move on.
