问题

我的要求是从一个BQ表中读取数据，进行一些处理，选择基于“score”列的前N行，并将其写入BQ表，同时将这些行发布为PubSub消息。

我制作了下面的示例来创建一个PCollection，并基于“score”的值选择这个PCollection中的前5行。

import apache_beam as beam

with beam.Pipeline() as p:
    elements = (
      p
        | beam.Create([{"name": "Bob", "score": 5}, 
                       {"name":"Adam", "score": 5},
                       {"name":"John", "score": 2},
                       {"name":"Jose", "score": 1},
                       {"name":"Tim", "score": 1},
                       {"name":"Bryan", "score": 1},
                       {"name":"Kim", "score":1}])
        | "Filter Top 5 scores" >> beam.combiners.Top.Of(5, key=lambda t: t['score']) 
        | "Print" >> beam.Map(print)
        )
   # 写入BQ
   # 发布到PubSub

这返回的是一个列表，而不是一个PCollection。因此，以这种格式，我无法将其写回BQ表，也无法发布到PubSub。

选择前N个元素但保持它们作为PCollection的最佳方法是什么？

在我的实际用例中，我可能有大约200万行数据，需要根据一个列选择80万条记录。此外，在其中一个Apache Beam峰会的视频中，我记得听说Top函数将结果保存在一个工作节点上，而不是分布式保存。因此，我认为它返回一个列表。在此之前，它可能会中断的最大行数是多少？

英文:

My requirement is to read from a BQ table, do some processing, select the Top N rows on the "score" column and write it to the BQ Table and also publish the rows as a PubSub message.

I made a sample below to create a PCollection and select top 5 rows from this PCollection based on the value of "score".

import apache_beam as beam

with beam.Pipeline() as p:
    elements = (
      p
        | beam.Create([{&quot;name&quot;: &quot;Bob&quot;, &quot;score&quot;: 5}, 
                       {&quot;name&quot;:&quot;Adam&quot;, &quot;score&quot;: 5},
                       {&quot;name&quot;:&quot;John&quot;, &quot;score&quot;: 2},
                       {&quot;name&quot;:&quot;Jose&quot;, &quot;score&quot;: 1},
                       {&quot;name&quot;:&quot;Tim&quot;, &quot;score&quot;: 1},
                       {&quot;name&quot;:&quot;Bryan&quot;, &quot;score&quot;: 1},
                       {&quot;name&quot;:&quot;Kim&quot;, &quot;score&quot;:1}])
        | &quot;Filter Top 5 scores&quot; &gt;&gt; beam.combiners.Top.Of(5, key=lambda t: t[&#39;score&#39;]) 
        | &quot;Print&quot; &gt;&gt; beam.Map(print)
        )
   # Write to BQ
   # Publish to PubSub

This returns a list instead of a PCollection. Hence I'm unable to write it back to BQ table nor publish to PubSub in this format.

What is the best way to select the top N elements but keep them as a PCollection?

In my real use case I might have around 2 million rows and I need to select 800k records from it based on a column.
Also in one of the Apache Beam summit videos I remember hearing that the Top function will keep the results in one worker node instead of keeping it across. So thats why i assume it is a list. What would be the maximum number of the rows before which it could break?

答案1

得分: 2

为解决您的问题，您可以在 Top 操作之后添加一个 FlatMap 转换：

def test_pipeline(self):
    with TestPipeline() as p:
        elements = (
            p
            | beam.Create([{"name": "Bob", "score": 5},
                           {"name": "Adam", "score": 5},
                           {"name": "John", "score": 2},
                           {"name": "Jose", "score": 1},
                           {"name": "Tim", "score": 1},
                           {"name": "Bryan", "score": 1},
                           {"name": "Kim", "score": 1}])
            | "Filter Top 5 scores" >> beam.combiners.Top.Of(5, key=lambda t: t['score'])
            | "Flatten" >> beam.FlatMap(lambda e: e)
            | "Print" >> beam.Map(print)
        )

在这种情况下，结果是 Dict 的 PCollection，而不是 List[Dict] 的 PCollection。

在处理大数据量时，需要小心，因为 beam.combiners.Top.Of 操作只在一个 worker 中执行。

如果需要的话，您还可以通过 Dataflow Prime 进行优化，使您的 Dataflow 作业更加强大（适当调整，工作节点内的垂直自动缩放等等）。

英文:

To solve your issue, you can add a FlatMap transformation after the Top operator :

def test_pipeline(self):
    with TestPipeline() as p:
        elements = (
                p
                | beam.Create([{&quot;name&quot;: &quot;Bob&quot;, &quot;score&quot;: 5},
                               {&quot;name&quot;: &quot;Adam&quot;, &quot;score&quot;: 5},
                               {&quot;name&quot;: &quot;John&quot;, &quot;score&quot;: 2},
                               {&quot;name&quot;: &quot;Jose&quot;, &quot;score&quot;: 1},
                               {&quot;name&quot;: &quot;Tim&quot;, &quot;score&quot;: 1},
                               {&quot;name&quot;: &quot;Bryan&quot;, &quot;score&quot;: 1},
                               {&quot;name&quot;: &quot;Kim&quot;, &quot;score&quot;: 1}])
                | &quot;Filter Top 5 scores&quot; &gt;&gt; beam.combiners.Top.Of(5, key=lambda t: t[&#39;score&#39;])
                | &quot;Flatten&quot; &gt;&gt; beam.FlatMap(lambda e: e)
                | &quot;Print&quot; &gt;&gt; beam.Map(print)
        )

In this case the result is a PCollection of Dict instead of a PCollection of List[Dict].

You have to be carreful with high volume because the beam.combiners.Top.Of operation is done only in one worker.

You can also optimize and make your Dataflow job more powerful with Dataflow Prime if needed (Right Fitting, Vertical Autoscaling inside a Worker...).

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Apache Beam查找前N个元素Python SDK

问题

答案1

Jupyter packages

处理pip安装依赖冲突

将类型为list[]的列转换为字符串在polars中

如何在使用pdf2image和table-transformers时将图像坐标转换为PDF坐标？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论