Why does the Apache Beam code below return different output?

Question

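With the yield statement, SplitWords.process becomes a generator that produces every word split from the input text. In the first code example, yield word emits the words one at a time; each is passed on to beam.Map(print) and printed.

With return [word], the behavior changes because the return statement sits inside the for loop: process exits during the first iteration and returns a one-element list holding only the first word of that input string. ParDo emits each element of the returned list, so beam.Map(print) receives just the first word of each input element, Strawberry from the first string and Tomato from the second. That is why the output contains only these two words.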
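The difference can be reproduced without Beam. The sketch below is a plain-Python illustration, not Beam code: flat_map is a hypothetical helper that stands in for ParDo by flattening whatever iterable each call returns, and the emoji are left out of the sample strings for brevity.

```python
def process_with_yield(text, delimiter=','):
    # Generator function: execution resumes after each yield,
    # so every word in the string is emitted.
    for word in text.split(delimiter):
        yield word


def process_with_return(text, delimiter=','):
    # Regular function: return ends the call during the first
    # loop iteration, so only the first word is ever returned.
    for word in text.split(delimiter):
        return [word]


def flat_map(fn, elements):
    # Hypothetical stand-in for ParDo: call fn once per input element
    # and flatten the iterables it returns into one list.
    return [out for element in elements for out in fn(element)]


lines = ['Strawberry,Carrot,Eggplant', 'Tomato,Potato']
print(flat_map(process_with_yield, lines))
print(flat_map(process_with_return, lines))
```

The first call prints all five words, while the second prints only Strawberry and Tomato, matching the two outputs in the question.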

I am learning Apache Beam with the Python SDK.
I came across some code on the Beam website, and I have some doubts about it.
Below is the code from the Apache Beam website.

import apache_beam as beam
import re

class SplitWords(beam.DoFn):
 def __init__(self, delimiter=','):
   self.delimiter = delimiter

 def process(self, text):
   for word in text.split(self.delimiter):
     yield  word

with beam.Pipeline() as pipeline:
 plants = (
     pipeline
     | 'Gardening plants' >> beam.Create([
         '🍓Strawberry,🥕Carrot,🍆Eggplant',
         '🍅Tomato,🥔Potato',
     ])
     | 'Split words' >> beam.ParDo(SplitWords(','))
     | beam.Map(print))

This produces the following output:

🍓Strawberry
🥕Carrot
🍆Eggplant
🍅Tomato
🥔Potato

The Beam documentation says this about using a return statement in a ParDo:
"You can also use a return statement with an iterable, like a list or a generator."

But when I change the code to use return instead of yield, it produces different output:

import apache_beam as beam
import re

class SplitWords(beam.DoFn):
 def __init__(self, delimiter=','):
   self.delimiter = delimiter

 def process(self, text):
   for word in text.split(self.delimiter):
     return [word]

with beam.Pipeline() as pipeline:
 plants = (
     pipeline
     | 'Gardening plants' >> beam.Create([
         '🍓Strawberry,🥕Carrot,🍆Eggplant',
         '🍅Tomato,🥔Potato',
     ])
     | 'Split words' >> beam.ParDo(SplitWords(','))
     | beam.Map(print))

Output:

🍓Strawberry
🍅Tomato

Why is that?

Answer 1

Score: 2

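You can use return instead of yield, but then process has to return the complete list of output elements in a single call. ParDo works like a flatMap operation: it emits each element of the returned iterable.

With yield, you emit one element at a time and can repeat it for each element to return.

In your version the return sits inside the for loop, so process exits on the first iteration and only the first word of each input string is emitted. If you want to use return instead, you have to build the whole list first and then return it, for example: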

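import unittest

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline


class SplitWords(beam.DoFn):
    def __init__(self, delimiter=','):
        self.delimiter = delimiter

    def process(self, text):
        # Collect every word first, then return the whole list once,
        # after the loop has finished.
        plants = []
        for word in text.split(self.delimiter):
            plants.append(word)

        return plants


class SplitWordsTest(unittest.TestCase):
    def test_pipeline(self):
        with TestPipeline() as p:
            plants = (
                p
                | 'Gardening plants' >> beam.Create([
                    '🍓Strawberry,🥕Carrot,🍆Eggplant',
                    '🍅Tomato,🥔Potato',
                ])
                | 'Split words' >> beam.ParDo(SplitWords(','))
                | beam.Map(print))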

This gives the expected result:

🍓Strawberry
🥕Carrot
🍆Eggplant
🍅Tomato
🥔Potato
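Since str.split already returns a list, the explicit append loop is not strictly needed when using return. The sketch below uses plain-Python stand-ins for the DoFn (the class names SplitWordsAppend and SplitWordsDirect are hypothetical, and no Beam import is involved) to show that both forms return the same list.

```python
class SplitWordsAppend:
    # Plain-Python stand-in for the DoFn above (hypothetical, no Beam needed).
    def __init__(self, delimiter=','):
        self.delimiter = delimiter

    def process(self, text):
        # Build the list explicitly, then return it after the loop.
        plants = []
        for word in text.split(self.delimiter):
            plants.append(word)
        return plants


class SplitWordsDirect(SplitWordsAppend):
    def process(self, text):
        # str.split already returns a list, so it can be returned as is.
        return text.split(self.delimiter)


line = 'Strawberry,Carrot,Eggplant'
print(SplitWordsAppend().process(line) == SplitWordsDirect().process(line))
```

In a real pipeline, a similar one-step split could likely be written as beam.FlatMap(lambda text: text.split(',')), which applies a function returning an iterable and emits each of its elements.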

Answer 2

Score: 0

BTW: never mix return and yield in the same function. See https://github.com/apache/beam/issues/22969, which we are working on.
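One reason mixing the two is risky comes from Python itself rather than from Beam: a single yield anywhere in a function turns the whole function into a generator, and a return with a value inside a generator only stops iteration. The sketch below is a plain-Python illustration (the function name mixed is hypothetical); whether the linked issue covers exactly this case is an assumption.

```python
def mixed(text, delimiter=','):
    words = text.split(delimiter)
    if len(words) > 2:
        # The yield below makes the entire function a generator,
        # including the else branch.
        for word in words:
            yield word
    else:
        # Inside a generator, return only ends iteration; the list is
        # attached to StopIteration and never reaches the consumer.
        return words


print(list(mixed('Strawberry,Carrot,Eggplant')))
print(list(mixed('Tomato,Potato')))
```

The first call prints all three words, but the second prints an empty list, so the words from the second string are silently lost.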

Posted by huangapple on March 10, 2023, 00:33:07
Source: https://go.coder-hub.com/75687491.html