Why does the Apache Beam code below return different output?
Question
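With yield, the process method of SplitWords becomes a generator that produces every word split from the input text. In the first code example, yield word emits the words one at a time, Beam passes each of them to beam.Map(print), and all of them are printed.
With return [word], process returns a one-element list instead. Returning a list is fine in itself, because beam.ParDo accepts any iterable from process. The problem is that the return statement sits inside the for loop, so the method exits during the first iteration. Each input element therefore contributes only its first word: Strawberry from the first string and Tomato from the second. That is why the output contains only these two words.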
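The difference can be reproduced without Beam. The sketch below is plain Python: emit_all is a hypothetical stand-in for the way ParDo iterates over whatever process returns, not a Beam API, and the emoji are left out of the sample strings for brevity.

```python
def process_with_yield(text, delimiter=','):
    # Generator: the loop resumes after each yield, so every word is produced.
    for word in text.split(delimiter):
        yield word


def process_with_return(text, delimiter=','):
    # return ends the function during the first iteration,
    # so only a one-element list holding the first word comes back.
    for word in text.split(delimiter):
        return [word]


def emit_all(process, inputs):
    # Hypothetical stand-in for ParDo: call process once per input element
    # and flatten each returned iterable into individual outputs.
    outputs = []
    for element in inputs:
        outputs.extend(process(element))
    return outputs


inputs = ['Strawberry,Carrot,Eggplant', 'Tomato,Potato']
print(emit_all(process_with_yield, inputs))
# ['Strawberry', 'Carrot', 'Eggplant', 'Tomato', 'Potato']
print(emit_all(process_with_return, inputs))
# ['Strawberry', 'Tomato']
```

The second call loses Carrot, Eggplant and Potato because the loop never reaches them, which matches the two-line output reported in the question.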
I am learning Apache Beam with the Python SDK.
I came across some code on the Beam website, and I have some doubts about it.
Below is the code from the Apache Beam website.
import apache_beam as beam
import re

class SplitWords(beam.DoFn):
    def __init__(self, delimiter=','):
        self.delimiter = delimiter

    def process(self, text):
        for word in text.split(self.delimiter):
            yield word

with beam.Pipeline() as pipeline:
    plants = (
        pipeline
        | 'Gardening plants' >> beam.Create([
            '🍓Strawberry,🥕Carrot,🍆Eggplant',
            '🍅Tomato,🥔Potato',
        ])
        | 'Split words' >> beam.ParDo(SplitWords(','))
        | beam.Map(print))
This produces the following output.
🍓Strawberry
🥕Carrot
🍆Eggplant
🍅Tomato
🥔Potato
The Beam documentation says this about the return statement in ParDo:
You can also use a return statement with an iterable, like a list or a generator
But when I change the code to use return instead of yield, it produces the output shown after the code.
import apache_beam as beam
import re

class SplitWords(beam.DoFn):
    def __init__(self, delimiter=','):
        self.delimiter = delimiter

    def process(self, text):
        for word in text.split(self.delimiter):
            return [word]  # changed line: was "yield word"

with beam.Pipeline() as pipeline:
    plants = (
        pipeline
        | 'Gardening plants' >> beam.Create([
            '🍓Strawberry,🥕Carrot,🍆Eggplant',
            '🍅Tomato,🥔Potato',
        ])
        | 'Split words' >> beam.ParDo(SplitWords(','))
        | beam.Map(print))
Output:
🍓Strawberry
🍅Tomato
Why is that?
Answer 1
Score: 2
You can use return instead of yield, but then the goal is to return a list that holds every word. It's like a flatMap operation: Beam flattens the returned iterable into individual output elements.
With yield, you repeat it for each element you want to emit.
If you want to use return instead, you have to build a list and return it once, after the loop. For example:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline

def test_pipeline(self):
    with TestPipeline() as p:
        class SplitWords(beam.DoFn):
            def __init__(self, delimiter=','):
                self.delimiter = delimiter

            def process(self, text):
                # Collect every word first, then return the whole list once.
                plants = []
                for word in text.split(self.delimiter):
                    plants.append(word)
                return plants

        plants = (
            p
            | 'Gardening plants' >> beam.Create([
                '🍓Strawberry,🥕Carrot,🍆Eggplant',
                '🍅Tomato,🥔Potato',
            ])
            | 'Split words' >> beam.ParDo(SplitWords(','))
            | beam.Map(print))
This gives the expected result:
🍓Strawberry
🥕Carrot
🍆Eggplant
🍅Tomato
🥔Potato
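Since str.split already returns a list, the intermediate list is not strictly needed. The following is a minimal plain-Python sketch of that idea, not Beam code: SplitWordsSketch is a hypothetical stand-in for a beam.DoFn subclass, and itertools.chain.from_iterable imitates the flatMap-like flattening that ParDo applies to the returned iterable.

```python
from itertools import chain


class SplitWordsSketch:
    # Plain-Python stand-in; in a real pipeline this would subclass beam.DoFn.
    def __init__(self, delimiter=','):
        self.delimiter = delimiter

    def process(self, text):
        # str.split already returns a list, so it can be returned as is.
        return text.split(self.delimiter)


splitter = SplitWordsSketch(',')
lines = ['Strawberry,Carrot,Eggplant', 'Tomato,Potato']

# flatMap-like step: one list per input line, flattened into single words.
words = list(chain.from_iterable(splitter.process(line) for line in lines))
print(words)
# ['Strawberry', 'Carrot', 'Eggplant', 'Tomato', 'Potato']
```

The key point is the same as in the answer's code: return runs once per input element, after all of its words have been collected.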
Answer 2
Score: 0
BTW: never mix return and yield. See https://github.com/apache/beam/issues/22969, which we are working on.
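One reason this advice matters is a plain Python rule: any function whose body contains yield is a generator function, and a value passed to return inside it is attached to StopIteration instead of being iterated. A minimal sketch in plain Python, where mixed is a hypothetical function written for illustration and not code taken from the linked issue:

```python
def mixed(text, delimiter=','):
    words = text.split(delimiter)
    if len(words) == 1:
        # Because of the yield below, this function is a generator, and
        # this return only stops it; the list is not produced as output.
        return words
    for word in words:
        yield word


print(list(mixed('Tomato,Potato')))
# ['Tomato', 'Potato']
print(list(mixed('Tomato')))
# []
```

The single-word input silently produces nothing, which is the kind of surprise that keeping to either return or yield within one process method avoids.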