Google app engine datastore query with cursor won't iterate all items
Question
In my application I have a datastore query with a filter, such as:
datastore.NewQuery("sometype").Filter("SomeField<", 10)
I'm using a cursor to iterate over batches of the result (e.g. in different tasks). If the value of SomeField is changed while iterating over it, the cursor no longer works on Google App Engine (it works fine on devappserver).
I have a test project here: https://github.com/fredr/appenginetest
In my test I ran /db, which sets up the db with 10 items with their values set to 0, then ran /run/2, which iterates over all items where the value is less than 2, in batches of 5, and updates the value of each item to 2.
The result on my local devappserver (all items are updated):
The result on appengine (only five items are updated):
Am I doing something wrong? Is this a bug? Or is this the expected result?
In the documentation it states:
> Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values.
Answer 1
Score: 3
The problem is the nature and implementation of the cursors. The cursor contains the key of the last processed entity (encoded), and so if you set a cursor to your query before executing it, the Datastore will jump to the entity specified by the key encoded in the cursor, and will start listing entities from that point.
Let's examine your case
Your query filter is Value<2. You iterate over the entities of the query result, and you change (and save) the Value property to 2. Note that Value=2 does not satisfy the filter Value<2.
In the next iteration (next batch) a cursor is present, which you apply properly. So when the Datastore executes the query, it jumps to the last entity processed in the previous iteration and wants to list the entities that come after it. But the entity the cursor points to may no longer satisfy the filter, because the index entry for its new Value 2 will most likely have been updated already. This is non-deterministic behavior: see eventual consistency for more details, which applies here because you did not use an Ancestor query, which would guarantee strongly consistent results. The time.Sleep() delay just increases the probability of this.
So the Datastore sees that the last processed entity does not satisfy the filter. It does not search all the entities again but reports that no more entities match the filter, hence no more entities are updated (and no errors are reported).
Suggestion: don't use a cursor with a query that filters or sorts by the same property you are updating.
By the way:
The part from the Appengine docs you quoted:
> Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values.
This is not what you think. This means: cursors may not work properly on a property which has multiple values AND the same property is either included in an inequality filter or is used to sort the results by.
By the way #2
In the screenshot you are using SDK 1.9.17. The latest SDK version is 1.9.21. You should update it and always use the latest available version.
Alternatives to achieve your goal
1) Don't use cursors
If you have many records, you won't be able to update all your entities in one step (in one loop), but let's say you update 300 entities. If you repeat the query, the already updated entities will not be in the results, because the updated Value=2 does not satisfy the filter Value<2. Just redo the query+update until the query has no results. Since your change is idempotent, it does no harm if the update of an entity's index entry is delayed and the entity is returned by the query more than once. It would be best to delay the execution of the next query to minimize the chance of this (e.g. wait a few seconds before redoing the query).
Pros: Simple. You already have the solution, just exclude the cursor handling part.
Cons: Some entities might get updated multiple times (therefore the change must be idempotent). Also the change performed on entities must be something which will exclude the entity from the next query.
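The query+update loop described above can be sketched in plain Go. This is only a minimal illustration: memStore, queryLess, put, updateAll and newStore are hypothetical names, and the in-memory map stands in for the Datastore so that the sketch runs without the App Engine SDK. Unlike the real Datastore, this stand-in is strongly consistent, so it never returns an already updated item a second time.

```go
package main

import (
	"fmt"
	"sort"
)

// item is a stand-in for a stored entity (hypothetical type).
type item struct {
	Key   string
	Value int
}

// memStore is a hypothetical in-memory substitute for the Datastore.
type memStore struct {
	items map[string]*item
}

// queryLess returns up to limit items with Value < max, ordered by key.
// It plays the role of a filtered query with a limit and no cursor.
func (s *memStore) queryLess(max, limit int) []item {
	keys := make([]string, 0, len(s.items))
	for k, it := range s.items {
		if it.Value < max {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	if len(keys) > limit {
		keys = keys[:limit]
	}
	out := make([]item, 0, len(keys))
	for _, k := range keys {
		out = append(out, *s.items[k])
	}
	return out
}

// put saves an item, overwriting any previous version.
func (s *memStore) put(it item) {
	s.items[it.Key] = &it
}

// updateAll repeats the same filtered query, without a cursor, until it
// returns nothing. Each update moves the item out of the filter range.
func updateAll(s *memStore, newValue, batchSize int) (updated, rounds int) {
	for {
		batch := s.queryLess(newValue, batchSize)
		if len(batch) == 0 {
			return updated, rounds
		}
		for _, it := range batch {
			it.Value = newValue // idempotent: setting it twice is harmless
			s.put(it)
			updated++
		}
		rounds++
	}
}

// newStore creates n items with Value 0, like the /db setup in the question.
func newStore(n int) *memStore {
	s := &memStore{items: make(map[string]*item)}
	for i := 0; i < n; i++ {
		s.put(item{Key: fmt.Sprintf("item-%02d", i), Value: 0})
	}
	return s
}

func main() {
	s := newStore(10)
	updated, rounds := updateAll(s, 2, 5)
	fmt.Printf("updated %d items in %d rounds\n", updated, rounds)
}
```

With the real Datastore, the same loop shape would presumably wrap a limited query and a put call, with a short pause between rounds as suggested above.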
2) Using Task Queue
You could first execute a keys-only query and defer the updates to tasks. You could create tasks and pass, let's say, 100 keys to each; the tasks would load the entities by key and do the update. This would ensure each entity gets updated only once. This solution would have a slightly longer delay because it involves the task queue, but that is not a problem in most cases.
Pros: No duplicated updates (therefore change may be non-idempotent). Works even if the change to be performed would not exclude the entity from the next query (more general).
Cons: Higher complexity. Bigger lag/delay.
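The batching step of this approach might look like the sketch below. It assumes the keys-only query has already produced a slice of keys, represented here as plain strings. chunkKeys and dispatch are hypothetical helpers, and the enqueue callback is a placeholder for whatever actually adds a task to the queue; none of these names come from the App Engine SDK.

```go
package main

import "fmt"

// chunkKeys splits keys into consecutive batches of at most size keys.
func chunkKeys(keys []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(keys); start += size {
		end := start + size
		if end > len(keys) {
			end = len(keys)
		}
		batches = append(batches, keys[start:end])
	}
	return batches
}

// dispatch hands each batch to enqueue exactly once and returns the number
// of batches handed over. enqueue is a hypothetical hook: in a real app it
// would create one task carrying the batch of keys.
func dispatch(keys []string, size int, enqueue func(batch []string) error) (int, error) {
	n := 0
	for _, batch := range chunkKeys(keys, size) {
		if err := enqueue(batch); err != nil {
			return n, err
		}
		n++
	}
	return n, nil
}

func main() {
	// Pretend a keys-only query returned 250 keys.
	keys := make([]string, 250)
	for i := range keys {
		keys[i] = fmt.Sprintf("key-%03d", i)
	}
	n, err := dispatch(keys, 100, func(batch []string) error {
		fmt.Printf("task with %d keys, first key %s\n", len(batch), batch[0])
		return nil
	})
	fmt.Println("tasks created:", n, "error:", err)
}
```

Because every key lands in exactly one batch, each entity is handed to exactly one task, which is what allows the update itself to be non-idempotent, provided a task is not executed more than once.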
3) Using Map-Reduce
You could use the map-reduce framework/utility to do massively parallel processing of many entities. Not sure if it has been implemented in Go.
Pros: Parallel execution, can handle even millions or billions of entities. Much faster when the number of entities is large. Plus the pros listed under 2) Using Task Queue.
Cons: Higher complexity. Might not be available in Go yet.