Duplication of vertices in Amazon Neptune
Question
I want to create some logic that does the following in Amazon Neptune using Gremlin:
1. Load a row of data that contains customer_id and postcode columns
2. Check if the postcode value from that row already exists in the database:
A. If it does, then create a new vertex for the row's customer_id value and then create a new edge that makes a connection from the customer_id vertex that has just been created to the pre-existing postcode vertex
B. Else, if it does not, then create a new vertex for the row's customer_id value, create a new vertex for the row's postcode value and then create a new edge that makes a connection from the customer_id vertex that has just been created to the postcode vertex that has just been created
- The purpose of this is to avoid creating duplicate vertices.
- I am open to different approaches if you can see flaws in my logic.
- I have tried a few methods but I've been unable to get a single piece of logic to perform all of the above.
- I am using Gremlin.
Answer 1
Score: 1
First, if you want to ensure uniqueness, each vertex and edge in a graph in Neptune must have a unique ID. So it is good practice to leverage that concept to the fullest. Deterministic IDs are also great for fast lookups, as a lookup by a vertex/edge ID is the fastest operation in Neptune. If you don't supply a value for the vertex/edge IDs, then Neptune creates an ID using a UUID.
After that you'll want to consider using a conditional write pattern. In Gremlin, you can follow the pattern documented in Practical Gremlin [1].
So the pattern, for your use case, would follow something like:
    g.V().hasLabel('customer').has('customer_id',<id>).
      fold().coalesce(
        unfold(),
        addV('customer').property('customer_id',<id>)
      ).aggregate('c').
      V().hasLabel('postcode').has('postcode',<postcode>).
      fold().coalesce(
        unfold(),
        addV('postcode').property('postcode',<postcode>)
      ).
      addE('hasPostCode').from(select('c').unfold())
Note: The aggregate() step is used above because we want to label something in the query but then need to cross a collapsing barrier step (fold()) later in the query. If we were to use as(), the label would not persist beyond the collapsing barrier step.
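To make the intended get-or-create behaviour concrete without a Neptune cluster, here is a minimal in-memory sketch in Python. The `TinyGraph` class and its methods are hypothetical stand-ins written for illustration, not part of Gremlin or Neptune. The sketch only mirrors the idea behind fold().coalesce(unfold(), addV(...)): reuse a vertex if it exists, otherwise create it.

```python
class TinyGraph:
    """Hypothetical in-memory stand-in for a graph; not a Neptune or Gremlin API."""

    def __init__(self):
        self.vertices = {}  # (label, key value) -> vertex properties
        self.edges = []     # (from vertex key, edge label, to vertex key)

    def get_or_create_vertex(self, label, key, value):
        # Same idea as fold().coalesce(unfold(), addV(...)):
        # return the existing vertex, or add it if it is missing.
        vertex_key = (label, value)
        if vertex_key not in self.vertices:
            self.vertices[vertex_key] = {"label": label, key: value}
        return vertex_key

    def load_row(self, customer_id, postcode):
        customer = self.get_or_create_vertex("customer", "customer_id", customer_id)
        code = self.get_or_create_vertex("postcode", "postcode", postcode)
        # Edge runs from the customer vertex to the postcode vertex.
        self.edges.append((customer, "hasPostCode", code))


graph = TinyGraph()
graph.load_row("c1", "AB1 2CD")
graph.load_row("c2", "AB1 2CD")  # same postcode: the vertex is reused

postcode_count = sum(1 for label, _ in graph.vertices if label == "postcode")
print(postcode_count)    # 1
print(len(graph.edges))  # 2
```

Two rows that share a postcode produce two customer vertices, one postcode vertex, and two edges, which is the outcome the question asks for.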
If using deterministic IDs, this could be simplified. Say we use an ID nomenclature of "customer-id" for customer vertices and "postcode-code" for postcode vertices:
    g.V(<customer_id>).
      fold().coalesce(
        unfold(),
        addV('customer').property(id,<customer_id>)
      ).
      V(<postcode_id>).
      fold().coalesce(
        unfold(),
        addV('postcode').property(id,<postcode_id>)
      ).
      addE('hasPostCode').from(V(<customer_id>))
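One way to derive such IDs from each input row is sketched below in Python. The `customer-` and `postcode-` prefixes follow the nomenclature described above; the function names and the postcode normalisation (removing whitespace and upper-casing) are assumptions made for illustration, so adjust them to your data.

```python
def customer_vertex_id(customer_id):
    """Build a deterministic vertex ID for a customer, e.g. 'customer-42'."""
    return f"customer-{str(customer_id).strip()}"


def postcode_vertex_id(postcode):
    """Build a deterministic vertex ID for a postcode.

    Assumption: whitespace is removed and the code is upper-cased so that
    'sw1a 1aa' and 'SW1A1AA' map to the same vertex ID.
    """
    normalised = "".join(str(postcode).split()).upper()
    return f"postcode-{normalised}"


row = {"customer_id": 42, "postcode": "sw1a 1aa"}
print(customer_vertex_id(row["customer_id"]))  # customer-42
print(postcode_vertex_id(row["postcode"]))     # postcode-SW1A1AA
```

Because the same input always yields the same ID, the resulting strings can be substituted for the <customer_id> and <postcode_id> placeholders in the query above.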
[1] https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#upsert