Duplication of vertices in Amazon Neptune
Question
I want to create some logic that does the following in Amazon Neptune using Gremlin:
1. Load a row of data that contains customer_id and postcode columns
2. Check if the postcode value from that row already exists in the database:
A. If it does, then create a new vertex for the row's customer_id value and then create a new edge that makes a connection from the customer_id vertex that has just been created to the pre-existing postcode vertex
B. Else, if it does not, then create a new vertex for the row's customer_id value, create a new vertex for the row's postcode value and then create a new edge that makes a connection from the customer_id vertex that has just been created to the postcode vertex that has just been created
- The purpose of this is to avoid creating duplicate vertices.
- I am open to different approaches if you can see flaws in my logic.
- I have tried a few methods but I've been unable to get a single piece of logic to perform all of the above.
- I am using Gremlin.
 
Answer 1
Score: 1
First, on ensuring uniqueness: every vertex and edge in a Neptune graph must have a unique ID, so it is good practice to leverage that concept to the fullest. Deterministic IDs are also great for fast lookups, as a lookup by vertex or edge ID is the fastest operation in Neptune. If you don't supply an ID for a vertex or edge, Neptune generates one as a UUID.
After that you'll want to consider using a conditional write pattern. In Gremlin, you can follow the pattern documented in Practical Gremlin [1].
So the pattern, for your use case, would follow something like:
g.V().hasLabel('customer').has('customer_id',<id>).
    fold().coalesce(
        unfold(),
        addV('customer').property('customer_id',<id>)
    ).aggregate('c').
    V().hasLabel('postcode').has('postcode',<postcode>).
        fold().coalesce(
            unfold(),
            addV('postcode').property('postcode',<postcode>)
        ).
    addE('hasPostCode').from(select('c').unfold())
Note: The aggregate() step is used above because we want to label something in the query but then need to cross a collapsing barrier step (fold()) later on in the query. If we were to use as(), the label would not persist beyond the collapsing barrier step.
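To make the get-or-create behaviour of fold().coalesce(unfold(), addV(...)) concrete, here is a minimal sketch in plain Python. It is only an in-memory analogy, not Neptune or Gremlin code: the TinyGraph class, its method names, and the sample values are all hypothetical.

```python
# Hypothetical in-memory stand-in for the graph; not Neptune or Gremlin code.
class TinyGraph:
    def __init__(self):
        self.vertices = {}  # (label, key value) -> vertex properties
        self.edges = []     # (from vertex key, edge label, to vertex key)

    def get_or_create(self, label, prop, value):
        """Mirror fold().coalesce(unfold(), addV(...)): reuse a match, else add."""
        key = (label, value)
        if key not in self.vertices:
            self.vertices[key] = {"label": label, prop: value}
        return key

    def load_row(self, customer_id, postcode):
        customer = self.get_or_create("customer", "customer_id", customer_id)
        target = self.get_or_create("postcode", "postcode", postcode)
        # Like addE() in the query, this adds an edge on every call.
        self.edges.append((customer, "hasPostCode", target))


graph = TinyGraph()
graph.load_row("c1", "AB1 2CD")
graph.load_row("c2", "AB1 2CD")  # same postcode: the existing vertex is reused
print(len(graph.vertices), len(graph.edges))  # prints: 3 2
```

Two rows that share a postcode produce two customer vertices, one postcode vertex, and two edges, which is the outcome the question asks for.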
If using deterministic IDs, this could be simplified. Say we use an ID nomenclature of "customer-id" for customer vertices and "postcode-code" for postcode vertices:
g.V(<customer_id>).
    fold().coalesce(
        unfold(),
        addV('customer').property(id,<customer_id>)
    ).
    V(<postcode_id>).
        fold().coalesce(
            unfold(),
            addV('postcode').property(id,<postcode_id>)
        ).
    addE('hasPostCode').from(V(<customer_id>))
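The deterministic IDs themselves can be derived from each row before the query is sent. Below is a minimal Python sketch, assuming the "customer-id" and "postcode-code" nomenclature described above. The function names are hypothetical, and the postcode normalisation (dropping whitespace and upper-casing) is an assumption added so that differently typed copies of one postcode map to the same ID.

```python
def customer_vertex_id(customer_id) -> str:
    """Build the vertex ID for a customer, e.g. 'customer-42'."""
    return f"customer-{str(customer_id).strip()}"


def postcode_vertex_id(postcode: str) -> str:
    """Build the vertex ID for a postcode, e.g. 'postcode-AB12CD'."""
    # Assumption: whitespace and letter case are not significant in a postcode.
    normalised = "".join(postcode.split()).upper()
    return f"postcode-{normalised}"


row = {"customer_id": 42, "postcode": "ab1 2cd"}
print(customer_vertex_id(row["customer_id"]))  # prints: customer-42
print(postcode_vertex_id(row["postcode"]))     # prints: postcode-AB12CD
```

The resulting strings would take the place of <customer_id> and <postcode_id> in the query above.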
[1] https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#upsert