Duplication of vertices in Amazon Neptune

Question

I want to create some logic that does the following in Amazon Neptune using Gremlin:

1. Load a row of data that contains customer_id and postcode columns

2. Check if the postcode value from that row already exists in the database:

A. If it does, then create a new vertex for the row's customer_id value and then create a new edge that makes a connection from the customer_id vertex that has just been created to the pre-existing postcode vertex

B. Else, if it does not, then create a new vertex for the row's customer_id value, create a new vertex for the row's postcode value and then create a new edge that makes a connection from the customer_id vertex that has just been created to the postcode vertex that has just been created

  • The purpose of this is to avoid creating duplicate vertices.
  • I am open to different approaches if you can see flaws in my logic.
  • I have tried a few methods but I've been unable to get a single piece of logic to perform all of the above.
  • I am using Gremlin.

Answer 1

Score: 1

First, on uniqueness: every vertex and edge in a Neptune graph must have a unique ID, so it is good practice to lean on that constraint as much as possible. If you derive IDs deterministically from your data, the same input always maps to the same ID rather than to a fresh one. Deterministic IDs are also great for fast lookups, because a lookup by vertex or edge ID is the fastest operation in Neptune. If you don't supply a value for a vertex or edge ID, Neptune generates one as a UUID.
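
As a rough illustration of deterministic IDs, the Python sketch below derives a vertex ID from a row value. The helper name `vertex_id` and the `<kind>-<value>` format are assumptions made for this example, not something Neptune prescribes; any scheme works as long as the same input always yields the same ID.

```python
def vertex_id(kind, value):
    """Build a deterministic vertex ID such as 'customer-42'.

    `kind` is a hypothetical prefix ('customer', 'postcode', ...) and
    `value` is the raw column value taken from the input row.
    """
    cleaned = str(value).strip()  # ignore stray whitespace around the value
    if not cleaned:
        raise ValueError("cannot build an ID from an empty value")
    return f"{kind}-{cleaned}"


# The same row value always maps to the same ID.
customer_vertex = vertex_id("customer", 42)
postcode_vertex = vertex_id("postcode", " AB1 2CD ")
```

Because the ID depends only on the row's own values, loading the same row twice points at the same vertex instead of a new one.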

Next, consider using a conditional write (upsert) pattern. In Gremlin, you can follow the pattern documented in Practical Gremlin [1].

For your use case, the pattern would look something like this:

g.V().hasLabel('customer').has('customer_id',<id>).
    fold().coalesce(
        unfold(),
        addV('customer').property('customer_id',<id>)
    ).aggregate('c').
    V().hasLabel('postcode').has('postcode',<postcode>).
        fold().coalesce(
            unfold(),
            addV('postcode').property('postcode',<postcode>)
        ).
    addE('hasPostCode').from(select('c').unfold())

Note: The aggregate() step is used above because we want to label the customer vertex and then refer to it after crossing a collapsing barrier step (fold()) later in the query. If we used as() instead, the label would not persist beyond the collapsing barrier step.

With deterministic IDs, this can be simplified. Say we use an ID naming scheme of "customer-id" for customer vertices and "postcode-code" for postcode vertices:

g.V(<customer_id>).
    fold().coalesce(
        unfold(),
        addV('customer').property(id,<customer_id>)
    ).
    V(<postcode_id>).
        fold().coalesce(
            unfold(),
            addV('postcode').property(id,<postcode_id>)
        ).
    addE('hasPostCode').from(V(<customer_id>))
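
If rows are loaded from a script, one option is to render the deterministic-ID query as a string for each row and submit it to Neptune. The following Python sketch does only the rendering. The function names, the ID prefixes, and the minimal quoting are assumptions for illustration; a client that supports parameterized queries would be a safer choice than building strings.

```python
def gremlin_str(value):
    """Quote a value as a single-quoted Gremlin string literal.

    Only backslashes and single quotes are escaped; this is a minimal
    sketch, not a complete sanitizer.
    """
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def upsert_query(customer_id, postcode):
    """Render the get-or-create query for one (customer_id, postcode) row."""
    c = gremlin_str(f"customer-{customer_id}")  # hypothetical ID scheme
    p = gremlin_str(f"postcode-{postcode}")
    return (
        # Fetch the customer vertex by ID, or create it if it is missing.
        f"g.V({c}).fold().coalesce(unfold(), addV('customer').property(id, {c}))."
        # Same get-or-create step for the postcode vertex.
        f"V({p}).fold().coalesce(unfold(), addV('postcode').property(id, {p}))."
        # The traverser is now on the postcode vertex; link the customer to it.
        f"addE('hasPostCode').from(V({c}))"
    )


query = upsert_query("42", "AB1 2CD")
```

The resulting `query` string can then be sent through whichever Gremlin client or endpoint you already use.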

[1] https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#upsert

Posted by huangapple on March 7, 2023 at 20:18:55