Gremlin, linking an edge to a vertex via property
Question
In a graph database, I have graphs like:
v1: Protein{prefName: 'QP1'} 
  -- r1: part_of{evidence: 'ns:testdb'} 
  --> v2: Protcmplx{prefName: 'P12 Complex'}
ev: EvidenceType{iri: 'ns:testdb', label: 'Test Database'}
I'd like to write a Gremlin query to fetch instances of the part_of relationship and return v1 and v2's prefName, along with the evidence's label. So far I've tried this:
g.V().hasLabel( containing('Protein') ).as('p')
  .outE().hasLabel( 'is_part_of' ).as('pr')
  .inV().hasLabel( containing('Protcmplx') ).as('cpx')
.V().hasLabel( containing('EvidenceType') ).as('ev')
  .has( 'iri', eq( select('pr').by('evidence') ) )
.select( 'p', 'cpx', 'ev', 'pr' )
  .by('prefName')
  .by('prefName')
  .by('label')
  .by('evidence')
.limit(100)
But it takes a long time on a few thousand nodes and edges, and eventually it returns nothing. I'm sure the values are there, and I think the problem is with has( 'iri', ... ), but I can't figure out how to match an edge property against a property of an unconnected vertex.
The graph is modelled this way because the LPG model doesn't allow hyper-edges (edges linking more than two vertices).
Answer 1
Score: 1
The issue is with the ArcadeDB query optimizer and the containing() predicate. If you remove containing() and use the full label names instead, the query uses the index and should return in <10ms:
evLabels = [:]
g.V().hasLabel ( 'Concept:Protcmplx:Resource' ).as ( 'cpx' )
// Trying to put the limit early-on
.inE().hasLabel ( 'is_part_of' ).limit ( 100 ).as ( 'pr' )
.outV ().hasLabel ( 'Concept:Protein:Resource' ).as ( 'p' )
.select ( 'p', 'cpx', 'pr' )
.by ( 'prefName' )
.by ( 'prefName' )
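.by ( map{
  // 'it' is the traverser holding the is_part_of edge
  pr = it.get()
  evIri = pr.values ( 'evidence' ).next ()
  lbl = evLabels [ evIri ]
  if ( lbl != null ) return lbl  // already cached
  // tryNext() returns an empty Optional instead of throwing when no EvidenceType matches
  lbl = g.V().hasLabel ( 'EvidenceType:Resource' )
    .has ( 'iri', evIri )
    .values ( 'label' ).tryNext ().orElse ( '' )
  evLabels [ evIri ] = lbl
  return lbl
})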
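One way to check the index claim on your own data is to profile the same label filter written both ways. This is only a sketch: it assumes g is already bound to the ArcadeDB graph, and the step names in the output depend on the provider.

// Exact label match
g.V().hasLabel ( 'Concept:Protcmplx:Resource' ).count ().profile ()
// Substring match on the label
g.V().hasLabel ( containing ( 'Protcmplx' ) ).count ().profile ()

If the second profile starts with a GraphStep(vertex,[]) that counts every vertex, followed by a separate HasStep, the filter is being applied after a full scan rather than through an index.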
Answer 2
Score: 0
I've found a way using where() and by(), but it is quite slow (11 s to get 100 tuples from a few thousand nodes and edges):
g.V().hasLabel ( containing ( 'Protcmplx' ) ).as ( 'cpx' )
  .inE().hasLabel ( 'is_part_of' ).limit ( 10 ).as ( 'pr' )
  .outV ().hasLabel ( containing ( 'Protein' ) ).as ( 'p' )  
.V().hasLabel ( containing ( 'EvidenceType' ) ).as ( 'ev' )
    .where ( 'ev', eq ( 'pr' ) ).by ( 'iri' ).by ( 'evidence' ) 
.select ( 'p', 'cpx', 'ev' )
.by ( 'prefName' )
.by ( 'prefName' )
.by ( 'label' )
Any help with optimisation would be welcome!
EDIT: following a suggestion from the comments (thanks!), I've rewritten the solution a bit (it's still slow) and used .profile() at the end, obtaining this:
Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
GraphStep(vertex,[])                                              123591      123591         507.179     9.09
HasStep([~label.containing(Protcmplx)])@[cpx]                         10          10          34.313     0.61
VertexStep(IN,[is_part_of],edge)@[pr]                                 13          13           5.089     0.09
RangeGlobalStep(0,10)                                                 10          10           0.094     0.00
EdgeVertexStep(OUT)                                                   10          10           1.618     0.03
HasStep([~label.containing(Protein)])@[p]                             10          10           0.065     0.00
GraphStep(vertex,[])                                             1738360     1738360        4574.578    81.99
HasStep([~label.containing(EvidenceType)])@[ev]                      510         510         447.546     8.02
WherePredicateStep(ev,eq(pr),[value(iri), value...                    10          10           6.747     0.12
NoOpBarrierStep(2500)                                                 10          10           1.444     0.03
SelectStep(last,[p, cpx, ev],[value(prefName), ...                    10          10           0.154     0.00
NoOpBarrierStep(2500)                                                 10           8           0.785     0.01
>TOTAL                     -           -        5579.617        -
So the problem seems to be that the second V() picks up all the vertices before the filter from the earlier part of the traversal (the where()) can be applied. However, I can't find a way to avoid this. Does Gremlin have subqueries?
EDIT/2: inspired by the suggestion in the comments to use two separate queries (thanks!), I've tried this:
evLabels = [:]
g.V().hasLabel ( containing ( 'Protcmplx' ) ).as ( 'cpx' )
  // Trying to put the limit early-on
  .inE().hasLabel ( 'is_part_of' ).limit ( 100 ).as ( 'pr' )
  .outV ().hasLabel ( containing ( 'Protein' ) ).as ( 'p' )
.select ( 'p', 'cpx', 'pr' )
  .by ( 'prefName' )
  .by ( 'prefName' )
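  .by ( map{
    // 'it' is the traverser holding the is_part_of edge
    pr = it.get()
    evIri = pr.values ( 'evidence' ).next ()
    lbl = evLabels [ evIri ]
    if ( lbl != null ) return lbl  // already cached
    // tryNext() returns an empty Optional instead of throwing when no EvidenceType matches
    lbl = g.V().hasLabel ( containing ( 'EvidenceType' ) )
             .has ( 'iri', evIri )
             .values ( 'label' ).tryNext ().orElse ( '' )
    evLabels [ evIri ] = lbl
    return lbl
  })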
This avoids a full Cartesian-product join by accumulating sub-query results in a map. It is much faster than the original query (<1 s for 100 edges), but not very easy to read; I'm sure there is a better way to write it.
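For comparison, the same two-query idea can probably be written without a lambda: load the iri-to-label map once, run the edge traversal on its own, and join the two in Groovy. The sketch below is not tested against ArcadeDB. It uses an in-memory TinkerGraph with the sample data from the question as a stand-in, takes the full label names from Answer 1, and assumes every EvidenceType vertex has both iri and label.

// Gremlin Console (Groovy); TinkerGraph is a stand-in for the real ArcadeDB graph
graph = TinkerGraph.open ()
g = graph.traversal ()

// Sample data mirroring the question
g.addV ( 'Concept:Protein:Resource' ).property ( 'prefName', 'QP1' ).as ( 'p' ).
  addV ( 'Concept:Protcmplx:Resource' ).property ( 'prefName', 'P12 Complex' ).as ( 'cpx' ).
  addV ( 'EvidenceType:Resource' ).property ( 'iri', 'ns:testdb' ).property ( 'label', 'Test Database' ).
  addE ( 'is_part_of' ).from ( 'p' ).to ( 'cpx' ).property ( 'evidence', 'ns:testdb' ).
  iterate ()

// Query 1: a single pass over the EvidenceType vertices, collected into iri -> label
evLabels = g.V().hasLabel ( 'EvidenceType:Resource' ).
  project ( 'iri', 'label' ).by ( 'iri' ).by ( 'label' ).
  toList ().
  collectEntries { [ (it.iri): it.label ] }

// Query 2: the edge traversal, returning the raw evidence IRI instead of the label
rows = g.V().hasLabel ( 'Concept:Protcmplx:Resource' ).as ( 'cpx' ).
  inE ( 'is_part_of' ).limit ( 100 ).as ( 'pr' ).
  outV ().hasLabel ( 'Concept:Protein:Resource' ).as ( 'p' ).
  select ( 'p', 'cpx', 'pr' ).by ( 'prefName' ).by ( 'prefName' ).by ( 'evidence' ).
  toList ()

// Client-side join: replace each IRI with its label, or '' when the IRI is unknown
result = rows.collect { [ p: it.p, cpx: it.cpx, ev: evLabels.getOrDefault ( it.pr, '' ) ] }

This trades one extra round trip for a traversal with no closures, which matters if the server disables lambdas. It only makes sense while the set of EvidenceType vertices is small enough to hold in memory (510 in the profile above).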