我的Neo4J文本索引为什么对查询性能没有影响?

huangapple go评论58阅读模式
英文:

Why does my Neo4J Text Index have no Impact on query performance?

问题

我已经创建了一个简单的neo4j图,并正在测试对属性进行索引的影响,但我不明白为什么索引,似乎被使用和运行正常,却没有提高只读查询性能。

图和索引创建详细信息:

有两种节点标签(src_node和tag),每个节点都有一个单一的文本属性,分别是src_prop和tag_str。

我在“tag”节点的tag_str属性上创建了一个简单的索引。

"CREATE TEXT INDEX `tag_idx` FOR (n:`tag`) ON (n.`tag_str`)

我创建了1000个tag节点,每个节点都有一个随机生成的8个字符长的tag_str值。

例如CREATE (:tag {tag_str:"A1B2C3D4"})

我创建了100000个源节点,每个节点都有一个长度为32的随机字符串str_prop,每个节点都与一个tag关联,边缘标记为“contains”。

MATCH (tgt:tag WHERE tgt.tag_str= "A1B2C3D4") CREATE (:src_node{src_prop:<32_char_string>})-[:contains]->(tgt)

这一切都正常工作,我可以确认链接已经创建。

查找查询

我运行一个简单的查询,以查找与特定标签关联的节点,作为示例

MATCH (tgt:tag WHERE tgt.tag_str = "asEQqNLF") - [] - (retnode) return retnode

我可以查看索引统计信息,并验证在运行此查询时是否正在读取它。

当我比较索引和非索引运行时,发现索引的性能对于查找查询(而不是src_node创建查询)要么相同,要么稍微差一点。我不明白为什么,所以希望有人能提供一些见解。

例如:我是否创建了不正确类型的索引?我是否在查询的其他部分,如边缘遍历中遇到了瓶颈?是否有一些我错过的配置?

澄清更新

定时测试运行超过100,000次,包括每个事务的时间以及具有和没有索引的端到端时间。我还扩展了图的大小,并看到类似的结果,但将继续根据下面的反馈进行调查。

已修复此处帖子中存在但*实际测试中不存在的小错字。

英文:

I have created a simple neo4j graph, and am testing the impact of indexing along a property, and am struggling to understand why the index, which appears to be used and functioning, is not improving read-only query performance.

Graph And Index Creation Details:

There are 2 types of node labels (src_node, and tag) and each node has a single text property, respectively src_prop and tag_str.

I create a simple index over the tag_str property on the "tag" nodes.

&quot;CREATE TEXT INDEX `tag_idx` FOR (n:`tag`) ON (n.`tag_str`)

I create 1000 tag nodes, each with a randomly generated tag_str value that is 8 characters long.

e.g. CREATE (:tag {tag_str:&quot;A1B2C3D4&quot;})

I create 100000 source nodes, each with a str_prop that is a random string of length 32,
each of these are linked to a single tag, with an edge labeled "contains"

MATCH (tgt:tag WHERE tgt.tag_str= &quot;A1B2C3D4&quot;) CREATE (:src_node{src_prop:&lt;32_char_string&gt;})-[:contains]-&gt;(tgt)

This all works fine, I can confirm that the links are created.

Find Query

I run a simple query to find nodes linked to a particular tag, as an example

MATCH (tgt:tag WHERE tgt.tag_str = &quot;asEQqNLF&quot;) - [] - (retnode) return retnode

I can look at the index stats, and verify that it is being read when this query is run.

When I compare the indexed vs non-indexed runs, the indexed performance for the find queries (not the src_node creation queries) is either identical or slightly worse. I am struggling to understand why, so would appreciate any insights folks might have.

e.g. am I creating an incorrect type of index? Am I bottlenecked in another part of the query such as edge traversal? Is there some configuration I have missed?

CLARIFICATION UPDATE

The timing test is run upwards of 100k times, includes per-transaction time as well as end-to-end timings both with and without the index. I have scaled the size of the graph as well and see similar results, but will continue investigation based on feedback below.

Fixed a minor typo that was existent in post here but not in the actual tests I am referencing.

答案1

得分: 1

  1. 1000个索引节点可能对性能产生不明显的影响,特别是如果所有节点都被缓存在内存中(因为扫描如此小数量的没有索引的节点本身会非常快)。

  2. 通过仅执行每个操作一次来比较所需的时间,无法得出任何有效的结论。您需要多次执行这些操作。

  3. 除了使用索引(或不使用索引)之外,您的测试还要遍历关系并返回另一端的节点(节点)。执行此额外的工作所需的时间会掩盖(甚至可能淹没)获取起始节点所需的时间。

通过多次执行您希望比较的操作来获得更准确的比较。例如(其中$tag_str_list是大约500个有效的tag_str值的参数):

MATCH (t:tag)
WHERE t.tag_str IN $tag_str_list
RETURN COUNT(*) AS cnt
英文:

Some thoughts:

  1. 1000 indexed nodes may be too small a number for a significant difference in performance, especially if all the nodes are cached in memory (since scanning such a small number of nodes without an index would be very fast anyway).

  2. You cannot arrive at any valid conclusions by comparing the time it takes to execute each operation just once. You need to execute the operations a large number of times.

  3. In addition to using the index (or not), your test is also traversing relationship(s) and returning the node(s) at the other end. The amount of time to do that additional work obscures (and may swamp) the time it takes to get the starting node.

A more accurate comparison would be achieved by performing just the operation(s) you wish to compare many times. For example (where $tag_str_list is a parameter of maybe 500 valid tag_str values):

MATCH (t:tag)
WHERE t.tag_str IN $tag_str_list
RETURN COUNT(*) AS cnt

huangapple
  • 本文由 发表于 2023年6月29日 06:47:21
  • 转载请务必保留本文链接:https://go.coder-hub.com/76577135.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定