CTS Range Query vs SPARQL Query Performance

Question

I can see a difference in the time taken by a CTS range query versus a SPARQL query.

CTS Range Query - took 0.8 ms to return the result. The required field indexes were created to make the field query work.

cts:field-values("productid", (), (), cts:and-query(
              (
                cts:field-value-query("countryCode", "us", ("unstemmed","case-insensitive", "whitespace-insensitive", "punctuation-insensitive", "diacritic-insensitive")),
                cts:field-value-query("status", "published", ("unstemmed","case-insensitive", "whitespace-insensitive", "punctuation-insensitive", "diacritic-insensitive"))
              )
          ))

SPARQL Query - took 18 ms to return the result. A TDE was created to make the SPARQL query work.

## query
SELECT ?productid
FROM <product>
WHERE {
  ?productid <status> <Published>;
             <countryCode> <US>.
}

TDE for product:

<?xml version="1.0" encoding="UTF-8"?>
<template xmlns="http://marklogic.com/xdmp/tde">
	<context>product</context>
	<enabled>true</enabled>
	<collections>
		<collection>product</collection>
	</collections>
	<triples>
		<triple>
			<subject>
				<val>sem:iri(productid)</val>
				<invalid-values>ignore</invalid-values>
			</subject>
			<predicate>
				<val>sem:iri(xs:string("languageCode"))</val>
				<invalid-values>ignore</invalid-values>
			</predicate>
			<object>
				<val>sem:iri(languageCode)</val>
				<invalid-values>ignore</invalid-values>
			</object>
		</triple>
		<triple>
			<subject>
				<val>sem:iri(productid)</val>
				<invalid-values>ignore</invalid-values>
			</subject>
			<predicate>
				<val>sem:iri(xs:string("countryCode"))</val>
				<invalid-values>ignore</invalid-values>
			</predicate>
			<object>
				<val>sem:iri(fn:normalize-space(xs:string(countryCode)))</val>
				<invalid-values>ignore</invalid-values>
			</object>
		</triple>
		<triple>
			<subject>
				<val>sem:iri(productid)</val>
				<invalid-values>ignore</invalid-values>
			</subject>
			<predicate>
				<val>sem:iri(xs:string("status"))</val>
				<invalid-values>ignore</invalid-values>
			</predicate>
			<object>
				<val>sem:iri(fn:normalize-space(xs:string(status)))</val>
				<invalid-values>ignore</invalid-values>
			</object>
		</triple>
		<triple>
			<subject>
				<val>sem:iri(productid)</val>
				<invalid-values>ignore</invalid-values>
			</subject>
			<predicate>
				<val>sem:iri(xs:string("created"))</val>
				<invalid-values>ignore</invalid-values>
			</predicate>
			<object>
				<val>sem:iri(audit/created)</val>
				<invalid-values>ignore</invalid-values>
			</object>
		</triple>
		<triple>
			<subject>
				<val>sem:iri(productid)</val>
				<invalid-values>ignore</invalid-values>
			</subject>
			<predicate>
				<val>sem:iri(xs:string("createdBy"))</val>
				<invalid-values>ignore</invalid-values>
			</predicate>
			<object>
				<val>sem:iri(audit/createdBy)</val>
				<invalid-values>ignore</invalid-values>
			</object>
		</triple>
		<triple>
			<subject>
				<val>sem:iri(productid)</val>
				<invalid-values>ignore</invalid-values>
			</subject>
			<predicate>
				<val>sem:iri(xs:string("updated"))</val>
				<invalid-values>ignore</invalid-values>
			</predicate>
			<object>
				<val>sem:iri(audit/updated)</val>
				<invalid-values>ignore</invalid-values>
			</object>
		</triple>
		<triple>
			<subject>
				<val>sem:iri(productid)</val>
				<invalid-values>ignore</invalid-values>
			</subject>
			<predicate>
				<val>sem:iri(xs:string("updatedBy"))</val>
				<invalid-values>ignore</invalid-values>
			</predicate>
			<object>
				<val>sem:iri(audit/updatedBy)</val>
				<invalid-values>ignore</invalid-values>
			</object>
		</triple>
	</triples>
</template>

Please help me understand why there is a speed/performance difference between these two types of queries.

Any help is appreciated.

Answer 1

Score: 3

Many factors contribute to this, including infrastructure and the tuning of various indexes and caches. I will not attempt to quantify the difference in speed directly, but will instead help you understand the major differences between the two approaches you show.

Under the hood, the two approaches are different implementations.

Range index based query:
In that example, you are using pre-defined range indexes. These are memory mapped. Each value also includes a pointer to the document fragments in which that value occurs (the fragment ID is itself an integer-based lexicon). The first query limits the fragments in scope via your two range queries and then returns the values from the range index (which is already a unique lexicon as well).

In this case, one can think of it as an in-memory intersection of the fragment IDs for (CountryCode=US ∩ Status=Published), followed by an intersection of those IDs with the in-memory index of productId.

Everything happens in memory, and no deduplication is needed. The cost is rigid, pre-configured indexes and dedicated memory.
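
The intersection described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not MarkLogic code: the dictionaries, the fragment IDs, the product IDs, and the `field_values` helper are all hypothetical stand-ins for the real index structures, which are memory-mapped files rather than Python sets.

```python
# Toy model of resolving the cts:field-values call from the question.
# Every structure here is a hypothetical stand-in, not MarkLogic internals.

# Index for each constraint: value -> set of fragment IDs containing it.
country_index = {"us": {1, 2, 3, 5}, "de": {4, 6}}
status_index = {"published": {2, 3, 4}, "draft": {1, 5, 6}}

# Lexicon for the returned field: value -> set of fragment IDs.
productid_lexicon = {
    "p-100": {1},
    "p-200": {2},
    "p-300": {3},
    "p-400": {4},
}


def field_values(lexicon, constraints):
    """Return the lexicon values found in fragments that satisfy every constraint."""
    # Intersect the fragment-ID sets of all constraints.
    matching = set.intersection(*constraints)
    # Keep each lexicon value that occurs in at least one matching fragment.
    return sorted(value for value, fragments in lexicon.items()
                  if fragments & matching)


result = field_values(
    productid_lexicon,
    [country_index["us"], status_index["published"]],
)
print(result)  # ['p-200', 'p-300']
```

Because the lexicon keys are already unique, the result needs no deduplication step, which matches the point made above.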

SPARQL Query:
In this case, you are traversing a graph of data. The query resolution is completely different: deduplication may happen depending on your data, and the caching mechanism and memory needs are different.
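
For contrast, the two triple patterns in the SPARQL query can be modeled as a join on the shared `?productid` variable. Again, this is only a sketch in plain Python under stated assumptions: the triple list and the `subjects_matching` helper are hypothetical, and a real triple store relies on indexed, cached structures and a query optimizer rather than a list scan.

```python
# Toy model of evaluating the two triple patterns from the SPARQL query.
# The data and helper are hypothetical; this is not how MarkLogic stores triples.

triples = [
    ("p-100", "countryCode", "US"),
    ("p-100", "status", "Draft"),
    ("p-200", "countryCode", "US"),
    ("p-200", "status", "Published"),
    ("p-200", "status", "Published"),  # assumed duplicate from a second source
    ("p-300", "countryCode", "DE"),
    ("p-300", "status", "Published"),
]


def subjects_matching(predicate, obj):
    """Return the subjects of all triples with this predicate and object (may repeat)."""
    return [s for s, p, o in triples if p == predicate and o == obj]


published = subjects_matching("status", "Published")
in_us = set(subjects_matching("countryCode", "US"))

# Join the two patterns on the shared ?productid variable.
joined = [s for s in published if s in in_us]  # duplicates survive the join

# Deduplicate the solutions as a separate step.
deduplicated = sorted(set(joined))
print(joined, deduplicated)  # ['p-200', 'p-200'] ['p-200']
```

Even in this small model there are more steps than in the index intersection: pattern matching, a join, and an optional deduplication pass, each of which can be affected by the shape of the data.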

Range Indexes have no moving parts. However, SPARQL queries have more items that can be tuned.

Various settings are explained here: https://docs.marklogic.com/guide/semantics/indexes

Also, if you are testing this in the SPARQL tab in Query Console, you are relying on option choices that are made for you. The optimizer setting and other options are described here: https://docs.marklogic.com/sem:sparql

Posted by huangapple on July 10, 2023, 18:08:14
Link: https://go.coder-hub.com/76652712.html