英文:
Cassandra: ALLOW FILTERING works twice as fast as index on a mapped column
问题
ALLOW FILTERING 是一种性能较差的查询方式,通常不建议在生产环境中使用。但在你的情况下,移除二级索引并添加 ALLOW FILTERING 后查询速度更快,这可能与你的表结构和数据分布有关。在你的表中,tags 列是一个映射类型的列,而针对映射类型列的索引可能会导致性能问题。
你的索引 idx_tags
针对 tags 列的映射进行了索引,但根据你的查询条件,似乎需要对 tags 列进行过滤。这可能导致了索引的效率下降,因此使用 ALLOW FILTERING 来执行查询可能更快,尽管它不是最佳实践。
总之,这种情况可能取决于你的数据分布和查询模式,但仍然建议在可能的情况下避免使用 ALLOW FILTERING,因为它通常会引起性能问题。你可以考虑优化表结构或查询以改善性能。
英文:
I read how ALLOW FILTERING is terrible and shall never ever be used in production. To my surprise, I discovered that in my case a query runs FASTER when I remove a secondary index and add ALLOW FILTERING. How is that possible?
More details: this query involves a secondary index on a map column.
Even more details
My table looks like this (some column names were changed to protect the IP, but that should not matter):
CREATE TABLE IF NOT EXISTS test.data (
channel text,
subdomain text,
deployment_group text,
data_center text,
day date, // we need this to limit the partition size
message_id timeuuid,
tags map<text,text>,
sender_id text,
sequence_number bigint,
sent_at timestamp,
received_at timestamp,
header blob,
payload blob,
PRIMARY KEY ((channel, subdomain, deployment_group, data_center, day), sent_at, message_id)
)
WITH CLUSTERING ORDER BY (sent_at ASC, message_id ASC);
The index I was using is
CREATE INDEX idx_tags ON test.data (ENTRIES (tags));
The table has about 10M records, the time range query has about 50K rows, and additional condition on the tag column selects about 8K rows.
This query runs for 2.5s:
SELECT message_id, sender_id, sequence_number, sent_at, tags FROM test.data
WHERE channel='MyChannel' AND subdomain='test' AND deployment_group='_'
AND data_center='xyz' and day='2023-07-12'
and sent_at >= '2023-07-12 20:00:00+0000' and sent_at <'2023-07-12 21:00:00+0000'
and tags['symbol']='ABC';
If I drop the index and re-run the query with ALLOW FILTERING, it takes 2.5s:
This query runs for 4.4s:
SELECT message_id, sender_id, sequence_number, sent_at, tags FROM test.data
WHERE channel='MyChannel' AND subdomain='test' AND deployment_group='_'
AND data_center='xyz' and day='2023-07-12'
and sent_at >= '2023-07-12 20:00:00+0000' and sent_at <'2023-07-12 21:00:00+0000'
and tags['symbol']='ABC'
ALLOW FILTERING;
How is it possible? Are indexes on mapped columns THAT inefficient? I am flabbergasted.
答案1
得分: 1
In the WHERE clause, you included all variables of your partition key: (WHERE channel='MyChannel' AND subdomain='test' AND deployment_group='_' AND data_center='xyz' and day='2023-07-12'). This means the query will execute on a single partition. ALLOW FILTERING
isn't harmful as long as the partition isn't huge, considering table size and query inequalities.
The secondary index scans all nodes for the index, seeks partitions to scan first, and then performs the task. This is where you're experiencing delays.
英文:
Looking at your query I see that in the WHERE clause you put every single variables of your partition key. (WHERE channel='MyChannel' AND subdomain='test' AND deployment_group=''
AND data_center='xyz' and day='2023-07-12')_
As a consequence, the query will execute on a single partition. Under those circumstances the ALLOW FILTERING
is not actually an harmful thing as long as the partition is not huge. (not the size of the table, nor the inequalities positionned after in the query.
The secondary index will look for the index on every node, it tries to find the list of partitions to scan first and then do then job, this is where you are loosing time.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论