Cassandra:ALLOW FILTERING在映射列上的速度是索引的两倍。

huangapple go评论62阅读模式
英文:

Cassandra: ALLOW FILTERING works twice as fast as index on a mapped column

问题

ALLOW FILTERING 是一种性能较差的查询方式,通常不建议在生产环境中使用。但在你的情况下,移除二级索引并添加 ALLOW FILTERING 后查询速度更快,这可能与你的表结构和数据分布有关。在你的表中,tags 列是一个映射类型的列,而针对映射类型列的索引可能会导致性能问题。

你的索引 idx_tags 针对 tags 列的映射进行了索引,但根据你的查询条件,似乎需要对 tags 列进行过滤。这可能导致了索引的效率下降,因此使用 ALLOW FILTERING 来执行查询可能更快,尽管它不是最佳实践。

总之,这种情况可能取决于你的数据分布和查询模式,但仍然建议在可能的情况下避免使用 ALLOW FILTERING,因为它通常会引起性能问题。你可以考虑优化表结构或查询以改善性能。

英文:

I read how ALLOW FILTERING is terrible and shall never ever be used in production. To my surprise, I discovered that in my case a query runs FASTER when I remove a secondary index and add ALLOW FILTERING. How is that possible?

More details: this query involves a secondary index on a map column.

Even more details

My table looks like this (some column names were changed to protect the IP, but that should not matter):

CREATE TABLE IF NOT EXISTS test.data (
  channel text,
  subdomain text,
  deployment_group text,
  data_center text,
  day date, // we need this to limit the partition size
  message_id timeuuid,
  tags map<text,text>,
  sender_id text,
  sequence_number bigint,
  sent_at timestamp,
  received_at timestamp,
  header blob,
  payload blob,
  PRIMARY KEY ((channel, subdomain, deployment_group, data_center, day), sent_at, message_id)
)
WITH CLUSTERING ORDER BY (sent_at ASC, message_id ASC);

The index I was using is
CREATE INDEX idx_tags ON test.data (ENTRIES (tags));

The table has about 10M records, the time range query has about 50K rows, and additional condition on the tag column selects about 8K rows.

This query runs for 2.5s:

SELECT message_id, sender_id, sequence_number, sent_at, tags FROM test.data 
WHERE channel='MyChannel' AND subdomain='test' AND deployment_group='_' 
AND data_center='xyz' and day='2023-07-12' 
and sent_at >= '2023-07-12 20:00:00+0000' and sent_at <'2023-07-12 21:00:00+0000' 
and tags['symbol']='ABC';

If I drop the index and re-run the query with ALLOW FILTERING, it takes 2.5s:

This query runs for 4.4s:

SELECT message_id, sender_id, sequence_number, sent_at, tags FROM test.data 
WHERE channel='MyChannel' AND subdomain='test' AND deployment_group='_' 
AND data_center='xyz' and day='2023-07-12' 
and sent_at >= '2023-07-12 20:00:00+0000' and sent_at <'2023-07-12 21:00:00+0000' 
and tags['symbol']='ABC' 
ALLOW FILTERING;

How is it possible? Are indexes on mapped columns THAT inefficient? I am flabbergasted.

答案1

得分: 1

In the WHERE clause, you included all variables of your partition key: (WHERE channel='MyChannel' AND subdomain='test' AND deployment_group='_' AND data_center='xyz' and day='2023-07-12'). This means the query will execute on a single partition. ALLOW FILTERING isn't harmful as long as the partition isn't huge, considering table size and query inequalities.

The secondary index scans all nodes for the index, seeks partitions to scan first, and then performs the task. This is where you're experiencing delays.

英文:

Looking at your query I see that in the WHERE clause you put every single variables of your partition key. (WHERE channel='MyChannel' AND subdomain='test' AND deployment_group=''
AND data_center='xyz' and day='2023-07-12')_

As a consequence, the query will execute on a single partition. Under those circumstances the ALLOW FILTERING is not actually an harmful thing as long as the partition is not huge. (not the size of the table, nor the inequalities positionned after in the query.

The secondary index will look for the index on every node, it tries to find the list of partitions to scan first and then do then job, this is where you are loosing time.

huangapple
  • 本文由 发表于 2023年7月13日 12:33:50
  • 转载请务必保留本文链接:https://go.coder-hub.com/76675960.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定