问题

ALLOW FILTERING 是一种性能较差的查询方式，通常不建议在生产环境中使用。但在你的情况下，移除二级索引并添加 ALLOW FILTERING 后查询速度更快，这可能与你的表结构和数据分布有关。在你的表中，tags 列是一个映射类型的列，而针对映射类型列的索引可能会导致性能问题。

你的索引 idx_tags 针对 tags 列的映射进行了索引，但根据你的查询条件，似乎需要对 tags 列进行过滤。这可能导致了索引的效率下降，因此使用 ALLOW FILTERING 来执行查询可能更快，尽管它不是最佳实践。

总之，这种情况可能取决于你的数据分布和查询模式，但仍然建议在可能的情况下避免使用 ALLOW FILTERING，因为它通常会引起性能问题。你可以考虑优化表结构或查询以改善性能。

英文:

I read how ALLOW FILTERING is terrible and shall never ever be used in production. To my surprise, I discovered that in my case a query runs FASTER when I remove a secondary index and add ALLOW FILTERING. How is that possible?

More details: this query involves a secondary index on a map column.

Even more details

My table looks like this (some column names were changed to protect the IP, but that should not matter):

CREATE TABLE IF NOT EXISTS test.data (
  channel text,
  subdomain text,
  deployment_group text,
  data_center text,
  day date, // we need this to limit the partition size
  message_id timeuuid,
  tags map&lt;text,text&gt;,
  sender_id text,
  sequence_number bigint,
  sent_at timestamp,
  received_at timestamp,
  header blob,
  payload blob,
  PRIMARY KEY ((channel, subdomain, deployment_group, data_center, day), sent_at, message_id)
)
WITH CLUSTERING ORDER BY (sent_at ASC, message_id ASC);

The index I was using is
CREATE INDEX idx_tags ON test.data (ENTRIES (tags));

The table has about 10M records, the time range query has about 50K rows, and additional condition on the tag column selects about 8K rows.

This query runs for 2.5s:

SELECT message_id, sender_id, sequence_number, sent_at, tags FROM test.data 
WHERE channel=&#39;MyChannel&#39; AND subdomain=&#39;test&#39; AND deployment_group=&#39;_&#39; 
AND data_center=&#39;xyz&#39; and day=&#39;2023-07-12&#39; 
and sent_at &gt;= &#39;2023-07-12 20:00:00+0000&#39; and sent_at &lt;&#39;2023-07-12 21:00:00+0000&#39; 
and tags[&#39;symbol&#39;]=&#39;ABC&#39;;

If I drop the index and re-run the query with ALLOW FILTERING, it takes 2.5s:

This query runs for 4.4s:

SELECT message_id, sender_id, sequence_number, sent_at, tags FROM test.data 
WHERE channel=&#39;MyChannel&#39; AND subdomain=&#39;test&#39; AND deployment_group=&#39;_&#39; 
AND data_center=&#39;xyz&#39; and day=&#39;2023-07-12&#39; 
and sent_at &gt;= &#39;2023-07-12 20:00:00+0000&#39; and sent_at &lt;&#39;2023-07-12 21:00:00+0000&#39; 
and tags[&#39;symbol&#39;]=&#39;ABC&#39; 
ALLOW FILTERING;

How is it possible? Are indexes on mapped columns THAT inefficient? I am flabbergasted.

答案1

得分: 1

In the WHERE clause, you included all variables of your partition key: (WHERE channel='MyChannel' AND subdomain='test' AND deployment_group='_' AND data_center='xyz' and day='2023-07-12'). This means the query will execute on a single partition. ALLOW FILTERING isn't harmful as long as the partition isn't huge, considering table size and query inequalities.

The secondary index scans all nodes for the index, seeks partitions to scan first, and then performs the task. This is where you're experiencing delays.

英文:

Looking at your query I see that in the WHERE clause you put every single variables of your partition key. (WHERE channel='MyChannel' AND subdomain='test' AND deployment_group=''
AND data_center='xyz' and day='2023-07-12')_

As a consequence, the query will execute on a single partition. Under those circumstances the ALLOW FILTERING is not actually an harmful thing as long as the partition is not huge. (not the size of the table, nor the inequalities positionned after in the query.

The secondary index will look for the index on every node, it tries to find the list of partitions to scan first and then do then job, this is where you are loosing time.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Cassandra：ALLOW FILTERING在映射列上的速度是索引的两倍。

问题

答案1

可以Java Cassandra 4.x映射器将枚举转换为整数吗？

在C#后端应用程序中是否可以有两个Cassandra会话？

可以立即尝试对远程数据中心进行推测性重试吗？

最佳方法将Cassandra ResultSet转换为Java对象在OSS驱动程序中

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论