Observed data loss while fetching records using Cassandra JDBC driver

Table Structure: testkeyspace.productInfo (productID, productName, timestamp, price)
  • The partition key column in the above table is "productId".
  • The table testkeyspace.productInfo consists of 10k records.

Test Case

  • Test consists of two Java applications
    1. Application 1 -> Writes data to Cassandra table (testkeyspace.productInfo)
    2. Application 2 -> Reads from the Cassandra table (testkeyspace.productInfo) continuously at a 5-second interval
  • Partition Key (productID) is unique for all records. Hence,
    Number of partitions = Number of records in table

How data is read from Cassandra table? (Read Pattern)

  1. Initially, a "SELECT *" is issued to the Cassandra table through the Cassandra JDBC driver
  2. The latest timestamp (lastNotedTimestamp) of the fetched records is noted from the table's timestamp column
  3. Subsequent "SELECT *" queries are issued with a WHERE condition of timestamp > lastNotedTimestamp
  4. Steps 2 and 3 are repeated until the application is killed (a minimal sketch of this loop follows the list).
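
For illustration only, a minimal sketch of this polling loop, assuming a Cassandra JDBC wrapper that accepts a jdbc:cassandra:// URL and supports prepared statements; the class name, helper method, host, and port are made up for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;

public class ProductInfoPoller {
    public static void main(String[] args) throws Exception {
        // The JDBC URL format depends on the wrapper in use; this one is an assumption.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:cassandra://127.0.0.1:9042/testkeyspace")) {

            Timestamp lastNotedTimestamp = null;

            while (true) {
                if (lastNotedTimestamp == null) {
                    // Step 1: initial full table read
                    try (Statement stmt = conn.createStatement();
                         ResultSet rs = stmt.executeQuery(
                                 "SELECT * FROM testkeyspace.productInfo")) {
                        lastNotedTimestamp = noteLatest(rs, lastNotedTimestamp);
                    }
                } else {
                    // Step 3: incremental read; filtering on a non-key column
                    // needs ALLOW FILTERING (or a secondary index)
                    try (PreparedStatement ps = conn.prepareStatement(
                            "SELECT * FROM testkeyspace.productInfo "
                          + "WHERE timestamp > ? ALLOW FILTERING")) {
                        ps.setTimestamp(1, lastNotedTimestamp);
                        try (ResultSet rs = ps.executeQuery()) {
                            lastNotedTimestamp = noteLatest(rs, lastNotedTimestamp);
                        }
                    }
                }
                Thread.sleep(5000); // Step 4: wait 5 seconds, then repeat
            }
        }
    }

    // Step 2: remember the latest timestamp seen in the fetched rows
    private static Timestamp noteLatest(ResultSet rs, Timestamp latest) throws Exception {
        while (rs.next()) {
            Timestamp ts = rs.getTimestamp("timestamp");
            if (latest == null || ts.after(latest)) {
                latest = ts;
            }
            // process the row here ...
        }
        return latest;
    }
}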

Issue

  1. When Step 4 is executed in parallel with the write application, a few random records are missing from the ResultSet returned by the JDBC call.
  2. When Step 4 is executed after the write application has completed its task, no records are missed and the fetch is successful.

Note: The above issue also occurs when a numeric column is used in the WHERE clause of the SELECT query.

This was tested and observed on a single-node cluster.

Can someone explain this behaviour and let me know why this is happening?

Thanks in advance.

Answer 1

Score: 1

> When Step 4 is executed in parallel with the write application, a few random records are missing from the ResultSet returned by the JDBC call.

> Can someone explain this behaviour and let me know why this is happening?

Sure. The problem is with the queries:

> Initially, "SELECT *" is issued to Cassandra table

No mention of a WHERE clause, so I'll assume that there isn't one.

> Subsequent "SELECT *" queries are issued with a WHERE condition of timestamp > lastNotedTimestamp

So this one does have a WHERE clause. However, I don't see the partition key inside of it.

When you query Cassandra without specifying a partition key, there are no guarantees that the query can be served by a single node. This means (for both queries) that one node is being selected as a coordinator, to query the other nodes exhaustively. Once that node has assembled the result set (in RAM) it then returns it to the application.

Non-partition key queries cause Cassandra nodes to work very hard. Now, add into the mix that the cluster is also being asked to handle the throughput of the write application, and I'm not surprised that you're seeing issues.

Essentially, this is happening because the current read/write patterns are too resource-intensive for your cluster to handle.

You can:

  • Design a specific table to suit your timestamp query pattern.
  • Use a multi-threaded approach to writing, throttling down the write throughput by limiting the number of concurrent write threads in the write application (a minimal sketch follows this list).
  • Never use SELECT * (without a WHERE clause) or the ALLOW FILTERING directive.
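
As a rough example of the second point (not the only way to do it, and the thread count, connection handling, and column types are assumptions), a fixed-size thread pool caps how many writes are in flight at once:

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThrottledWriter {
    // One connection per worker thread, opened lazily
    private static final ThreadLocal<Connection> CONN = ThreadLocal.withInitial(() -> {
        try {
            return DriverManager.getConnection("jdbc:cassandra://127.0.0.1:9042/testkeyspace");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    });

    public static void main(String[] args) throws Exception {
        // The pool size is the throttle: at most 4 writes are in flight at any moment
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 10_000; i++) {
            final int n = i;
            pool.submit(() -> {
                // Column types are assumed here: TEXT keys/name, DECIMAL price
                try (PreparedStatement ps = CONN.get().prepareStatement(
                        "INSERT INTO testkeyspace.productInfo "
                      + "(productId, productName, timestamp, price) "
                      + "VALUES (?, ?, toTimestamp(now()), ?)")) {
                    ps.setString(1, "P" + n);
                    ps.setString(2, "product-" + n);
                    ps.setBigDecimal(3, new BigDecimal("9.99"));
                    ps.executeUpdate();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}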

Edit 20230518

> I tried with the partition key column in the WHERE clause, like "SELECT * FROM keyspace.table where partitionkeyColumn > latestPartitionColumnObservedPosition ALLOW FILTERING". I observed a few records missing randomly for this case also.

Yes, because that query is still causing stress on all of the nodes. The partition key needs to be specified with an equality operator. Sure, it's getting a partition key. But all of the partition keys greater than latestPartitionColumnObservedPosition will still be spread across multiple nodes, so nothing is improving.

Given the table definition provided above, this table can support one performant query:

SELECT * FROM productInfo WHERE productId=?;

That's it. If there's any other query running on that table, it's going to be hitting multiple nodes and causing them to work hard.

If the concern is about the timestamp, then you might try building a new table with a "bucketing" approach, like this:

CREATE TABLE productinfo_by_day_bucket (
    productid TEXT,
    daybucket INT,
    productname TEXT,
    producttimestamp TIMESTAMP,
    price DECIMAL,
    PRIMARY KEY (daybucket, producttimestamp, productid)
) WITH CLUSTERING ORDER BY (producttimestamp DESC, productid ASC);

This will store products updated for a particular day in a single partition, meaning that they can be queried like:

SELECT * FROM productinfo_by_day_bucket
WHERE daybucket=20230518
AND producttimestamp > '2023-05-18 14:10:05.705000+0000';

Depending on how many products are updated in a particular day, that "time bucket" may need to be a different unit of measure. But that's up to the business requirements. This also allows filtering on timestamp, as long as it's within a partition.
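
For illustration, the application side of that query might look something like this over JDBC (the connection details are the same assumptions as before, and the bucket is encoded as yyyyMMdd to match the INT partition key):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BucketedReader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:cassandra://127.0.0.1:9042/testkeyspace")) {

            // Day bucket encoded as yyyyMMdd to match the INT partition key above
            LocalDate today = LocalDate.now(ZoneOffset.UTC);
            int dayBucket = Integer.parseInt(today.format(DateTimeFormatter.BASIC_ISO_DATE));

            // In practice this would be the last timestamp already processed;
            // start of the current day is used here just to have a value
            Timestamp lastNotedTimestamp =
                    Timestamp.from(today.atStartOfDay(ZoneOffset.UTC).toInstant());

            // Partition key equality plus a clustering-column range:
            // a single-partition query, so no ALLOW FILTERING is needed
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM productinfo_by_day_bucket "
                  + "WHERE daybucket = ? AND producttimestamp > ?")) {
                ps.setInt(1, dayBucket);
                ps.setTimestamp(2, lastNotedTimestamp);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // Rows come back newest-first because of the DESC clustering order
                        System.out.println(rs.getString("productid") + " @ "
                                + rs.getTimestamp("producttimestamp"));
                    }
                }
            }
        }
    }
}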

Note that producttimestamp is specified as a clustering key, sorting in descending order. This is because most time-based application queries tend to care about the most-recent data.
Sorting in descending order allows queries to quickly pull records from the "top" of the partition, which will be faster.

productid is the last clustering key, but that's really just there as a tie-breaker to ensure uniqueness.

You should check out DataStax Academy, especially DS220 which is our free class on Cassandra data modeling. It really does a good job of explaining how everything works under-the-hood, and gives you data modeling and query building strategies to leverage that.

Answer 2

Score: 0

What you see as "data loss" is most likely inconsistency between replicas.

What is probably happening is that the nodes cannot keep up with the writes and are dropping mutations, leading to inconsistent replicas.

You need to run repairs regularly if you aren't already. But the long term solution is to add more nodes to your cluster so the nodes can keep up with the load.

In addition, I agree with Aaron that you need to redesign your data model because the table is not a good fit for your app query.

As a side note, it doesn't make sense to use the JDBC driver when you have a Java application. You should switch to using the Cassandra Java driver so you can natively query the database with CQL.
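
For example, with the Java driver (4.x) a session and a CQL query look roughly like this; the contact point and datacenter name assume a default local single-node install, and the bound value assumes productId is TEXT:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

import java.net.InetSocketAddress;

public class CqlReader {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1") // default DC name on a stock install
                .withKeyspace("testkeyspace")
                .build()) {

            // Single-partition query: the only performant pattern on the original table
            ResultSet rs = session.execute(
                    "SELECT * FROM productInfo WHERE productId = ?", "P42");
            for (Row row : rs) {
                // Unquoted identifiers are stored lowercase by Cassandra
                System.out.println(row.getString("productname"));
            }
        }
    }
}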

If you want to learn how, we have a lot of free tutorials and courses at datastax.com/dev. Cheers!
