How do I verify the total rows inserted into a table?

Question

I have a table in Cassandra with 9 columns and the primary key ((column1, column2), forecastdate).
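
For reference, a minimal sketch of what such a schema could look like, assuming an illustrative keyspace name and placeholder names and types for the six non-key columns (only the key structure comes from the question):

CREATE TABLE my_keyspace.table_name (
    column1      text,
    column2      text,
    forecastdate timestamp,
    value1       double,   -- the six non-key columns are placeholders
    value2       double,
    value3       double,
    value4       double,
    value5       double,
    value6       double,
    PRIMARY KEY ((column1, column2), forecastdate)
);

With this layout, each (column1, column2) pair identifies one partition, and each forecastdate within it is a separate row.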

I have inserted 10,000 records into this table by reading from Cassandra v1.0.12 and writing to Cassandra v3.x with a Java program that issues INSERT queries.

To validate the data on the target Cassandra 3 host, I want to check the number of records inserted into the v3.x cluster.

If I run select count(*) from table_name; it returns a huge number: 2,109,761.

Meanwhile, nodetool cfstats shows the number of keys (estimated) as 6,450.

I want to understand how I can validate the data after inserting it into the new Cassandra version.

Answer 1

Score: 2


The values provided by cfstats are not accurate before performing a compaction or a flush.

> The nodetool cfstats command provides statistics about one or more tables. It's updated when SSTables change through compaction or flushing.

source

To count the number of rows in a table, I recommend the free tool DSBulk. Indeed, a count(*) will time out pretty quickly once there is a bit of volume.

dsbulk count --stats.modes global -k myKeyspace -t myTable

DSBulk reference documentation

Answer 2

Score: 0


The built-in CQL function COUNT() will return the number of partitions in the table, not just what you last inserted (provided it doesn't time out).

Unless you have a way of filtering just the data you inserted, any method you use to count will return all the records in the table.
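
For example, a count restricted to a single partition by specifying the full partition key is cheap and does not scan the whole table, but it only helps if you know exactly which partitions you wrote. A sketch, using the key layout from the question and hypothetical values:

SELECT COUNT(*) FROM table_name WHERE column1 = 'abc' AND column2 = 'xyz';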

As a side note, the number of keys reported by nodetool cfstats is just an estimate. If you're interested, I've explained why it is not an accurate count in Why COUNT() is bad in Cassandra.

In any case, a more reliable way to count records in Cassandra is with the DataStax Bulk Loader (DSBulk) tool. It is open-source so it's free to use. It was originally designed for bulk-loading data to and exporting data from a Cassandra cluster as a scalable solution for the cqlsh COPY command.

DSBulk has a count command that provides the same functionality as the CQL COUNT() function, but it has optimisations that break the table scan up into small range queries, so it doesn't suffer from the same problems as brute-force counting.
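
Conceptually, each of those range queries looks roughly like the CQL below. This is only a sketch of the idea; DSBulk generates and distributes such queries internally, and the token boundaries shown here are placeholder values:

SELECT COUNT(*) FROM table_name
 WHERE token(column1, column2) > -9223372036854775808
   AND token(column1, column2) <= -4611686018427387904;

Each query touches only a slice of the token ring, so no single request has to scan the entire table.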

DSBulk is quite simple to use and only takes a few minutes to set up. First, you need to download the binaries from DataStax Downloads, then unpack the tarball. For details, see the DSBulk Installation Instructions.

Once you've got it installed, you can count the partitions in a table with one command:

$ cd path/to/dsbulk_installation
$ bin/dsbulk count -h <node_ip> -k ks_name -t table_name

Here are some references with examples to help you get started quickly:
