How do I verify the total number of rows inserted into a table?
Question
I have a table in Cassandra with 9 columns and this primary key:
PRIMARY KEY ((column1, column2), forecastdate)
I inserted 10,000 records into this table by reading from Cassandra v1.0.12 and writing to Cassandra v3.x with a Java program that issues INSERT queries.
To validate the data on the target Cassandra 3 host, I want to check the number of records inserted into v3.x.
If I run select count(*) from table_name;
it returns a huge number: 2,109,761.
Meanwhile, nodetool cfstats shows Number of Keys (estimated) as 6,450.
How can I validate the data after inserting it into the new Cassandra version?
Answer 1
Score: 2
The values reported by nodetool cfstats are not accurate until a flush or compaction has run.
> The nodetool cfstats command provides statistics about one or more tables. It's updated when SSTables change through compaction or flushing.
To count the number of rows in a table, I recommend the free tool DSBulk. A plain count(*) will time out quickly once the table holds a significant amount of data.
dsbulk count --stats.modes global -k myKeyspace -t myTable
Answer 2
Score: 0
The built-in CQL function COUNT() returns the number of rows in the whole table, not just the ones you last inserted (provided it doesn't time out).
Unless you have a way of filtering for just the data you inserted, any counting method you use will return all the records in the table.
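If you can get hold of the primary keys your migration program wrote (for example from its own log or from an export of the source table), one way to validate only your data is to compare those keys with the keys read back from the target. The sketch below assumes both sides are already available as in-memory tuples of (column1, column2, forecastdate); the function name and the sample values are hypothetical.

```python
def diff_primary_keys(source_keys, target_keys):
    """Compare primary keys written by the migration with keys found in the target.

    Both arguments are iterables of (column1, column2, forecastdate) tuples.
    """
    # Sets collapse repeated keys, mirroring Cassandra's upsert behaviour:
    # two INSERTs with the same primary key leave a single row.
    source = set(source_keys)
    target = set(target_keys)
    return {
        "source_unique": len(source),           # distinct rows the migration should produce
        "found": len(source & target),          # of those, how many exist in the target
        "missing": sorted(source - target),     # keys that never arrived
    }


# Hypothetical sample data, not taken from the question.
source_keys = [
    ("a", "x", "2023-01-01"),
    ("a", "x", "2023-01-02"),
    ("a", "x", "2023-01-02"),  # same primary key written twice
    ("b", "y", "2023-01-01"),
]
target_keys = [
    ("a", "x", "2023-01-01"),
    ("a", "x", "2023-01-02"),
    ("z", "z", "2020-01-01"),  # pre-existing row unrelated to the migration
]

report = diff_primary_keys(source_keys, target_keys)
print(report)
```

Because inserts with an identical primary key overwrite each other, 10,000 INSERT statements can legitimately produce fewer than 10,000 rows, so comparing distinct keys is more informative than comparing raw totals.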
As a side note, the number of keys reported by nodetool cfstats is just an estimate. If you're interested, I've explained why it is not an accurate count in Why COUNT() is bad in Cassandra.
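It also helps to remember that the two numbers measure different things: SELECT COUNT(*) counts CQL rows, while the key count in nodetool cfstats is an estimate of partition keys on the node where the command runs. With PRIMARY KEY ((column1, column2), forecastdate), every partition can hold many rows, one per forecastdate. The following is a minimal illustration with made-up data, not a query against a real cluster:

```python
from collections import defaultdict


def count_rows_and_partitions(rows):
    """Return (row_count, partition_count) for rows keyed like
    PRIMARY KEY ((column1, column2), forecastdate)."""
    partitions = defaultdict(set)
    for column1, column2, forecastdate in rows:
        # (column1, column2) is the partition key; forecastdate is the
        # clustering column that distinguishes rows inside one partition.
        partitions[(column1, column2)].add(forecastdate)
    row_count = sum(len(dates) for dates in partitions.values())
    return row_count, len(partitions)


# Hypothetical sample rows: (column1, column2, forecastdate)
sample_rows = [
    ("store-1", "sku-A", "2023-01-01"),
    ("store-1", "sku-A", "2023-01-02"),
    ("store-1", "sku-A", "2023-01-03"),
    ("store-2", "sku-B", "2023-01-01"),
    ("store-2", "sku-B", "2023-01-02"),
]

row_count, partition_count = count_rows_and_partitions(sample_rows)
print(f"rows={row_count} partitions={partition_count}")
```

Here five rows live in only two partitions, which is the same kind of gap as a row count in the millions next to a key estimate in the thousands.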
In any case, a more reliable way to count records in Cassandra is with the DataStax Bulk Loader (DSBulk) tool. It is open-source, so it's free to use. It was originally designed for bulk-loading data into and exporting data from a Cassandra cluster, as a scalable alternative to the cqlsh COPY command.
DSBulk has a count command that provides the same functionality as the CQL COUNT() function, but it is optimised to break the table scan up into small range queries, so it doesn't suffer from the same problems as brute-force counting.
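To give a feel for what "small range queries" means, here is a rough sketch of the general idea: split the token ring into contiguous slices and count each slice separately, then add up the results. This is not DSBulk's actual implementation. It assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63 - 1), and the function names, keyspace and table names are placeholders.

```python
MIN_TOKEN = -2**63       # lowest Murmur3Partitioner token
MAX_TOKEN = 2**63 - 1    # highest Murmur3Partitioner token


def split_token_ring(num_ranges):
    """Split the full token ring into num_ranges contiguous (start, end) slices."""
    if num_ranges < 1:
        raise ValueError("num_ranges must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // num_ranges for i in range(num_ranges + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_count_queries(keyspace, table, partition_key_columns, num_ranges):
    """Build one COUNT(*) query per token slice; summing the results gives the total."""
    token_expr = "token({})".format(", ".join(partition_key_columns))
    queries = []
    for i, (start, end) in enumerate(split_token_ring(num_ranges)):
        # The first slice includes its lower bound; later slices exclude it
        # so that no token is counted twice.
        lower_op = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT COUNT(*) FROM {keyspace}.{table} "
            f"WHERE {token_expr} {lower_op} {start} AND {token_expr} <= {end}"
        )
    return queries


queries = range_count_queries("ks_name", "table_name", ["column1", "column2"], 4)
for query in queries:
    print(query)
```

Each query touches only a slice of the data, so it is far less likely to hit a read timeout than a single scan of the whole table, and the slices can be run in parallel.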
DSBulk is quite simple to use and only takes a few minutes to set up. First, download the binaries from DataStax Downloads, then unpack the tarball. For details, see the DSBulk Installation Instructions.
Once you've got it installed, you can count the rows in a table with one command:
$ cd path/to/dsbulk_installation
$ bin/dsbulk count -h <node_ip> -k ks_name -t table_name
Here are some references with examples to help you get started quickly:
- Docs - Counting data in tables
- Blog - Counting records with DSBulk
- Blog - DSBulk Intro + Loading data