How to use Java to very quickly insert records into a Cassandra table

Question

I am new to Cassandra, so I may be missing something. My goal is to insert 500,000 rows as quickly as possible using Java (the DataStax driver). It is currently inserting only about 400 records per second, so the full 500,000 inserts take many minutes to complete. Duplicates are possible in the ArrayList, so the insert process should behave as an insert/update (in other words, the Java list may contain duplicates, but the database table should contain only distinct values).

A select query returns the 500k records from Cassandra in less than 1 second, but inserting into Cassandra takes a very long time. I am hoping the insert of 500k records can take less than 10 seconds. What can I do to make the inserts much faster?

Here is the definition of the Cassandra table:

create table mykeyspace.mytablename
(
    my_id_record text primary key
);

Here is the Java insert code (only the relevant code is shown; error handling is omitted for simplicity):

String insertCQL = "INSERT INTO mykeyspace.mytablename(my_id_record) VALUES (?);";
PreparedStatement insertPrepStmnt = session.prepare(insertCQL);
for (String myId : myArrayList) {
    cassandraConnect.session.execute(insertPrepStmnt.bind(myId));
}

As you can see, it inserts 500,000 records of a single string value into a table with one field (the primary key field).

Is 400 inserts per second the expected speed for Cassandra?

Any suggestions for how to speed up the inserts would be greatly appreciated.


Answer 1

Score: 1

You are using the synchronous API - this means that you wait for the answer before inserting the next record. You can get much better throughput by using the asynchronous API, but you need to control how many requests per connection are in flight at the same time. You may need to control/tune connection pooling for that.
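
For example, with the DataStax Java driver 4.x the original loop could be rewritten roughly as below. This is a minimal sketch, not the answerer's code: the class name AsyncInserter, the MAX_IN_FLIGHT value, and the error handling are illustrative assumptions you would tune and replace for your cluster.

import java.util.List;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Semaphore;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class AsyncInserter {

    // Illustrative cap on concurrent in-flight requests; tune for your cluster and driver settings.
    private static final int MAX_IN_FLIGHT = 256;

    public static void insertAll(CqlSession session, List<String> myIds) throws InterruptedException {
        PreparedStatement insertPrepStmnt =
                session.prepare("INSERT INTO mykeyspace.mytablename (my_id_record) VALUES (?)");

        // The semaphore throttles how many executeAsync() calls are outstanding at once.
        Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

        for (String myId : myIds) {
            inFlight.acquire();  // blocks when MAX_IN_FLIGHT requests are already pending
            CompletionStage<AsyncResultSet> stage = session.executeAsync(insertPrepStmnt.bind(myId));
            stage.whenComplete((resultSet, error) -> {
                inFlight.release();  // free a slot whether the insert succeeded or failed
                if (error != null) {
                    error.printStackTrace();  // real code would log and retry
                }
            });
        }

        // Wait until the last in-flight requests have completed.
        inFlight.acquire(MAX_IN_FLIGHT);
        inFlight.release(MAX_IN_FLIGHT);
    }
}

The key difference from the original loop is that executeAsync() does not block, so many inserts overlap on the wire, while the semaphore keeps the number of pending requests bounded instead of flooding the cluster.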

But if you really want to load data from files, such as CSV or JSON, then I recommend looking at DSBulk. If you just want to generate test data, use NoSQLBench. Both tools are heavily optimized for maximum throughput.
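
For instance, if the ids were first written out to a CSV file (one id per line), a DSBulk load could look roughly like this; the file name my_ids.csv is hypothetical, and you should check the DSBulk documentation for the exact options of your version:

dsbulk load -k mykeyspace -t mytablename -url my_ids.csv -header false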

