How to use Java to very quickly insert records into a Cassandra table

Question

I am new to Cassandra, so I may be missing something. My goal is to insert 500,000 rows as quickly as possible using Java (the DataStax driver). It is currently inserting only about 400 records per second, so the full 500,000 inserts take many minutes to complete. Duplicates are possible in the ArrayList, so the insert process should behave as an insert/update (in other words, the Java list may contain duplicates, but the database table should contain only distinct values).

A select query returns the 500k records from Cassandra in less than 1 second, but inserting into Cassandra takes a very long time. I am hoping the insert of 500k records can take less than 10 seconds. What can I do to make the inserts much faster?

Here is the definition of the Cassandra table:

create table mykeyspace.mytablename
(
    my_id_record text primary key
);

Here is the Java insert code (only the relevant code is shown; error handling is omitted for simplicity):

String insertCQL = "INSERT INTO mykeyspace.mytablename(my_id_record) VALUES (?);";
PreparedStatement insertPrepStmnt = session.prepare(insertCQL);
for (String myId : myArrayList) {
    cassandraConnect.session.execute(insertPrepStmnt.bind(myId));
}

As you can see, it inserts 500,000 records of a single string value into a table with one field (the primary key field).

Is 400 inserts per second the expected speed for Cassandra?

Any suggestions for how to speed up the inserts would be greatly appreciated.


Answer 1

Score: 1

You are using the synchronous API - this means that you wait for the answer before inserting the next record. You can get much better throughput by using the asynchronous API, but you need to control how many requests per connection are in flight at the same time. You may need to control/tune connection pooling for that.
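
For example, with the DataStax Java driver 4.x the original loop could be rewritten roughly as below. This is a minimal sketch, not the answerer's code: the class name AsyncInserter, the MAX_IN_FLIGHT value, and the error handling are illustrative assumptions you would tune and replace for your cluster.

import java.util.List;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Semaphore;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class AsyncInserter {

    // Illustrative cap on concurrent in-flight requests; tune for your cluster and driver settings.
    private static final int MAX_IN_FLIGHT = 256;

    public static void insertAll(CqlSession session, List<String> myIds) throws InterruptedException {
        PreparedStatement insertPrepStmnt =
                session.prepare("INSERT INTO mykeyspace.mytablename (my_id_record) VALUES (?)");

        // The semaphore throttles how many executeAsync() calls are outstanding at once.
        Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

        for (String myId : myIds) {
            inFlight.acquire();  // blocks when MAX_IN_FLIGHT requests are already pending
            CompletionStage<AsyncResultSet> stage = session.executeAsync(insertPrepStmnt.bind(myId));
            stage.whenComplete((resultSet, error) -> {
                inFlight.release();  // free a slot whether the insert succeeded or failed
                if (error != null) {
                    error.printStackTrace();  // real code would log and retry
                }
            });
        }

        // Wait until the last in-flight requests have completed.
        inFlight.acquire(MAX_IN_FLIGHT);
        inFlight.release(MAX_IN_FLIGHT);
    }
}

The key difference from the original loop is that executeAsync() does not block, so many inserts overlap on the wire, while the semaphore keeps the number of pending requests bounded instead of flooding the cluster.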

But if you really want to load data from files, such as CSV or JSON, then I recommend looking at DSBulk. If you just want to generate test data, use NoSQLBench. Both tools are heavily optimized for maximum throughput.
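
For instance, if the ids were first written out to a CSV file (one id per line), a DSBulk load could look roughly like this; the file name my_ids.csv is hypothetical, and you should check the DSBulk documentation for the exact options of your version:

dsbulk load -k mykeyspace -t mytablename -url my_ids.csv -header false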

