Gremlin Spark Java Maven Project - Slow query response
Question
I have written a program to run some queries on top of Gremlin (I use JanusGraph with Cassandra and Solr as the engine) with the help of Spark, but the query response is terribly slow.
Most probably I have not set something up correctly.
Here is the code I have used.
Driver program:
import org.apache.commons.configuration.Configuration;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;
public class SimpleApp {

    public static void main(String[] args) throws Exception {
        Configuration config = GraphTraversalProvider.makeLocal();
        Graph hadoopGraph = GraphFactory.open(config);

        Long totalVertices = hadoopGraph.traversal().withComputer(SparkGraphComputer.class).V().count().next();
        System.out.println("IT WORKED: " + totalVertices);

        hadoopGraph.close();
    }
}
GraphTraversalProvider class
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import org.apache.tinkerpop.gremlin.hadoop.Constants;
public class GraphTraversalProvider {

    private static final String KEY_SPACE = "janusgraph";
    private static final String CASSANDRA_ADDRESS = "localhost";

    public static Configuration makeLocal() {
        return make(true);
    }

    public static Configuration makeRemote() {
        return make(false);
    }

    private static Configuration make(boolean local) {
        final Configuration hadoopConfig = new BaseConfiguration();

        hadoopConfig.setProperty("gremlin.graph", "org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph");
        hadoopConfig.setProperty(Constants.GREMLIN_HADOOP_GRAPH_READER, "org.janusgraph.hadoop.formats.cql.CqlInputFormat");
        hadoopConfig.setProperty(Constants.GREMLIN_HADOOP_GRAPH_WRITER, "org.apache.hadoop.mapreduce.lib.output.NullOutputFormat");

        hadoopConfig.setProperty(Constants.GREMLIN_HADOOP_JARS_IN_DISTRIBUTED_CACHE, true);
        hadoopConfig.setProperty(Constants.GREMLIN_HADOOP_INPUT_LOCATION, "none");
        hadoopConfig.setProperty(Constants.GREMLIN_HADOOP_OUTPUT_LOCATION, "output");
        hadoopConfig.setProperty(Constants.GREMLIN_SPARK_PERSIST_CONTEXT, true);

        hadoopConfig.setProperty("janusgraphmr.ioformat.conf.storage.backend", "cql");
        hadoopConfig.setProperty("janusgraphmr.ioformat.conf.storage.hostname", CASSANDRA_ADDRESS);
        hadoopConfig.setProperty("janusgraphmr.ioformat.conf.storage.port", "9042");
        hadoopConfig.setProperty("janusgraphmr.ioformat.conf.storage.cassandra.keyspace", KEY_SPACE);

        hadoopConfig.setProperty("cassandra.input.partitioner.class", "org.apache.cassandra.dht.Murmur3Partitioner");
        hadoopConfig.setProperty("cassandra.input.widerows", true);

        if (local) {
            // Run Spark locally with as many worker threads as logical cores on the machine.
            hadoopConfig.setProperty("spark.master", "local[*]");
        } else {
            hadoopConfig.setProperty("spark.master", "spark://ADD_YOUR_URL");
        }
        hadoopConfig.setProperty("spark.executor.memory", "2g");
        hadoopConfig.setProperty(Constants.SPARK_SERIALIZER, "org.apache.spark.serializer.KryoSerializer");
        hadoopConfig.setProperty("spark.kryo.registrator", "org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator");

        hadoopConfig.setProperty("storage.hostname", CASSANDRA_ADDRESS);
        hadoopConfig.setProperty("storage.cassandra.keyspace", KEY_SPACE);

        return hadoopConfig;
    }
}
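For reference, the same read configuration is often kept in a properties file instead (for example when driving the job from the Gremlin Console). The following is a sketch of an equivalent file; the file name would be your choice, the values simply mirror the Java code above, and `local[*]` would be swapped for the cluster master URL in the remote case:

```properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true

janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.hostname=localhost
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph

cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true

spark.master=local[*]
spark.executor.memory=2g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator
```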
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.ibc</groupId>
    <artifactId>sparkdemo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <janus.version>0.5.1</janus.version>
        <spark.version>2.4.0</spark.version>
        <gremlin.version>3.4.6</gremlin.version>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.2</version>
                <configuration>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.janusgraph</groupId>
            <artifactId>janusgraph-cassandra</artifactId>
            <version>${janus.version}</version>
        </dependency>
        <dependency>
            <groupId>org.janusgraph</groupId>
            <artifactId>janusgraph-hadoop</artifactId>
            <version>${janus.version}</version>
        </dependency>
        <dependency>
            <groupId>org.janusgraph</groupId>
            <artifactId>janusgraph-cql</artifactId>
            <version>${janus.version}</version>
        </dependency>
        <dependency>
            <groupId>org.janusgraph</groupId>
            <artifactId>janusgraph-solr</artifactId>
            <version>${janus.version}</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.2.3</version>
        </dependency>

        <!-- GREMLIN -->
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>spark-gremlin</artifactId>
            <version>${gremlin.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>jackson-databind</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>hadoop-gremlin</artifactId>
            <version>${gremlin.version}</version>
        </dependency>

        <!-- SPARK -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>${spark.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>com.fasterxml.jackson.core</groupId>
                    <artifactId>jackson-databind</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>19.0</version>
        </dependency>
    </dependencies>
</project>
The output is the following:
23:36:29,708 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to WARN
23:36:29,708 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [CONSOLE] to Logger[ROOT]
23:36:29,708 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
23:36:29,710 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@704d6e83 - Registering current configuration as safe fallback point
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
23:36:30.225 [main] WARN o.a.t.g.s.p.c.SparkGraphComputer - class org.apache.hadoop.mapreduce.lib.output.NullOutputFormat does not implement PersistResultGraphAware and thus, persistence options are unknown -- assuming all options are possible
23:36:30.516 [SparkGraphComputer-boss] WARN org.apache.spark.util.Utils - Your hostname, nchristidis-GL502VMK resolves to a loopback address: 127.0.1.1; using 192.168.1.12 instead (on interface wlp3s0)
23:36:30.516 [SparkGraphComputer-boss] WARN org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to another address
23:36:32.191 [SparkGraphComputer-boss] WARN o.a.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23:36:33.279 [SparkGraphComputer-boss] WARN o.a.t.g.s.p.c.SparkGraphComputer - HADOOP_GREMLIN_LIBS is not set -- proceeding regardless
23:36:35.266 [SparkGraphComputer-boss] WARN com.datastax.driver.core.NettyUtil - Found Netty's native epoll transport in the classpath, but epoll is not available. Using NIO instead.
IT WORKED: 43
23:39:32.111 [Thread-3] WARN org.apache.spark.SparkContext - Ignoring Exception while stopping SparkContext from shutdown hook
java.lang.NoSuchMethodError: io.netty.bootstrap.ServerBootstrap.config()Lio/netty/bootstrap/ServerBootstrapConfig;
at org.apache.spark.network.server.TransportServer.close(TransportServer.java:154)
at org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:180)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1615)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:90)
at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1974)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1973)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:575)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
Process finished with exit code 123
So I get the correct output, IT WORKED: 43
(43 is the total number of vertices), but it takes far too long.
Also this log message:
23:36:33.279 [SparkGraphComputer-boss] WARN o.a.t.g.s.p.c.SparkGraphComputer - HADOOP_GREMLIN_LIBS is not set -- proceeding regardless
highlights that most probably I have not set something up correctly.
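For what it's worth, that particular warning is about SparkGraphComputer not knowing which jars to distribute to the Spark workers. It can usually be silenced by exporting HADOOP_GREMLIN_LIBS before launching; a minimal sketch, where the directory paths are assumptions to be replaced with your own installation:

```shell
# Point HADOOP_GREMLIN_LIBS at colon-separated directories of jars that
# should be shipped to the Spark workers. These paths are placeholders.
export HADOOP_GREMLIN_LIBS="/opt/janusgraph/lib:/opt/spark-gremlin/lib"
echo "$HADOOP_GREMLIN_LIBS"
```

When the application runs from a single shaded jar on a local master, leaving the variable unset may well be harmless, which would match the "proceeding regardless" wording of the warning.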
=================================================================
Update: Tuesday, 27th of October
By submitting the program to a Spark cluster with one worker node, rather than running it locally from the IDE, the runtime dropped significantly, from 6 minutes to 3 minutes.
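For anyone reproducing this update, the cluster run amounts to building the fat jar (the shade plugin above attaches to the package phase) and handing it to the cluster. A sketch of the commands, where the master URL is a placeholder and SimpleApp would need to call GraphTraversalProvider.makeRemote() instead of makeLocal() to actually target the cluster:

```shell
mvn package

# Submit the shaded jar to the standalone Spark cluster.
spark-submit \
  --class SimpleApp \
  --master spark://your-spark-master:7077 \
  target/sparkdemo-1.0-SNAPSHOT.jar
```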
Answer 1 (score: 3)
OLAP-based Gremlin traversals will be much slower than standard OLTP traversals, even for small datasets. There is considerable cost just in getting Spark primed up to process your traversal. That overhead alone might easily give your OLAP query a one-minute handicap over OLTP. In the comments to your question you explained that your query is taking around six minutes. That does seem a bit on the long side, but it may be in the realm of normal for OLAP, depending on your environment.
Some graph databases will optimize for an OLAP count() and get you a pretty speedy result, but you tagged this question with "JanusGraph", so I don't think that applies here.
You typically don't see the value of OLAP-based traversals until you start concerning yourself with large-scale graphs. Compare counting 100+ million edges in OLAP versus OLTP and you probably won't mind waiting six minutes for an answer at all (OLTP might not finish at all).
It's hard to say what you might do to make your current setup faster, as you are really just proving that things work at this point. Now that you have a working model, I would suggest that the next step is to generate a significantly larger graph (10 million vertices, maybe) and try your count again with a decent-sized Spark cluster.