英文:
Not allowing to set spark.sql.warehouse.dir , it should be set statically for cross-session usages using Java
问题
我正在尝试学习Spark,但在这里我遇到了一个异常
(不允许设置spark.sql.warehouse.dir,应该静态设置用于跨会话使用,异常为 *Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/DistributedFileSystem*)。我在Windows 10电脑上工作。
**主类:**
package com.rakib;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.logging.Level;
import java.util.logging.Logger;
public class App {
public static void main(String[] args) {
System.setProperty("hadoop.home.dir", "c:/hadoop");
Logger.getLogger("org.apache").setLevel(Level.WARNING);
SparkSession session = SparkSession.builder().appName("SparkSQL").master("local[*]")
.config("spark.sql.warehouse.dir", "file:///c:/temp/")
.getOrCreate();
Dataset<Row> dataSet = session.read().option("header", true).csv("src/main/resources/student.csv");
dataSet.show();
long numberOfRows = dataSet.count();
System.out.println("Total : " + numberOfRows);
session.close();
}
}
**异常:**
20/08/17 12:12:27 INFO SharedState: 正在将 hive.metastore.warehouse.dir('null')设置为 spark.sql.warehouse.dir 的值('file:///c:/temp/')。
20/08/17 12:12:27 INFO SharedState: 仓库路径为 'file:///c:/temp/'。
20/08/17 12:12:27 **WARN SharedState: 不允许在 SparkSession 的选项中设置 spark.sql.warehouse.dir 或 hive.metastore.warehouse.dir,应该静态设置用于跨会话使用
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/DistributedFileSystem
at** org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.listLeafFiles(InMemoryFileIndex.scala:316)
...
**Pom.XML**
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>Test_One</artifactId>
<version>1.0-SNAPSHOT</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
</dependencies>
</project>
英文:
I try To learn Spark but in here i found an Exception
(Not allowing to set spark.sql.warehouse.dir, it should be set statically for cross-session usages and exception is Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/DistributedFileSystem). I am working on windows 10 pc.
Main Class:
package com.rakib;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.logging.Level;
import java.util.logging.Logger;
public class App {
public static void main(String[] args) {
System.setProperty("hadoop.home.dir", "c:/hadoop");
Logger.getLogger("org.apache").setLevel(Level.WARNING);
SparkSession session = SparkSession.builder().appName("SparkSQL").master("local[*]")
.config("spark.sql.warehouse.dir", "file:///c:/temp/")
.getOrCreate();
Dataset<Row> dataSet = session.read().option("header", true).csv("src/main/resources/student.csv");
dataSet.show();
long numberOfRows = dataSet.count();
System.out.println("Total : " + numberOfRows);
session.close();
}
}
Exception:
20/08/17 12:12:27 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:///c:/temp/').
20/08/17 12:12:27 INFO SharedState: Warehouse path is 'file:///c:/temp/'.
20/08/17 12:12:27 **WARN SharedState: Not allowing to set spark.sql.warehouse.dir or hive.metastore.warehouse.dir in SparkSession's options, it should be set statically for cross-session usages
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/DistributedFileSystem
at** org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.listLeafFiles(InMemoryFileIndex.scala:316)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.$anonfun$bulkListLeafFiles$1(InMemoryFileIndex.scala:195)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:187)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:135)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:98)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:70)
at org.apache.spark.sql.execution.datasources.DataSource.createInMemoryFileIndex(DataSource.scala:561)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:399)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
at com.rakib.App.main(App.java:21)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 22 more
20/08/17 12:12:28 INFO SparkContext: Invoking stop() from shutdown hook
20/08/17 12:12:28 INFO SparkUI: Stopped Spark web UI at http://DESKTOP-3147U79:4040
20/08/17 12:12:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/08/17 12:12:28 INFO MemoryStore: MemoryStore cleared
20/08/17 12:12:28 INFO BlockManager: BlockManager stopped
20/08/17 12:12:28 INFO BlockManagerMaster: BlockManagerMaster stopped
20/08/17 12:12:28 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/08/17 12:12:28 INFO SparkContext: Successfully stopped SparkContext
20/08/17 12:12:28 INFO ShutdownHookManager: Shutdown hook called
20/08/17 12:12:28 INFO ShutdownHookManager: Deleting directory C:\Users\itc\AppData\Local\Temp\spark-ab377bad-43d5-48ad-a938-b99234abe546
Pom.XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>Test_One</artifactId>
<version>1.0-SNAPSHOT</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
</dependencies>
</project>
答案1
得分: 3
请将以下依赖项添加到您的 pom.xml
中,并尝试一下,应该可以正常工作,因为 org.apache.hadoop.hdfs.DistributedFileSystem
类是作为 hadoop-hdfs-client:3.3.0
依赖项的一部分存在的。参考链接:https://repo1.maven.org/maven2/org/apache/hadoop/
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs-client</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.0</version>
</dependency>
更新了您的 pom.xml
的依赖项:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>Test_One</artifactId>
<version>1.0-SNAPSHOT</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs-client</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.0</version>
</dependency>
</dependencies>
</project>
所以,请尝试一下,并让我知道您是否可以继续进行 Spark 运行?
英文:
please add the below dependencies with your pom.xml
and give a try and it should work since org.apache.hadoop.hdfs.DistributedFileSystem
class is as part of hadoop-hdfs-client:3.3.0
dependency in reference: https://repo1.maven.org/maven2/org/apache/hadoop/
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs-client</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.0</version>
</dependency>
dependency updated pom.xml
of yours,
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>Test_One</artifactId>
<version>1.0-SNAPSHOT</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs-client</artifactId>
<version>3.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.0</version>
</dependency>
</dependencies>
</project>
so please give a try and let me know If you can proceed with the spark run?
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论