Why doesn't sbt compile the code, and why aren't the main libraries even recognized?


Question

This code demonstrates a Spark application that performs language ranking based on Wikipedia articles. It uses RDDs for distributed processing and leverages Spark's parallel processing capabilities. Nevertheless, the build fails with the following error:

> Extracting structure failed: Build status: Error
>
> sbt task failed, see log for details

and none of the imports work when I try to run the Scala script:

```scala
package wikipedia

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

case class WikipediaArticle(title: String, text: String)

object WikipediaRanking {

  val langs = List(
    "JavaScript", "Java", "PHP", "Python", "C#", "C++", "Ruby", "CSS",
    "Objective-C", "Perl", "Scala", "Haskell", "MATLAB", "Clojure", "Groovy")

  val conf: SparkConf = new SparkConf().setAppName("wikipedia").setMaster("local[*]")
  val sc: SparkContext = new SparkContext(conf)
  sc.setLogLevel("WARN")
  // Hint: use a combination of `sc.textFile`, `WikipediaData.filePath`, and `WikipediaData.parse`
  val wikiRdd: RDD[WikipediaArticle] = sc.textFile(WikipediaData.filePath).map(l => WikipediaData.parse(l)).cache()

  /** Returns the number of articles on which the language `lang` occurs.
    *  Hint1: consider using method `aggregate` on RDD[T].
    *  Hint2: should you count the "Java" language when you see "JavaScript"?
    *  Hint3: the only whitespaces are blanks " "
    *  Hint4: no need to search in the title :)
    */
  def occurrencesOfLang(lang: String, rdd: RDD[WikipediaArticle]): Int = {
    rdd.aggregate(0)((sum, article) => sum + isFound(article, lang), _+_)
  }

  def isFound(article: WikipediaArticle, lang: String): Int = if(article.text.split(" ").contains(lang)) 1 else 0

  /* (1) Use `occurrencesOfLang` to compute the ranking of the languages
   *     (`val langs`) by determining the number of Wikipedia articles that
   *     mention each language at least once. Don't forget to sort the
   *     languages by their occurrence, in decreasing order!
   *
   *   Note: this operation is long-running. It can potentially run for
   *   several seconds.
   */
  def rankLangs(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] = {
    val ranks = langs.map(lang => (lang, occurrencesOfLang(lang, rdd)))
    //for{ lang <- langs; occ = occurrencesOfLang(lang, rdd) if occ != 0} yield (lang, occ)
    ranks.sortBy(_._2).reverse
  }

  /* Compute an inverted index of the set of articles, mapping each language
     * to the Wikipedia pages in which it occurs.
     */
  def makeIndex(langs: List[String], rdd: RDD[WikipediaArticle]): RDD[(String, Iterable[WikipediaArticle])] = {
    val list = rdd.flatMap(article => for( lang <- langs if isFound(article, lang) == 1) yield (lang, article))
    list.groupByKey()
  }

  /* (2) Compute the language ranking again, but now using the inverted index. Can you notice
   *     a performance improvement?
   *
   *   Note: this operation is long-running. It can potentially run for
   *   several seconds.
   */
  def rankLangsUsingIndex(index: RDD[(String, Iterable[WikipediaArticle])]): List[(String, Int)] = {
    val ranks = index.mapValues(_.size).collect().toList.sortBy(-_._2)
    ranks
  }


  /* (3) Use `reduceByKey` so that the computation of the index and the ranking are combined.
   *     Can you notice an improvement in performance compared to measuring *both* the computation of the index
   *     and the computation of the ranking? If so, can you think of a reason?
   *
   *   Note: this operation is long-running. It can potentially run for
   *   several seconds.
   */
  def rankLangsReduceByKey(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] = {
    val list = rdd.flatMap(article => for( lang <- langs if isFound(article, lang) == 1) yield (lang, 1))
    list.reduceByKey(_+_).collect().toList.sortBy(_._2).reverse
  }

  def main(args: Array[String]) {

    /* Languages ranked according to (1) */
    val langsRanked: List[(String, Int)] = timed("Part 1: naive ranking", rankLangs(langs, wikiRdd))
    langsRanked.foreach(println)

    /* An inverted index mapping languages to Wikipedia pages on which they appear */
    def index: RDD[(String, Iterable[WikipediaArticle])] = makeIndex(langs, wikiRdd)

    /* Languages ranked according to (2), using the inverted index */
    val langsRanked2: List[(String, Int)] = timed("Part 2: ranking using inverted index", rankLangsUsingIndex(index))
    langsRanked2.foreach(println)

    /* Languages ranked according to (3) */
    val langsRanked3: List[(String, Int)] = timed("Part 3: ranking using reduceByKey", rankLangsReduceByKey(langs, wikiRdd))
    langsRanked3.foreach(println)

    /* Output the speed of each ranking */
    println(timing)
    sc.stop()
  }

  val timing = new StringBuffer
  def timed[T](label: String, code: => T): T = {
    val start = System.currentTimeMillis()
    val result = code
    val stop = System.currentTimeMillis()
    timing.append(s"Processing $label took ${stop - start} ms.\n")
    result
  }
}
```

I am also sharing my build.sbt and the WikipediaData object:

```scala
name := "YourProjectName"

version := "1.0"

scalaVersion := "2.11.8"

scalacOptions ++= Seq("-deprecation")

lazy val courseId = settingKey[String]("Course ID")
courseId := "e8VseYIYEeWxQQoymFg8zQ"

resolvers += Resolver.sonatypeRepo("releases")

libraryDependencies ++= Seq(
  "org.scala-sbt" % "sbt" % "1.1.6",
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql" % "2.1.0",
  "org.apache.commons" % "commons-lang3" % "3.12.0", // Apache Commons Lang
  "jline" % "jline" % "2.14.6"
)
```

And the WikipediaData object:

```scala
package wikipedia

import java.io.File

object WikipediaData {

  private[wikipedia] def filePath = {
    new File(this.getClass.getClassLoader.getResource("wikipedia/wikipedia.dat").toURI).getPath
  }

  private[wikipedia] def parse(line: String): WikipediaArticle = {
    val subs = "</title><text>"
    val i = line.indexOf(subs)
    val title = line.substring(14, i)
    val text  = line.substring(i + subs.length, line.length - 16)
    WikipediaArticle(title, text)
  }
}
```


I have tried different JDK versions (currently using OpenJDK 1.8), checked for compatible versions of Scala, Spark, and sbt, and declared all of these tools in the "Project File" menu. I have also tried the solutions from similar Stack Overflow answers, but nothing has worked so far.

Edit: The error messages are the following:

1. Whenever I load the sbt changes or run `sbt run`, the result is the following:

   (screenshot of the sbt error output)

2. Afterwards, when I run the project from WikipediaRanking.scala, I obtain the following:

   (screenshot of the run error output)

Answer 1

Score: 2


You are getting the error

```
[error] Modules were resolved with conflicting cross-version suffixes in ProjectRef(uri("file:/path/to/your/project/"), "name-of-the-project"):
[error]    org.scala-lang.modules:scala-xml _2.12, _2.11
[error]    org.scala-lang.modules:scala-parser-combinators _2.11, _2.12
```

because of the libraries you have declared:

If you look at the compile dependencies of sbt 1.1.6, you will find that one of them is scala 2.12.6. This creates a conflict with spark-core and spark-sql, because version 2.1.0 of those libraries has no release for scala 2.12 (when you publish a scala library, you can release it for several scala versions, which is called cross-building). So, when sbt tries to resolve the transitive dependencies, it ends up needing scala-xml and scala-parser-combinators for scala 2.11 and 2.12 at the same time, which is not possible.
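As an illustration of where the two suffixes come from, here is a minimal sketch of the relevant part of the build; the `%%`/`%` expansion is standard sbt behaviour, while exactly which module drags in which `_2.11` artifact is an assumption based on the error output above:

```scala
// Sketch: how the conflicting suffixes arise from the declared dependencies.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // `%` keeps the artifact name as-is: the sbt 1.1.6 jar is built against Scala 2.12
  // and transitively pulls in scala-xml_2.12 and scala-parser-combinators_2.12.
  "org.scala-sbt"    %  "sbt"        % "1.1.6",

  // `%%` appends the project's Scala binary version, so these resolve to
  // spark-core_2.11 / spark-sql_2.11, whose transitive dependencies carry the
  // _2.11 suffix -- hence both _2.11 and _2.12 variants of the same modules.
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql"  % "2.1.0"
)
```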


You have two ways of solving this issue:

Downgrade sbt to a version that depends on scala 2.11

Based on the name of your project, it sounds like you are taking a course whose goal is to learn apache-spark with scala, not sbt. In that case, the last sbt release that depends on scala 2.11 was 0.99.4. If the project examples that you have to download for the exercises depend on features that were added in sbt 1.x.x, you could try sbt 1.0.0-M4, which still depends on scala 2.11.
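A minimal sketch of what this first option would look like in the build above, using the versions mentioned in this answer (whether these exact `org.scala-sbt % sbt` artifacts exist and resolve cleanly should be double-checked against Maven Central):

```scala
// Sketch of option 1: stay on Scala 2.11 and use an sbt library that also targets 2.11.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.scala-sbt"    %  "sbt"        % "0.99.4",  // or "1.0.0-M4" if sbt 1.x features are needed
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql"  % "2.1.0"
)
```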

Upgrade spark-core and spark-sql to a version compatible with scala 2.12

The first release of spark-core and spark-sql compatible with scala 2.12 is 2.4.0, which is not far from the version you have selected. I am not sure which changes were introduced between the two; if you are following a tutorial, you may run into some differences in how things are done, which could slow your learning until you figure them out.
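A minimal sketch of this second option, assuming a recent scala 2.12.x patch release and the Spark 2.4.x artifacts that are cross-built for 2.12 (the sbt library stays at 1.1.6, which is itself built against scala 2.12, so all cross-version suffixes line up):

```scala
// Sketch of option 2: move the whole build to Scala 2.12.
scalaVersion := "2.12.10"  // any recent 2.12.x patch version should work

libraryDependencies ++= Seq(
  "org.scala-sbt"    %  "sbt"        % "1.1.6",  // built against Scala 2.12
  "org.apache.spark" %% "spark-core" % "2.4.0",  // first release cross-built for Scala 2.12
  "org.apache.spark" %% "spark-sql"  % "2.4.0"
)
```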


In addition to this, scala 2.11 is getting old. The available-versions page shows that 2.11 (last release on November 9, 2017) and 2.12 (last release on June 7, 2023) are only being maintained, while only scala 2.13.x and 3 are actively developed. This means that if you stay on an old version of scala, you may have to deal with library incompatibilities like the one you are facing now.

Maybe there is a newer edition of the course with updated versions of scala and spark. The latest spark version today is 3.4.1, which is compatible with scala 2.12 and scala 2.13; there is no official release for scala 3 yet.
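For reference, the dependency stanza for that current version would look roughly like this (a sketch assuming a scala 2.13.x build; the coordinates are the standard Spark ones):

```scala
// Sketch: current Spark on Scala 2.13.
scalaVersion := "2.13.11"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.4.1",  // resolves to spark-core_2.13
  "org.apache.spark" %% "spark-sql"  % "3.4.1"
)
```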
