Why doesn't sbt compile the code, and why aren't the main libraries even recognized?


Question

This code demonstrates a Spark application that performs language ranking based on Wikipedia articles. It uses RDDs for distributed processing and leverages Spark's parallel processing capabilities. Nevertheless, the build fails with the following error:

> Extracting structure failed: Build status: Error
>
> sbt task failed, see log for details

and none of the imports work when I try to run the Scala script:

```scala
package wikipedia

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

case class WikipediaArticle(title: String, text: String)

object WikipediaRanking {

  val langs = List(
    "JavaScript", "Java", "PHP", "Python", "C#", "C++", "Ruby", "CSS",
    "Objective-C", "Perl", "Scala", "Haskell", "MATLAB", "Clojure", "Groovy")

  val conf: SparkConf = new SparkConf().setAppName("wikipedia").setMaster("local[*]")
  val sc: SparkContext = new SparkContext(conf)
  sc.setLogLevel("WARN")
  // Hint: use a combination of `sc.textFile`, `WikipediaData.filePath`, and `WikipediaData.parse`
  val wikiRdd: RDD[WikipediaArticle] = sc.textFile(WikipediaData.filePath).map(l => WikipediaData.parse(l)).cache()

  /** Returns the number of articles on which the language `lang` occurs.
    *  Hint1: consider using method `aggregate` on RDD[T].
    *  Hint2: should you count the "Java" language when you see "JavaScript"?
    *  Hint3: the only whitespaces are blanks " "
    *  Hint4: no need to search in the title :)
    */
  def occurrencesOfLang(lang: String, rdd: RDD[WikipediaArticle]): Int = {
    rdd.aggregate(0)((sum, article) => sum + isFound(article, lang), _+_)
  }

  def isFound(article: WikipediaArticle, lang: String): Int = if(article.text.split(" ").contains(lang)) 1 else 0

  /* (1) Use `occurrencesOfLang` to compute the ranking of the languages
   *     (`val langs`) by determining the number of Wikipedia articles that
   *     mention each language at least once. Don't forget to sort the
   *     languages by their occurrence, in decreasing order!
   *
   *   Note: this operation is long-running. It can potentially run for
   *   several seconds.
   */
  def rankLangs(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] = {
    val ranks = langs.map(lang => (lang, occurrencesOfLang(lang, rdd)))
    //for{ lang <- langs; occ = occurrencesOfLang(lang, rdd) if occ != 0} yield (lang, occ)
    ranks.sortBy(_._2).reverse
  }

  /* Compute an inverted index of the set of articles, mapping each language
     * to the Wikipedia pages in which it occurs.
     */
  def makeIndex(langs: List[String], rdd: RDD[WikipediaArticle]): RDD[(String, Iterable[WikipediaArticle])] = {
    val list = rdd.flatMap(article => for( lang <- langs if isFound(article, lang) == 1) yield (lang, article))
    list.groupByKey()
  }

  /* (2) Compute the language ranking again, but now using the inverted index. Can you notice
   *     a performance improvement?
   *
   *   Note: this operation is long-running. It can potentially run for
   *   several seconds.
   */
  def rankLangsUsingIndex(index: RDD[(String, Iterable[WikipediaArticle])]): List[(String, Int)] = {
    val ranks = index.mapValues(_.size).collect().toList.sortBy(-_._2)
    ranks
  }


  /* (3) Use `reduceByKey` so that the computation of the index and the ranking are combined.
   *     Can you notice an improvement in performance compared to measuring *both* the computation of the index
   *     and the computation of the ranking? If so, can you think of a reason?
   *
   *   Note: this operation is long-running. It can potentially run for
   *   several seconds.
   */
  def rankLangsReduceByKey(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] = {
    val list = rdd.flatMap(article => for( lang <- langs if isFound(article, lang) == 1) yield (lang, 1))
    list.reduceByKey(_+_).collect().toList.sortBy(_._2).reverse
  }

  def main(args: Array[String]) {

    /* Languages ranked according to (1) */
    val langsRanked: List[(String, Int)] = timed("Part 1: naive ranking", rankLangs(langs, wikiRdd))
    langsRanked.foreach(println)

    /* An inverted index mapping languages to Wikipedia pages on which they appear */
    def index: RDD[(String, Iterable[WikipediaArticle])] = makeIndex(langs, wikiRdd)

    /* Languages ranked according to (2), using the inverted index */
    val langsRanked2: List[(String, Int)] = timed("Part 2: ranking using inverted index", rankLangsUsingIndex(index))
    langsRanked2.foreach(println)

    /* Languages ranked according to (3) */
    val langsRanked3: List[(String, Int)] = timed("Part 3: ranking using reduceByKey", rankLangsReduceByKey(langs, wikiRdd))
    langsRanked3.foreach(println)

    /* Output the speed of each ranking */
    println(timing)
    sc.stop()
  }

  val timing = new StringBuffer
  def timed[T](label: String, code: => T): T = {
    val start = System.currentTimeMillis()
    val result = code
    val stop = System.currentTimeMillis()
    timing.append(s"Processing $label took ${stop - start} ms.\n")
    result
  }
}
```

I am also sharing my build.sbt and the WikipediaData object:

```scala
name := "YourProjectName"

version := "1.0"

scalaVersion := "2.11.8"

scalacOptions ++= Seq("-deprecation")

lazy val courseId = settingKey[String]("Course ID")
courseId := "e8VseYIYEeWxQQoymFg8zQ"

resolvers += Resolver.sonatypeRepo("releases")

libraryDependencies ++= Seq(
  "org.scala-sbt" % "sbt" % "1.1.6",
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql" % "2.1.0",
  "org.apache.commons" % "commons-lang3" % "3.12.0", // Apache Commons Lang
  "jline" % "jline" % "2.14.6"
)
```

And the WikipediaData object:

```scala
package wikipedia

import java.io.File

object WikipediaData {

  private[wikipedia] def filePath = {
    new File(this.getClass.getClassLoader.getResource("wikipedia/wikipedia.dat").toURI).getPath
  }

  private[wikipedia] def parse(line: String): WikipediaArticle = {
    val subs = "</title><text>"
    val i = line.indexOf(subs)
    val title = line.substring(14, i)
    val text  = line.substring(i + subs.length, line.length - 16)
    WikipediaArticle(title, text)
  }
}
```


I have tried different JDK versions (currently using OpenJDK 1.8), checked for compatible versions of Scala, Spark, and sbt, and declared all of these tools in the "Project File" menu. I have also tried the solutions from similar Stack Overflow answers, but nothing has worked so far.

Edit: The error messages are the following:

1. Whenever I load the sbt changes or run `sbt run`, the result is the following:

   (screenshot of the sbt error output)

2. Afterwards, when I run the project from WikipediaRanking.scala, I obtain the following:

   (screenshot of the run error output)

Answer 1

Score: 2


You are getting the error

```
[error] Modules were resolved with conflicting cross-version suffixes in ProjectRef(uri("file:/path/to/your/project/"), "name-of-the-project"):
[error]    org.scala-lang.modules:scala-xml _2.12, _2.11
[error]    org.scala-lang.modules:scala-parser-combinators _2.11, _2.12
```

because of the libraries you have declared:

If you look at the compile dependencies of sbt 1.1.6, you will find that one of them is scala 2.12.6. This creates a conflict with spark-core and spark-sql, because version 2.1.0 of those libraries has no release for scala 2.12 (when you publish a scala library, you can release it for several scala versions, which is called cross-building). So, when sbt tries to resolve the transitive dependencies, it ends up needing scala-xml and scala-parser-combinators for scala 2.11 and 2.12 at the same time, which is not possible.
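As an illustration of where the two suffixes come from, here is a minimal sketch of the relevant part of the build; the `%%`/`%` expansion is standard sbt behaviour, while exactly which module drags in which `_2.11` artifact is an assumption based on the error output above:

```scala
// Sketch: how the conflicting suffixes arise from the declared dependencies.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // `%` keeps the artifact name as-is: the sbt 1.1.6 jar is built against Scala 2.12
  // and transitively pulls in scala-xml_2.12 and scala-parser-combinators_2.12.
  "org.scala-sbt"    %  "sbt"        % "1.1.6",

  // `%%` appends the project's Scala binary version, so these resolve to
  // spark-core_2.11 / spark-sql_2.11, whose transitive dependencies carry the
  // _2.11 suffix -- hence both _2.11 and _2.12 variants of the same modules.
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql"  % "2.1.0"
)
```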


You have two ways of solving this issue:

Downgrade sbt to a version that depends on scala 2.11

Based on the name of your project, it sounds like you are taking a course whose goal is to learn apache-spark with scala, not sbt. In that case, the last sbt release that depends on scala 2.11 was 0.99.4. If the project examples that you have to download for the exercises depend on features that were added in sbt 1.x.x, you could try sbt 1.0.0-M4, which still depends on scala 2.11.
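A minimal sketch of what this first option would look like in the build above, using the versions mentioned in this answer (whether these exact `org.scala-sbt % sbt` artifacts exist and resolve cleanly should be double-checked against Maven Central):

```scala
// Sketch of option 1: stay on Scala 2.11 and use an sbt library that also targets 2.11.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.scala-sbt"    %  "sbt"        % "0.99.4",  // or "1.0.0-M4" if sbt 1.x features are needed
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql"  % "2.1.0"
)
```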

Upgrade spark-core and spark-sql to a version compatible with scala 2.12

The first release of spark-core and spark-sql compatible with scala 2.12 is 2.4.0, which is not far from the version you have selected. I am not sure which changes were introduced between the two; if you are following a tutorial, you may run into some differences in how things are done, which could slow your learning until you figure them out.
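A minimal sketch of this second option, assuming a recent scala 2.12.x patch release and the Spark 2.4.x artifacts that are cross-built for 2.12 (the sbt library stays at 1.1.6, which is itself built against scala 2.12, so all cross-version suffixes line up):

```scala
// Sketch of option 2: move the whole build to Scala 2.12.
scalaVersion := "2.12.10"  // any recent 2.12.x patch version should work

libraryDependencies ++= Seq(
  "org.scala-sbt"    %  "sbt"        % "1.1.6",  // built against Scala 2.12
  "org.apache.spark" %% "spark-core" % "2.4.0",  // first release cross-built for Scala 2.12
  "org.apache.spark" %% "spark-sql"  % "2.4.0"
)
```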


In addition to this, scala 2.11 is getting old. The available-versions page shows that 2.11 (last release on November 9, 2017) and 2.12 (last release on June 7, 2023) are only being maintained, while only scala 2.13.x and 3 are actively developed. This means that if you stay on an old version of scala, you may have to deal with library incompatibilities like the one you are facing now.

Maybe there is a newer edition of the course with updated versions of scala and spark. The latest spark version today is 3.4.1, which is compatible with scala 2.12 and scala 2.13; there is no official release for scala 3 yet.
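For reference, the dependency stanza for that current version would look roughly like this (a sketch assuming a scala 2.13.x build; the coordinates are the standard Spark ones):

```scala
// Sketch: current Spark on Scala 2.13.
scalaVersion := "2.13.11"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.4.1",  // resolves to spark-core_2.13
  "org.apache.spark" %% "spark-sql"  % "3.4.1"
)
```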
