AttributeError: 'NoneType' object has no attribute 'randomSplit'


I keep receiving an error when trying to call randomSplit in PySpark.

I've added these dependencies:

  #Step 1: Install Dependencies
  !apt-get install openjdk-8-jdk-headless -qq > /dev/null
  !wget -q https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
  !tar xf spark-3.3.0-bin-hadoop3.tgz
  !pip install -q findspark

  #Step 2: Add environment variables
  import os
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
  os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"

  #Step 3: Initialize PySpark
  import findspark
  findspark.init()

Created the pySpark environment:

  #creating spark session
  from pyspark.sql import SparkSession
  spark = SparkSession.builder.appName('lr_example').getOrCreate()

and added these:

  # Import VectorAssembler and Vectors
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.feature import VectorAssembler
  from pyspark.ml.regression import LinearRegression

However, every time I run this:

  final_df = output.select("features", "medv").show()
  train_data, test_data = final_df.randomSplit([0.7, 0.3])

I get this:

  ---------------------------------------------------------------------------
  AttributeError                            Traceback (most recent call last)
  <ipython-input-76-e27b8ca71b51> in <cell line: 1>()
  ----> 1 train_data, test_data = final_df.randomSplit([0.7, 0.3])

  AttributeError: 'NoneType' object has no attribute 'randomSplit'

Any ideas? I searched around for what needs to be imported and it seems I have everything, but it won't load. Link to GitHub doc

Answer 1

Score: 1


The only important line is the one you need to fix:

  final_df = output.select("features", "medv").show()

show() prints the results but returns None, so you are setting final_df to None.

Instead:

  final_df = output.select("features", "medv")  # create the DataFrame
  final_df.show()                               # print it
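The pitfall can be reproduced without Spark: any method that prints and implicitly returns None behaves the same way when you chain it into an assignment. A minimal plain-Python sketch (the Table class is hypothetical, standing in for a DataFrame):

```python
class Table:
    """Toy stand-in for a Spark DataFrame (hypothetical, for illustration)."""
    def __init__(self, rows):
        self.rows = rows

    def show(self):
        # Like DataFrame.show(): prints the rows, returns None implicitly.
        for row in self.rows:
            print(row)

    def random_split(self, weights):
        # Stand-in for randomSplit(); a deterministic cut keeps it testable.
        cut = int(len(self.rows) * weights[0])
        return Table(self.rows[:cut]), Table(self.rows[cut:])

t = Table([1, 2, 3, 4])

broken = t.show()        # show() returns nothing, so broken is None
print(broken is None)    # True -- calling broken.random_split() would fail

train, test = t.random_split([0.75, 0.25])  # call on the Table itself
print(len(train.rows), len(test.rows))      # 3 1
```

The same rule applies in PySpark: keep the DataFrame in a variable, and call show() as a separate statement purely for its printing side effect.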

huangapple
  • Published on 2023-04-04 09:56:24
  • Please keep this link when reposting: https://go.coder-hub.com/75924947.html