AttributeError: 'NoneType' object has no attribute 'randomSplit'
Question
I keep getting an error when I call randomSplit in PySpark.
I've added these dependencies:
#Step 1: Install Dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
!pip install -q findspark
#Step 2: Add environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"
#Step 3: Initialize Pyspark
import findspark
findspark.init()
Created the pySpark environment:
#creating spark context
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_example').getOrCreate()
and added these:
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
However, every time I run this:
final_df = output.select("features", "medv").show()
train_data, test_data = final_df.randomSplit([0.7, 0.3])
I get this:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-76-e27b8ca71b51> in <cell line: 1>()
----> 1 train_data, test_data = final_df.randomSplit([0.7, 0.3])
AttributeError: 'NoneType' object has no attribute 'randomSplit'
Any ideas? I searched for what needs to be imported, and it seems I have everything, but it still fails. Link to GitHub doc
Answer 1
Score: 1
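You left out the only important line:
final_df = output.select("features", "medv").show()
show() prints the DataFrame but returns None, so final_df is set to None, and None has no randomSplit method.
Instead, assign the DataFrame first and call show() on it separately:
final_df = output.select("features", "medv")  # create the DataFrame
final_df.show()  # print it; the return value is not needed
train_data, test_data = final_df.randomSplit([0.7, 0.3])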
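The same pitfall can be reproduced without a Spark installation. The sketch below is only an illustration: `TinyFrame` is a hypothetical stand-in class, not part of PySpark, and its `randomSplit` is a simplified imitation that assumes rows are assigned to buckets at random in proportion to the weights. It shows that chaining a method that only prints (like `show()`) leaves the variable holding None, while keeping the frame in its own variable works.

```python
import random


class TinyFrame:
    """Hypothetical stand-in for a Spark DataFrame (NOT part of PySpark)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def show(self):
        # Like DataFrame.show(): prints the rows and implicitly returns None.
        for row in self.rows:
            print(row)

    def randomSplit(self, weights, seed=None):
        # Simplified imitation: each row lands in one bucket, chosen at
        # random with probability proportional to the bucket's weight.
        rng = random.Random(seed)
        parts = [[] for _ in weights]
        for row in self.rows:
            idx = rng.choices(range(len(weights)), weights=weights)[0]
            parts[idx].append(row)
        return [TinyFrame(part) for part in parts]


df = TinyFrame([(i, i * 2.0) for i in range(10)])

# Broken pattern: the variable receives show()'s return value, which is None.
broken = df.show()
try:
    broken.randomSplit([0.7, 0.3])
except AttributeError as exc:
    error_message = str(exc)  # same message as in the question's traceback

# Working pattern: keep the frame itself, then print it as a separate step.
final_df = df
final_df.show()
train_data, test_data = final_df.randomSplit([0.7, 0.3], seed=42)
```

In the real PySpark code the fix has the same shape: assign the result of `select(...)` to `final_df`, call `final_df.show()` on its own line, and only then call `randomSplit`. Passing a `seed` makes the split reproducible between runs.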