pytest unittest spark java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset
Question
I am running unit tests with pytest for PySpark code. A sample of the code under test is given below. It looks like the Spark or Hadoop runtime libraries are expected, but I thought unit tests would not really need them: the pyspark Python package alone should be enough, because tools like Jenkins won't have a Spark runtime installed. Please advise.
def read_inputfile_from_ADLS(self):
    try:
        if self.segment == "US":
            # Use input_path_2 when it is set; otherwise fall back to input_path.
            if self.input_path_2 is None or self.input_path_2 == "":
                df = self.spark.read.format("delta").load(self.input_path)
            else:
                df = self.spark.read.format("delta").load(self.input_path_2)
    except Exception as e:
        resultmsg = "error reading input file"
Pytest code
import pytest
from unittest.mock import patch, MagicMock, Mock

class TestInputPreprocessor:
    inpprcr = None
    dataframe_reader = 'pyspark.sql.readwriter.DataFrameReader'

    def test_read_inputfile_from_ADLS(self, spark, tmp_path):
        self.segment = 'US'
        self.input_path_2 = tmp_path
        with patch(f'{self.dataframe_reader}.format', MagicMock(autospec=True)) as mock_adls_read:
            self.inpprcr.read_inputfile_from_ADLS()
            assert mock_adls_read.call_count == 1
Error:
AssertionError
---------------------------------------------- Captured stderr setup ----------------------------------------------
23/07/12 23:58:42 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/12 23:58:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Answer 1
Score: 1
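Fixed this issue. I had to download winutils.exe and point HADOOP_HOME at the folder that contains it (Hadoop looks for the file under %HADOOP_HOME%\bin), and set SPARK_HOME to the pyspark location in the Python lib directory:
'C:\Users\<networkid>\AppData\Local\Programs\Python\Python310\Lib\site-packages\pyspark'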
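As a rough illustration of the two settings above, the environment variables could also be set from Python before the Spark session is created, for example in a conftest.py. This is only a sketch: the helper name configure_spark_env and its parameters are hypothetical, and it assumes winutils.exe has already been placed under the bin subfolder of the given Hadoop folder.

```python
import os


def configure_spark_env(hadoop_home, spark_home, environ=None):
    """Set HADOOP_HOME and SPARK_HOME and put the Hadoop bin folder on PATH.

    hadoop_home: folder assumed to contain bin/winutils.exe (hypothetical layout).
    spark_home:  folder of the pip-installed pyspark package.
    environ:     mapping to update; defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    env["HADOOP_HOME"] = hadoop_home
    env["SPARK_HOME"] = spark_home

    # Prepend the bin folder so winutils.exe can also be found via PATH.
    bin_dir = os.path.join(hadoop_home, "bin")
    entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in entries:
        env["PATH"] = os.pathsep.join([bin_dir] + entries)
    return env
```

In a conftest.py this would be called before the fixture that builds the SparkSession, for instance with os.path.dirname(pyspark.__file__) as spark_home, since the variables need to be in place when PySpark launches the JVM.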
There is no need to install Hadoop or Spark on the local laptop for unit testing.
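For tests that only check which reader calls are made, as in the question, the session itself can be replaced by a mock so that neither a JVM nor winutils.exe is involved. The following is a minimal sketch under stated assumptions: InputPreprocessor is a hypothetical stand-in for the real class (its constructor and the return value are not from the original code), and the paths are made up.

```python
from unittest.mock import MagicMock


class InputPreprocessor:
    """Hypothetical stand-in for the class under test."""

    def __init__(self, spark, segment, input_path, input_path_2=None):
        self.spark = spark
        self.segment = segment
        self.input_path = input_path
        self.input_path_2 = input_path_2

    def read_inputfile_from_ADLS(self):
        if self.segment == "US":
            # Same fallback as the question: None or "" means use input_path.
            path = self.input_path_2 or self.input_path
            return self.spark.read.format("delta").load(path)
        return None


def test_read_inputfile_uses_delta_reader():
    # A MagicMock stands in for the SparkSession, so no Spark runtime starts.
    spark = MagicMock()
    preprocessor = InputPreprocessor(spark, "US", "/data/in")

    result = preprocessor.read_inputfile_from_ADLS()

    spark.read.format.assert_called_once_with("delta")
    spark.read.format.return_value.load.assert_called_once_with("/data/in")
    assert result is spark.read.format.return_value.load.return_value
```

Injecting the session through the constructor keeps the test independent of the local Spark setup. Tests that need real DataFrames still require a working session and therefore the environment variables described above.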