pyspark on Anaconda: ] was unexpected at this time

# Question
I am following [this page](https://sparkbyexamples.com/pyspark/install-pyspark-in-anaconda-jupyter-notebook) to install PySpark in Anaconda on Windows 10. In step #6 for validating PySpark, Python [could not be found](https://stackoverflow.com/questions/73823644). I found that [this answer](https://stackoverflow.com/a/76552980/2153235) initially helped me progress to the point of seeing the PySpark banner. Here is my adaptation of the solution in the form of commands issued at the Anaconda prompt (not the Anaconda Powershell prompt):
set PYSPARK_DRIVER_PYTHON=python
set PYSPARK_PYTHON=python
# set PYTHONPATH=C:\Users\<user>\anaconda3\pkgs\pyspark-3.4.0-pyhd8ed1ab_0\site-packages
set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages
pyspark
As shown above, the PYTHONPATH had to be modified to match the folder tree in my own installation. Essentially, I searched for a folder in `c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0` named `site-packages`. I assume that the PySpark version was selected by Conda during installation to satisfy package dependencies in the current `py39` environment, which contains Python 3.9. I use this version for compatibility with others.
PySpark ran for the *1st time* after this, but with many, many errors (see Annex below). As I am new to Python, Anaconda, and PySpark, I find the errors to be confusing to say the least. As shown in the Annex, however, I did get the Spark banner and the Python prompt.
As my very first step to troubleshooting the errors, I tried closing and reopening the Conda prompt window. However, the error from this *2nd run* of `pyspark` was *different* -- and equally confusing.
**pyspark output from *2nd* run:**
set PYSPARK_DRIVER_PYTHON=python
set PYSPARK_PYTHON=python
set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages
pyspark
] was unexpected at this time.
To trace the cause of this different error message, I searched for the file that is executed when I issue `pyspark`. Here are the candidate files:
where pyspark
C:\Users\User.Name\anaconda3\envs\py39\Scripts\pyspark
C:\Users\User.Name\anaconda3\envs\py39\Scripts\pyspark.cmd
I noted that the 1st script `pyspark` is a *Bash* script, so it's not surprising that "] was unexpected at this time." I assumed that the 2nd script `pyspark.cmd` is for invocation from Windows's CMD interpreter, of which the Conda prompt is a customization, e.g., by setting certain environment variables. Therefore, I ran `pyspark.cmd`, but it generated the same error "] was unexpected at this time." Apart from `@echo off`, the only command in `pyspark.cmd` is `cmd /V /E /C ""%~dp0pyspark2.cmd" %*"`, which is indecipherable to me.
***It seems odd that the Bash script `pyspark` is set up to run in a Conda environment on Windows. Is this caused by a fundamental nonsensicality in setting the 3 environment variables above prior to running `pyspark`?***
*And why would running `pyspark.cmd` generate the same error as running the Bash script?*
# Troubleshooting
I tracked the 2nd error message down to `C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts\pyspark2.cmd`. It is invoked by `pyspark.cmd` and also generates the unexpected `]` error:
cd C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts
pyspark2.cmd
] was unexpected at this time.
To locate the problematic statement, I manually issued each command in `pyspark2.cmd` but did *not* get the same error. Apart from REM statements, here is `pyspark2.cmd`:
REM `C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts\pyspark2.cmd`
REM -------------------------------------------------------------
@echo off
rem Figure out where the Spark framework is installed
call "%~dp0find-spark-home.cmd"
call "%SPARK_HOME%\bin\load-spark-env.cmd"
set _SPARK_CMD_USAGE=Usage: bin\pyspark.cmd [options]
rem Figure out which Python to use.
if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
set PYSPARK_DRIVER_PYTHON=python
if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON%
)
set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH%
set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py
call "%SPARK_HOME%\bin\spark-submit2.cmd" pyspark-shell-main --name "PySparkShell" %*
Here is my walkthrough of the above commands, slightly modified to account for the fact that they are executing at an interactive prompt rather than from within a script file:
REM ~/tmp/tmp.cmd mirrors pyspark2.cmd
REM ----------------------------------
REM Note that %SPARK_HOME%==
REM "c:\Users\%USERNAME%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages\pyspark"
cd C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts
call "find-spark-home.cmd"
call "%SPARK_HOME%\bin\load-spark-env.cmd"
set _SPARK_CMD_USAGE=Usage: bin\pyspark.cmd [options]
rem Figure out which Python to use.
REM Manually skipped this cuz %PYSPARK_DRIVER_PYTHON%=="python"
if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
set PYSPARK_DRIVER_PYTHON=python
if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON%
)
REM Manually skipped these two cuz they already prefix %PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH%
set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py
call "%SPARK_HOME%\bin\spark-submit2.cmd" pyspark-shell-main --name "PySparkShell" %*
The final statement above generates the following error:
Error: pyspark does not support any application options.
It is odd that `pyspark2.cmd` generates the unexpected `]` error while manually running each statement generates the above "application options" error.
# Update 2023-07-19
Over the past week, I have *sometimes* been able to get the Spark prompt shown in the Annex below. Other times, I get the dreaded `] was unexpected at this time.` It doesn't matter whether or not I start from a virgin Anaconda prompt. For both outcomes (Spark prompt vs. "unexpected ]"), the series of commands are:
(base) C:\Users\User.Name> conda activate py39
(py39) C:\Users\User.Name> set PYSPARK_DRIVER_PYTHON=python
(py39) C:\Users\User.Name> set PYSPARK_PYTHON=python
(py39) C:\Users\User.Name> set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages
(py39) C:\Users\User.Name> pyspark
# Update 2023-07-22
Due to the unrepeatable outcomes of issuing `pyspark`, I returned to troubleshooting by issuing each command in each invoked script. Careful bookkeeping was needed to keep track of the arguments `%*` in each script. The order of invocation is:
* `pyspark.cmd` calls `pyspark2.cmd`
* `pyspark2.cmd` calls `spark-submit2.cmd`
* `spark-submit2.cmd` executes `java`
The final `java` command is:
(py39) C:\Users\User.Name\anaconda3\envs\py39\Scripts> ^
"%RUNNER%" -Xmx128m ^
-cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main ^
org.apache.spark.deploy.SparkSubmit pyspark-shell-main ^
--name "PySparkShell" > %LAUNCHER_OUTPUT%
It generates the class-not-found error:
Error: Could not find or load main class org.apache.spark.launcher.Main
Caused by: java.lang.ClassNotFoundException: org.apache.spark.launcher.Main
Here are the environment variables:
%RUNNER% = java
%LAUNCH_CLASSPATH% = c:\Users\User.Name\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages\pyspark\jars\*
%LAUNCHER_OUTPUT% = C:\Users\User.Name\AppData\Local\Temp\spark-class-launcher-output-22633.txt
The RUNNER variable actually has two trailing spaces, and the quoted "%RUNNER%" invocation causes "java " to be unrecognized, so I removed the quotes.
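A minimal illustration of that behavior (my own sketch, separate from the Spark scripts):

```
REM Sketch: RUNNER deliberately gets two trailing spaces here.
set "RUNNER=java  "
REM Quoted, the trailing spaces stay part of the command name and it is
REM not recognized (as observed above):
"%RUNNER%" -version
REM Unquoted, cmd re-tokenizes on whitespace and runs java normally
REM (assuming java is on the PATH):
%RUNNER% -version
```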
# Annex: `pyspark` output from *1st* run (not 2nd run)
(py39) C:\Users\User.Name>pyspark
Python 3.9.17 (main, Jul 5 2023, 21:22:06) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Users/User.Name/anaconda3/pkgs/pyspark-3.2.1-py39haa95532_0/Lib/site-packages/pyspark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
23/07/07 17:49:58 WARN Shell: Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:548)
at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:569)
at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:592)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:689)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.conf.Configuration.getTimeDurationHelper(Configuration.java:1886)
at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1846)
at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1819)
at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183)
util.ShutdownHookManager$HookEntry.<init>(ShutdownHookManager.java:207)
at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:304)
at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:181)
at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58)
at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala)
at org.apache.spark.util.Utils$.createTempDir(Utils.scala:335)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:344)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:468)
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:439)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:516)
... 22 more
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/07 17:50:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.1
/_/
Using Python version 3.9.17 (main, Jul 5 2023 21:22:06)
Spark context Web UI available at http://HOST-NAME:4040
Spark context available as 'sc' (master = local[*], app id = local-1688766602995).
SparkSession available as 'spark'.
>>> 23/07/07 17:50:17 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
Some of these messages may be innocuous. I found *some* of them also at
[this page](https://medium.com/@divya.chandana/easy-install-pyspark-in-anaconda-e2d427b3492f)
about installing PySpark in Anaconda (specifically step 4, "Test Spark Installation"):
* That page also had the messages about illegal reflective access
* It did not have my long stack trace due to the file-not-found exception pertaining to `HADOOP_HOME` being unset
* It did, however, have the same message "Unable to load native-hadoop library"
* It didn't have the final warning "ProcfsMetricsGetter: Exception when trying to compute pagesize"
After the passage of time and switching to another location and Wi-Fi network, I got the following further messages:
23/07/07 19:25:30 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:25:40 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:25:50 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:26:00 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
# Answer 1

**Score:** 1
You should not need to do

set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages

and I would even advise against it. If `pyspark` uses the correct Python executable, then it should also use the correct `site-packages`. Additionally, `pkgs` is the wrong location to point to anyway. Briefly, `conda` downloads and extracts a package to `pkgs` and then actually "installs" it into your env's directory structure, usually by creating a link (in which case both locations would share the same files), but not necessarily, and you shouldn't rely on it.
For the `HADOOP_HOME` issue, you need to set that environment variable as well, pointing it at the Hadoop home directory, i.e., the directory whose `bin` subfolder contains `winutils.exe`. From https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems :
> You can fix this problem in two ways
>
> 1. Install a full native windows Hadoop version. The ASF does not currently (September 2015) release such a version; releases are available externally.
> 2. Or: get the `WINUTILS.EXE` binary from a Hadoop redistribution. There is a repository of this for some Hadoop versions on [github](https://github.com/steveloughran/winutils).
>
> Then
>
> 1. Set the environment variable `%HADOOP_HOME%` to point to the directory above the BIN dir containing `WINUTILS.EXE`.
> 2. Or: run the Java process with the system property `hadoop.home.dir` set to the home directory.
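A minimal sketch of the first "Then" step at a Conda/CMD prompt, assuming `winutils.exe` was unpacked to a hypothetical folder `C:\hadoop\2.7.1\bin` (the path is only an example, not a requirement):

```
REM Sketch only: C:\hadoop\2.7.1 is a hypothetical unpack location.
REM HADOOP_HOME must point to the folder ABOVE bin, not to bin itself.
set "HADOOP_HOME=C:\hadoop\2.7.1"
REM Optional: also expose winutils.exe on the PATH.
set "PATH=%PATH%;%HADOOP_HOME%\bin"
pyspark
```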
# Answer 2

**Score:** 0
FlyingTeller's answer was definitely "a must know" in order to get
PySpark working. For the error message that is the subject line of this question,
however, the cause is more insidious and obscure (so much so that it's
no wonder I haven't found anything about it online).
# Tracking down the cause: Trailing spaces in CMD variable assignments
I found the cause while trying to follow up on FlyingTeller's helpful advice on installing Hadoop and WinUtils. After obtaining these from the GitHub site that he recommended, I had to set the following three variables at the Conda shell prompt, then issue `pyspark`:
> set PYSPARK_DRIVER_PYTHON=python
> set PYSPARK_PYTHON=python
> set HADOOP_HOME=c:\Users\User.Name\AppData\Local\Hadoop\2.7.1
> pyspark
The `2.7.1` folder above contains the `bin` folder that contains `winutils.exe`. I had captured the exact text above in a manual journal text file, including indents and ">" prompts. I used Vim's blockwise Visual mode (Ctrl+V) to select the four lines into the system clipboard, starting from the first "s". Blockwise Visual mode causes the shorter lines to be right-padded with spaces so that they are the same length as the longest (3rd) line. All these trailing spaces become part of the strings that are assigned to PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON above. Somewhere in the nesting of scripts, these trailing spaces result in a series of (likely unintended) expansions that end in the error message "] was unexpected at this time".

The error doesn't occur when the variables are set to the strings without trailing spaces. It hadn't even occurred to me to look at trailing spaces as a possible cause of the error, since I'm more used to Bash, where trailing whitespace on an assignment line is not included in the value.
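Here is a minimal sketch that reproduces the parse failure outside of PySpark entirely. The `if` construct mirrors the one quoted from `pyspark2.cmd` in the question, which is the most likely place where the padded values break parsing; save it as a small `.cmd` file and note that the two `set` lines deliberately end with trailing spaces:

```
@echo off
REM repro.cmd -- minimal sketch, not part of any Spark script.
REM The next two set lines end with four trailing spaces (invisible here),
REM exactly what blockwise-Visual-mode padding produces.
set PYSPARK_DRIVER_PYTHON=python    
set PYSPARK_PYTHON=python    
REM Surrounding brackets make the trailing spaces visible:
echo [%PYSPARK_PYTHON%]
REM pyspark2.cmd contains a comparison of this shape. With the padded value it
REM expands to "if not [python    ] == []", and cmd aborts parsing the whole
REM block with: ] was unexpected at this time.
if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
    set PYSPARK_DRIVER_PYTHON=python
    if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON%
)
```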
# The solution
**In Vim, use Visual mode:** In Vim, trailing spaces can be avoided in the selection process by using Visual mode (V) instead of blockwise Visual mode. When selecting multiple lines, however, that would capture the prompt symbols ">" and the indentation that prefixes each line. To avoid this, I would have to capture commands in my text-file journal without the ">" prompts. Thus, context is sacrificed, which can be confusing when showing a series of commands across different environments.
In the above code, for example, initial commands issued at the
CMD/Conda prompt culminate in the invocation of PySpark, which might
be followed by commands issued at a resulting PySpark/Python prompt.
Capturing the prompt character with the series of commands to be
issued gives a clear indication that the interpreter changes partway
through the list of commands.
**Copy each command separately, avoiding trailing spaces:** Another alternative to using Vim's blockwise Visual mode to copy commands is to Vim-"yank" each line into the system clipboard, from the 1st letter to the end of the line. It is done for each line individually, so it quickly gets tedious.
**Use quotes around CMD variable assignments:** The best solution that I found was to use quotes around the variable assignments so that trailing spaces are ignored:
> set "PYSPARK_DRIVER_PYTHON=python"
> set "PYSPARK_PYTHON=python"
> set "HADOOP_HOME=c:\Users\User.Name\AppDta\Local\Hadoop\2.7.1"
> pyspark
With this syntax, the closing quote marks the end of the value, so padding spaces after it are not included in the assignment. This way, Vim's blockwise Visual mode can still be used to copy multiple lines into the clipboard while avoiding prompt characters and indentation spaces.
# <a name="ElimRestErrs"></a>Eliminating the rest of the errors
## WARN NativeCodeLoader: Unable to load native-hadoop library
The following links indicate that this is an innocuous warning,
with the perplexing caveat that I/O on Windows is only guaranteed
to be correct if things are built from scratch. As I am not a
developer, I felt that I wasn't prepared to take that on.
- https://sparkbyexamples.com/hadoop/hadoop-unable-to-load-native-hadoop-library-for-your-platform-warning/?expand_article=1
- https://lists.apache.org/thread/hk0rs1gwjyv5x89890f1hy5brpdc4v8r
Apparently, warnings can be suppressed via log4j, but I didn't explore that.
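For completeness, the usual way to do that (a sketch I did not test) is to create a `log4j.properties` under `%SPARK_HOME%\conf` and raise the log level for that one class, e.g.:

```
# Sketch (untested here): %SPARK_HOME%\conf\log4j.properties
# Minimal log4j 1.x config: console output at WARN (matching the Spark shell's
# default behavior), with the NativeCodeLoader warning silenced.
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
```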
## The exception for non-existence of "file:/tmp/spark-events"
To solve this, I created the following folders:
- %SPARK_HOME%\conf
- C:%HOMEPATH%\anaconda3\envs\py39\PySparkLogs
I then created the following file:
# %SPARK_HOME%/conf/spark-defaults.conf
#--------------------------------------
spark.eventLog.enabled true
spark.eventLog.dir C:\\User\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.history.fs.logDirectory C:\\User\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
Sources of information from which the above solution was synthesized:
- https://sparkbyexamples.com/pyspark/how-to-install-and-run-pyspark-on-windows/?expand_article=1
- https://spark.apache.org/docs/latest/configuration.html
- https://www.youtube.com/watch?v=KeEcWFRBqnU
Note that `%SPARK_HOME%` was not set at the Conda prompt. I relied on partially successful attempts at getting to the PySpark prompt in the past few weeks, then querying and recording the environment variable `SPARK_HOME`:
>>> print(os.environ.get("SPARK_HOME"))
C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark
I definitely didn't want to be explicitly setting unnecessary variables because it might be incompatible with how package installers would set them. FlyingTeller's correction on my setting of `PYTHONPATH` made this abundantly clear.
## Cannot run program ... winutils.exe: Access denied
While the above eliminated the exception for the non-existence of "file:/tmp/spark-events", it didn't succeed in entering PySpark, apparently due to an error in initializing "SparkContext", in turn due to the inability to run `winutils.exe`. The cause was access denial, but to what was unclear:
23/07/28 14:01:12 ERROR SparkContext: Error initializing SparkContext.
java.io.IOException: Cannot run program "C:\Users\User.Name\AppData\Local\Hadoop\2.7.1\bin\winutils.exe": CreateProcess error=5, Access is denied
Using Cygwin's Bash, I found that none of the files `c:\Users\User.Name\AppData\Local\Hadoop\2.7.1\bin\*.exe` had execute permission, so I fixed that (a native-Windows alternative is sketched after the session output below). This eliminated all of the errors and warnings that had me concerned:
# Using Cygwin's Bash
$ chmod u+x /c/Users/User.Name/AppData/Local/Hadoop/2.7.1/bin/*.exe
REM At the Conda prompt
(py39) C:\Users\User.Name> set "PYSPARK_DRIVER_PYTHON=python"
(py39) C:\Users\User.Name> set "PYSPARK_PYTHON=python"
(py39) C:\Users\User.Name> set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
(py39) C:\Users\User.Name> pyspark
Python 3.9.17 (main, Jul 5 2023, 21:22:06) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Users/User.Name/anaconda3/envs/py39/Lib/site-packages/pyspark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/28 15:05:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.1
/_/
Using Python version 3.9.17 (main, Jul 5 2023 21:22:06)
Spark context Web UI available at http://Laptop-Hostname:4040
Spark context available as 'sc' (master = local[*], app id = local-1690571121168).
SparkSession available as 'spark'.
>>> 23/07/28 15:05:35 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
>>>
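For reference, a native-Windows alternative to the Cygwin `chmod` step, untested in this particular setup, would be to grant read-and-execute on the same files with the built-in `icacls` tool (this sketch assumes a username without spaces, so `%USERNAME%` needs no extra quoting):

```
REM Sketch only (not what I ran): grant Read & Execute to the current user
REM on the winutils binaries, using icacls instead of Cygwin chmod.
icacls "C:\Users\User.Name\AppData\Local\Hadoop\2.7.1\bin\*.exe" /grant %USERNAME%:RX
```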
## ProcfsMetricsGetter: Exception when trying to compute pagesize
This is the final line in the output immediately above. On Windows, it is "expected behaviour" (as they say). It is innocuous, according to Wing Yew Poon (who I would probably know if I was a developer) and quoted here in this answer.