pyspark on Anaconda: ] was unexpected at this time

# Question
I am following [this page](https://sparkbyexamples.com/pyspark/install-pyspark-in-anaconda-jupyter-notebook) to install PySpark in Anaconda on Windows 10. In step #6 for validating PySpark, Python [could not be found](https://stackoverflow.com/questions/73823644). I found that [this answer](https://stackoverflow.com/a/76552980/2153235) initially helped me progress to the point of seeing the PySpark banner. Here is my adaptation of the solution in the form of commands issued at the Anaconda prompt (not the Anaconda Powershell prompt):
set PYSPARK_DRIVER_PYTHON=python
set PYSPARK_PYTHON=python
# set PYTHONPATH=C:\Users\<user>\anaconda3\pkgs\pyspark-3.4.0-pyhd8ed1ab_0\site-packages
set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages
pyspark
As shown above, the PYTHONPATH had to be modified to match the folder tree in my own installation. Essentially, I searched for a folder in `c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0` named `site-packages`. I assume that the PySpark version was selected by Conda during installation to satisfy package dependencies in the current `py39` environment, which contains Python 3.9. I use this version for compatibility with others.
PySpark ran for the *1st time* after this, but with many, many errors (see Annex below). As I am new to Python, Anaconda, and PySpark, I find the errors to be confusing to say the least. As shown in the Annex, however, I did get the Spark banner and the Python prompt.
As my very first step to troubleshooting the errors, I tried closing and reopening the Conda prompt window. However, the error from this *2nd run* of `pyspark` was *different* -- and equally confusing.
**pyspark output from *2nd* run:**
set PYSPARK_DRIVER_PYTHON=python
set PYSPARK_PYTHON=python
set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages
pyspark
] was unexpected at this time.
To trace the cause of this different error message, I searched for the file that is executed when I issue `pyspark`. Here are the candidate files:
where pyspark
C:\Users\User.Name\anaconda3\envs\py39\Scripts\pyspark
C:\Users\User.Name\anaconda3\envs\py39\Scripts\pyspark.cmd
I noted that the 1st script `pyspark` is a *Bash* script, so it's not surprising that "] was unexpected at this time." I assumed that the 2nd script `pyspark.cmd` is for invocation from Windows's CMD interpreter, of which the Conda prompt is a customization, e.g., by setting certain environment variables. Therefore, I ran `pyspark.cmd`, but it generated the same error "] was unexpected at this time." Apart from `@echo off`, the only command in `pyspark.cmd` is `cmd /V /E /C ""%~dp0pyspark2.cmd" %*"`, which is indecipherable to me.
***It seems odd that the Bash script `pyspark` is set up to run in a Conda environment on Windows. Is this caused by a fundamental nonsensicality in setting the 3 environment variables above prior to running `pyspark`?***
*And why would running `pyspark.cmd` generate the same error as running the Bash script?*
# Troubleshooting
I tracked the 2nd error message down to `C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts\pyspark2.cmd`. It is invoked by `pyspark.cmd` and also generates the unexpected `]` error:
cd C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts
pyspark2.cmd
] was unexpected at this time.
To locate the problematic statement, I manually issued each command in `pyspark2.cmd` but did *not* get the same error. Apart from REM statements, here is `pyspark2.cmd`:
REM `C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts\pyspark2.cmd`
REM -------------------------------------------------------------
@echo off
rem Figure out where the Spark framework is installed
call "%~dp0find-spark-home.cmd"
call "%SPARK_HOME%\bin\load-spark-env.cmd"
set _SPARK_CMD_USAGE=Usage: bin\pyspark.cmd [options]
rem Figure out which Python to use.
if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
set PYSPARK_DRIVER_PYTHON=python
if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON%
)
set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH%
set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py
call "%SPARK_HOME%\bin\spark-submit2.cmd" pyspark-shell-main --name "PySparkShell" %*
Here is my walkthrough of the above commands, slightly modified to account for the fact that they are executing at an interactive prompt rather than from within a script file:
REM ~/tmp/tmp.cmd mirrors pyspark2.cmd
REM ----------------------------------
REM Note that %SPARK_HOME%==
REM "c:\Users\%USERNAME%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages\pyspark"
cd C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts
call "find-spark-home.cmd"
call "%SPARK_HOME%\bin\load-spark-env.cmd"
set _SPARK_CMD_USAGE=Usage: bin\pyspark.cmd [options]
rem Figure out which Python to use.
REM Manually skipped this cuz %PYSPARK_DRIVER_PYTHON%=="python"
if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
set PYSPARK_DRIVER_PYTHON=python
if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON%
)
REM Manually skipped these two cuz they already prefix %PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH%
set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py
call "%SPARK_HOME%\bin\spark-submit2.cmd" pyspark-shell-main --name "PySparkShell" %*
The final statement above generates the following error:
Error: pyspark does not support any application options.
It is odd that `pyspark2.cmd` generates the unexpected `]` error while manually running each statement generates the above "application options" error.
# Update 2023-07-19
Over the past week, I have *sometimes* been able to get the Spark prompt shown in the Annex below. Other times, I get the dreaded `] was unexpected at this time.` It doesn't matter whether or not I start from a virgin Anaconda prompt. For both outcomes (Spark prompt vs. "unexpected ]"), the series of commands are:
(base) C:\Users\User.Name> conda activate py39
(py39) C:\Users\User.Name> set PYSPARK_DRIVER_PYTHON=python
(py39) C:\Users\User.Name> set PYSPARK_PYTHON=python
(py39) C:\Users\User.Name> set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages
(py39) C:\Users\User.Name> pyspark
# Update 2023-07-22
Due to the unrepeatable outcomes of issuing `pyspark`, I returned to troubleshooting by issuing each command in each invoked script. Careful bookkeeping was needed to keep track of the arguments `%*` in each script. The order of invocation is:
* `pyspark.cmd` calls `pyspark2.cmd`
* `pyspark2.cmd` calls `spark-submit2.cmd`
* `spark-submit2.cmd` executes `java`
The final `java` command is:
(py39) C:\Users\User.Name\anaconda3\envs\py39\Scripts> ^
"%RUNNER%" -Xmx128m ^
-cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main ^
org.apache.spark.deploy.SparkSubmit pyspark-shell-main ^
--name "PySparkShell" > %LAUNCHER_OUTPUT%
It generates the class-not-found error:
Error: Could not find or load main class org.apache.spark.launcher.Main
Caused by: java.lang.ClassNotFoundException: org.apache.spark.launcher.Main
Here are the environment variables:
%RUNNER% = java
%LAUNCH_CLASSPATH% = c:\Users\User.Name\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages\pyspark\jars\*
%LAUNCHER_OUTPUT% = C:\Users\User.Name\AppData\Local\Temp\spark-class-launcher-output-22633.txt
The RUNNER variable actually has two trailing spaces, and the quoted "%RUNNER%" invocation causes "java " to be unrecognized, so I removed the quotes.
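A minimal illustration of that behavior (my own sketch, separate from the Spark scripts):

```
REM Sketch: RUNNER deliberately gets two trailing spaces here.
set "RUNNER=java  "
REM Quoted, the trailing spaces stay part of the command name and it is
REM not recognized (as observed above):
"%RUNNER%" -version
REM Unquoted, cmd re-tokenizes on whitespace and runs java normally
REM (assuming java is on the PATH):
%RUNNER% -version
```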
# Annex: `pyspark` output from *1st* run (not 2nd run)
(py39) C:\Users\User.Name>pyspark
Python 3.9.17 (main, Jul 5 2023, 21:22:06) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Users/User.Name/anaconda3/pkgs/pyspark-3.2.1-py39haa95532_0/Lib/site-packages/pyspark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
23/07/07 17:49:58 WARN Shell: Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:548)
at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:569)
at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:592)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:689)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.conf.Configuration.getTimeDurationHelper(Configuration.java:1886)
at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1846)
at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1819)
at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183)
util.ShutdownHookManager$HookEntry.<init>(ShutdownHookManager.java:207)
at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:304)
at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:181)
at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58)
at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala)
at org.apache.spark.util.Utils$.createTempDir(Utils.scala:335)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:344)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:468)
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:439)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:516)
... 22 more
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/07 17:50:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.1
/_/
Using Python version 3.9.17 (main, Jul 5 2023 21:22:06)
Spark context Web UI available at http://HOST-NAME:4040
Spark context available as 'sc' (master = local[*], app id = local-1688766602995).
SparkSession available as 'spark'.
>>> 23/07/07 17:50:17 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
Some of these messages may be innocuous. I found *some* of them also at
[this page](https://medium.com/@divya.chandana/easy-install-pyspark-in-anaconda-e2d427b3492f)
about installing PySpark in Anaconda (specifically step 4, "Test Spark Installation"):
* That page also had the messages about illegal reflective access
* It did not have my long stack trace due to the file-not-found exception pertaining to `HADOOP_HOME` being unset
* It did, however, have the same message "Unable to load native-hadoop library"
* It didn't have the final warning "ProcfsMetricsGetter: Exception when trying to compute pagesize"
After the passage of time and switching to another location and Wi-Fi network, I got the following further messages:
23/07/07 19:25:30 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:25:40 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:25:50 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:26:00 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
# Answer 1

**Score:** 1
You should not need to do

set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages

and I would even advise against it. If `pyspark` uses the correct Python executable, then it should also use the correct `site-packages`. Additionally, `pkgs` is the wrong location to point to anyway. Briefly, `conda` downloads and extracts a package to `pkgs` and then actually "installs" it into your env's directory structure, usually by creating a link (in which case both locations would share the same files), but not necessarily, and you shouldn't rely on it.
For the `HADOOP_HOME` issue, you need to set that environment variable as well, pointing it at the Hadoop home directory, i.e., the directory whose `bin` subfolder contains `winutils.exe`. From https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems :
> You can fix this problem in two ways
>
> 1. Install a full native windows Hadoop version. The ASF does not currently (September 2015) release such a version; releases are available externally.
> 2. Or: get the `WINUTILS.EXE` binary from a Hadoop redistribution. There is a repository of this for some Hadoop versions on [github](https://github.com/steveloughran/winutils).
>
> Then
>
> 1. Set the environment variable `%HADOOP_HOME%` to point to the directory above the BIN dir containing `WINUTILS.EXE`.
> 2. Or: run the Java process with the system property `hadoop.home.dir` set to the home directory.
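A minimal sketch of the first "Then" step at a Conda/CMD prompt, assuming `winutils.exe` was unpacked to a hypothetical folder `C:\hadoop\2.7.1\bin` (the path is only an example, not a requirement):

```
REM Sketch only: C:\hadoop\2.7.1 is a hypothetical unpack location.
REM HADOOP_HOME must point to the folder ABOVE bin, not to bin itself.
set "HADOOP_HOME=C:\hadoop\2.7.1"
REM Optional: also expose winutils.exe on the PATH.
set "PATH=%PATH%;%HADOOP_HOME%\bin"
pyspark
```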
# Answer 2

**Score:** 0
FlyingTeller's answer was definitely "a must know" in order to get
PySpark working. For the error message that is the subject line of this question,
however, the cause is more insidious and obscure (so much so that it's
no wonder I haven't found anything about it online).
# Tracking down the cause: Trailing spaces in CMD variable assignments
I found the cause while trying to follow up on FlyingTeller's helpful advice on installing Hadoop and WinUtils. After obtaining these from the GitHub site that he recommended, I had to set the following three variables at the Conda shell prompt, then issue `pyspark`:
> set PYSPARK_DRIVER_PYTHON=python
> set PYSPARK_PYTHON=python
> set HADOOP_HOME=c:\Users\User.Name\AppData\Local\Hadoop\2.7.1
> pyspark
The `2.7.1` folder above contains the `bin` folder that contains `winutils.exe`. I had captured the exact text above in a manual journal text file, including indents and ">" prompts. I used Vim's blockwise Visual mode (Ctrl+V) to select the four lines into the system clipboard, starting from the first "s". Blockwise Visual mode causes the shorter lines to be right-padded with spaces so that they are the same length as the longest (3rd) line. All these trailing spaces become part of the strings that are assigned to PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON above. Somewhere in the nesting of scripts, these trailing spaces result in a series of (likely unintended) expansions that end in the error message "] was unexpected at this time".

The error doesn't occur when the variables are set to the strings without trailing spaces. It hadn't even occurred to me to look at trailing spaces as a possible cause of the error, since I'm more used to Bash, where trailing whitespace on an assignment line is not included in the value.
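Here is a minimal sketch that reproduces the parse failure outside of PySpark entirely. The `if` construct mirrors the one quoted from `pyspark2.cmd` in the question, which is the most likely place where the padded values break parsing; save it as a small `.cmd` file and note that the two `set` lines deliberately end with trailing spaces:

```
@echo off
REM repro.cmd -- minimal sketch, not part of any Spark script.
REM The next two set lines end with four trailing spaces (invisible here),
REM exactly what blockwise-Visual-mode padding produces.
set PYSPARK_DRIVER_PYTHON=python    
set PYSPARK_PYTHON=python    
REM Surrounding brackets make the trailing spaces visible:
echo [%PYSPARK_PYTHON%]
REM pyspark2.cmd contains a comparison of this shape. With the padded value it
REM expands to "if not [python    ] == []", and cmd aborts parsing the whole
REM block with: ] was unexpected at this time.
if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
    set PYSPARK_DRIVER_PYTHON=python
    if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON%
)
```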
# The solution
**In Vim, use Visual mode:** In Vim, trailing spaces can be avoided in the selection process by using Visual mode (V) instead of blockwise Visual mode. When selecting multiple lines, however, that would capture the prompt symbols ">" and the indentation that prefixes each line. To avoid this, I would have to capture commands in my text-file journal without the ">" prompts. Thus, context is sacrificed, which can be confusing when showing a series of commands across different environments.
In the above code, for example, initial commands issued at the
CMD/Conda prompt culminate in the invocation of PySpark, which might
be followed by commands issued at a resulting PySpark/Python prompt.
Capturing the prompt character with the series of commands to be
issued gives a clear indication that the interpreter changes partway
through the list of commands.
**Copy each command separately, avoiding trailing spaces:** Another alternative to using Vim's blockwise Visual mode to copy commands is to Vim-"yank" each line into the system clipboard, from the 1st letter to the end of the line. It is done for each line individually, so it quickly gets tedious.
**Use quotes around CMD variable assignments:** The best solution that I found was to use quotes around the variable assignments so that trailing spaces are ignored:
> set "PYSPARK_DRIVER_PYTHON=python"
> set "PYSPARK_PYTHON=python"
> set "HADOOP_HOME=c:\Users\User.Name\AppDta\Local\Hadoop\2.7.1"
> pyspark
With this syntax, the closing quote marks the end of the value, so padding spaces after it are not included in the assignment. This way, Vim's blockwise Visual mode can still be used to copy multiple lines into the clipboard while avoiding prompt characters and indentation spaces.
# <a name="ElimRestErrs"></a>Eliminating the rest of the errors
## WARN NativeCodeLoader: Unable to load native-hadoop library
The following links indicate that this is an innocuous warning,
with the perplexing caveat that I/O on Windows is only guaranteed
to be correct if things are built from scratch. As I am not a
developer, I felt that I wasn't prepared to take that on.
- https://sparkbyexamples.com/hadoop/hadoop-unable-to-load-native-hadoop-library-for-your-platform-warning/?expand_article=1
- https://lists.apache.org/thread/hk0rs1gwjyv5x89890f1hy5brpdc4v8r
Apparently, warnings can be suppressed via log4j, but I didn't explore that.
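For completeness, the usual way to do that (a sketch I did not test) is to create a `log4j.properties` under `%SPARK_HOME%\conf` and raise the log level for that one class, e.g.:

```
# Sketch (untested here): %SPARK_HOME%\conf\log4j.properties
# Minimal log4j 1.x config: console output at WARN (matching the Spark shell's
# default behavior), with the NativeCodeLoader warning silenced.
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
```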
## The exception for non-existence of "file:/tmp/spark-events"
To solve this, I created the following folders:
- %SPARK_HOME%\conf
- C:%HOMEPATH%\anaconda3\envs\py39\PySparkLogs
I then created the following file:
# %SPARK_HOME%/conf/spark-defaults.conf
#--------------------------------------
spark.eventLog.enabled true
spark.eventLog.dir C:\\User\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.history.fs.logDirectory C:\\User\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
Sources of information from which the above solution was synthesized:
- https://sparkbyexamples.com/pyspark/how-to-install-and-run-pyspark-on-windows/?expand_article=1
- https://spark.apache.org/docs/latest/configuration.html
- https://www.youtube.com/watch?v=KeEcWFRBqnU
Note that `%SPARK_HOME%` was not set at the Conda prompt. I relied on partially successful attempts at getting to the PySpark prompt in the past few weeks, then querying and recording the environment variable `SPARK_HOME`:
>>> print(os.environ.get("SPARK_HOME"))
C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark
I definitely didn't want to be explicitly setting unnecessary variables because it might be incompatible with how package installers would set them. FlyingTeller's correction on my setting of `PYTHONPATH` made this abundantly clear.
## Cannot run program ... winutils.exe: Access denied
While the above eliminated the exception for the non-existence of "file:/tmp/spark-events", it didn't succeed in entering PySpark, apparently due to an error in initializing "SparkContext", in turn due to the inability to run `winutils.exe`. The cause was access denial, but to what was unclear:
23/07/28 14:01:12 ERROR SparkContext: Error initializing SparkContext.
java.io.IOException: Cannot run program "C:\Users\User.Name\AppData\Local\Hadoop\2.7.1\bin\winutils.exe": CreateProcess error=5, Access is denied
Using Cygwin's Bash, I found that none of the files `c:\Users\User.Name\AppData\Local\Hadoop\2.7.1\bin\*.exe` had execute permission, so I fixed that (a native-Windows alternative is sketched after the session output below). This eliminated all of the errors and warnings that had me concerned:
# Using Cygwin's Bash
$ chmod u+x /c/Users/User.Name/AppData/Local/Hadoop/2.7.1/bin/*.exe
REM At the Conda prompt
(py39) C:\Users\User.Name> set "PYSPARK_DRIVER_PYTHON=python"
(py39) C:\Users\User.Name> set "PYSPARK_PYTHON=python"
(py39) C:\Users\User.Name> set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
(py39) C:\Users\User.Name> pyspark
Python 3.9.17 (main, Jul 5 2023, 21:22:06) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Users/User.Name/anaconda3/envs/py39/Lib/site-packages/pyspark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/28 15:05:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.1
/_/
Using Python version 3.9.17 (main, Jul 5 2023 21:22:06)
Spark context Web UI available at http://Laptop-Hostname:4040
Spark context available as 'sc' (master = local[*], app id = local-1690571121168).
SparkSession available as 'spark'.
>>> 23/07/28 15:05:35 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
>>>
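For reference, a native-Windows alternative to the Cygwin `chmod` step, untested in this particular setup, would be to grant read-and-execute on the same files with the built-in `icacls` tool (this sketch assumes a username without spaces, so `%USERNAME%` needs no extra quoting):

```
REM Sketch only (not what I ran): grant Read & Execute to the current user
REM on the winutils binaries, using icacls instead of Cygwin chmod.
icacls "C:\Users\User.Name\AppData\Local\Hadoop\2.7.1\bin\*.exe" /grant %USERNAME%:RX
```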
## ProcfsMetricsGetter: Exception when trying to compute pagesize
This is the final line in the output immediately above. On Windows, it is "expected behaviour" (as they say). It is innocuous, according to Wing Yew Poon (who I would probably know if I was a developer) and quoted here in this answer.