How do I install a package (such as mmlspark) on a CDH cluster without network access?


Question


Because it is hard to reach maven.org from China, I cannot install mmlspark with

    pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 --repositories=https://mmlspark.azureedge.net/maven

This fails with:

    :::: ERRORS
    Server access error at url https://repo1.maven.org/maven2/com/microsoft/ml/lightgbm/lightgbmlib/2.3.100/lightgbmlib-2.3.100.pom (java.net.ConnectException: Connection timed out (Connection timed out))
    Server access error at url https://repo1.maven.org/maven2/com/microsoft/ml/lightgbm/lightgbmlib/2.3.100/lightgbmlib-2.3.100.jar (java.net.ConnectException: Connection timed out (Connection timed out))
    :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
    Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.microsoft.ml.lightgbm#lightgbmlib;2.3.100: not found, download failed: com.microsoft.ml.spark#mmlspark_2.11;1.0.0-rc1!mmlspark_2.11.jar, download failed: org.scalatest#scalatest_2.11;3.0.5!scalatest_2.11.jar(bundle), download failed: com.microsoft.cntk#cntk;2.4!cntk.jar, download failed: org.openpnp#opencv;3.2.0-1!opencv.jar(bundle)]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1308)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:315)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    [TerminalIPythonApp] WARNING | Unknown error in handling PYTHONSTARTUP file /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/pyspark/shell.py:

Trying a manual installation

I have an Amazon EC2 instance that can reach maven.org, so I downloaded all the packages there and copied them to the local CDH cluster, under /opt/cloudera/parcels/CDH/lib/spark/jars/mmlspark_jars/.
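For reference, that download step can be scripted. Below is a minimal Python 3 sketch, assuming the standard Maven repository layout (the same layout visible in the error URLs above); the coordinate list is illustrative, not the complete dependency set, and the resulting directory would be copied to the cluster afterwards.

    # Run on the machine WITH internet access (e.g. the EC2 box).
    # Downloads each Maven coordinate via the standard repository layout,
    # naming files "<group>_<artifact>-<version>.jar" as used later in --jars.
    import os
    import urllib.request

    REPOS = [
        "https://mmlspark.azureedge.net/maven",  # hosts the mmlspark artifact itself
        "https://repo1.maven.org/maven2",        # hosts the transitive dependencies
    ]
    COORDS = [  # (group, artifact, version) -- extend with every unresolved dependency
        ("com.microsoft.ml.spark", "mmlspark_2.11", "1.0.0-rc1"),
        ("com.microsoft.ml.lightgbm", "lightgbmlib", "2.3.100"),
    ]

    out_dir = "mmlspark_jars"
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)

    for group, artifact, version in COORDS:
        rel = "{}/{}/{}/{}-{}.jar".format(group.replace(".", "/"),
                                          artifact, version, artifact, version)
        for repo in REPOS:
            try:
                urllib.request.urlretrieve(
                    repo + "/" + rel,
                    os.path.join(out_dir, "{}_{}-{}.jar".format(group, artifact, version)))
                break  # found it in this repository
            except Exception:
                continue  # try the next repository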

Then I set the following configuration:

First, in spark-defaults.conf:

    spark.driver.extraClassPath=/opt/cloudera/parcels/CDH/lib/spark/jars/mmlspark_jars/*
    spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/spark/jars/mmlspark_jars/*

Second, in spark-env.sh:

    export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/spark/jars/mmlspark_jars/*:$SPARK_CLASSPATH

I can see that the jars are loaded.


But import mmlspark still fails with ModuleNotFoundError: No module named 'mmlspark'.

With some effort

I found a way: extract mmlspark.jar, zip the mmlspark folder inside it, put the archive on HDFS (hdfs://test/mmlspark.zip), and load that .zip via py-files (--py-files hdfs://test/mmlspark.zip). That makes import mmlspark succeed.
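For concreteness, here is a minimal Python sketch of that extract-and-rezip step. The jar file name is an assumption taken from the --jars list below, and the archive is pushed to HDFS afterwards with hdfs dfs -put.

    # Pull the bundled "mmlspark" Python package out of the jar (a jar is just
    # a zip archive) and repack it as mmlspark.zip for use with --py-files.
    import os
    import zipfile

    jar_path = "com.microsoft.ml.spark_mmlspark_2.11-1.0.0-rc1.jar"  # assumed local copy

    with zipfile.ZipFile(jar_path) as jar:
        members = [n for n in jar.namelist() if n.startswith("mmlspark/")]
        jar.extractall("extracted", members=members)

    with zipfile.ZipFile("mmlspark.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk("extracted/mmlspark"):
            for name in files:
                full = os.path.join(root, name)
                # keep paths relative to "extracted" so the zip root is mmlspark/
                zf.write(full, os.path.relpath(full, "extracted"))

    # then: hdfs dfs -put mmlspark.zip hdfs://test/mmlspark.zip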

I start a pyspark shell with the jar dependencies and mmlspark.zip:

    pyspark --jars "/user/spark/mmlspark_jars/com.github.vowpalwabbit_vw-jni-8.7.0.3.jar,/user/spark/mmlspark_jars/com.jcraft_jsch-0.1.54.jar,/user/spark/mmlspark_jars/com.microsoft.cntk_cntk-2.4.jar,/user/spark/mmlspark_jars/com.microsoft.ml.lightgbm_lightgbmlib-2.3.100.jar,/user/spark/mmlspark_jars/com.microsoft.ml.spark_mmlspark_2.11-1.0.0-rc1.jar,/user/spark/mmlspark_jars/commons-codec_commons-codec-1.10.jar,/user/spark/mmlspark_jars/commons-logging_commons-logging-1.2.jar,/user/spark/mmlspark_jars/io.spray_spray-json_2.11-1.3.2.jar,/user/spark/mmlspark_jars/org.apache.httpcomponents_httpclient-4.5.6.jar,/user/spark/mmlspark_jars/org.apache.httpcomponents_httpcore-4.4.10.jar,/user/spark/mmlspark_jars/org.openpnp_opencv-3.2.0-1.jar,/user/spark/mmlspark_jars/org.scala-lang.modules_scala-xml_2.11-1.0.6.jar,/user/spark/mmlspark_jars/org.scala-lang_scala-reflect-2.11.12.jar,/user/spark/mmlspark_jars/org.scalactic_scalactic_2.11-3.0.5.jar,/user/spark/mmlspark_jars/org.scalatest_scalatest_2.11-3.0.5.jar" --py-files hdfs://test/mmlspark.zip
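As an aside, that long --jars value need not be typed by hand; a small helper can generate it (a sketch, assuming the jars sit under /user/spark/mmlspark_jars/ on the local filesystem):

    # Build the comma-separated --jars argument from every jar in the directory.
    import glob
    print(",".join(sorted(glob.glob("/user/spark/mmlspark_jars/*.jar"))))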

Test code

    from sklearn import datasets
    import numpy as np   # used below; missing from the original snippet
    import pandas as pd  # used below; missing from the original snippet

    iris = datasets.load_iris()
    X = iris.data
    Y = iris.target
    df = np.column_stack([X, Y])
    df = pd.DataFrame(df)
    df.columns = ['f1', 'f2', 'f3', 'f4', 'label']
    feature_cols = ['f1', 'f2', 'f3', 'f4']
    df = spark.createDataFrame(df)

    from pyspark.ml.feature import VectorAssembler
    vec_assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
    df1 = vec_assembler.transform(df)

    from mmlspark.lightgbm import LightGBMRegressor
    model = LightGBMRegressor(objective='quantile',
                              alpha=0.2,
                              learningRate=0.3,
                              numLeaves=31,
                              featuresCol='features',
                              labelCol='label').fit(df1)
This fails with:

    ---------------------------------------------------------------------------
    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-55-fe341b86ea18> in <module>
         18           numLeaves=31,
         19           featuresCol='features',
    ---> 20           labelCol='label').fit(df1)

    /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
        130                 return self.copy(params)._fit(dataset)
        131             else:
    --> 132                 return self._fit(dataset)
        133         else:
        134             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

    /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _fit(self, dataset)
        293
        294     def _fit(self, dataset):
    --> 295         java_model = self._fit_java(dataset)
        296         model = self._create_model(java_model)
        297         return self._copyValues(model)

    /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _fit_java(self, dataset)
        289         :return: fitted Java model
        290         """
    --> 291         self._transfer_params_to_java()
        292         return self._java_obj.fit(dataset._jdf)
        293

    /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _transfer_params_to_java(self)
        125                 self._java_obj.set(pair)
        126             if self.hasDefault(param):
    --> 127                 pair = self._make_java_param_pair(param, self._defaultParamMap[param])
        128                 pair_defaults.append(pair)
        129         if len(pair_defaults) > 0:

    /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _make_java_param_pair(self, param, value)
        111         sc = SparkContext._active_spark_context
        112         param = self._resolveParam(param)
    --> 113         java_param = self._java_obj.getParam(param.name)
        114         java_value = _py2java(sc, value)
        115         return java_param.w(java_value)

    /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
       1255         answer = self.gateway_client.send_command(command)
       1256         return_value = get_return_value(
    -> 1257             answer, self.gateway_client, self.target_id, self.name)
       1258
       1259         for temp_arg in temp_args:

    /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
         61     def deco(*a, **kw):
         62         try:
    ---> 63             return f(*a, **kw)
         64         except py4j.protocol.Py4JJavaError as e:
         65             s = e.java_exception.toString()

    /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        326                 raise Py4JJavaError(
        327                     "An error occurred while calling {0}{1}{2}.\n".
    --> 328                     format(target_id, ".", name), value)
        329             else:
        330                 raise Py4JError(

    Py4JJavaError: An error occurred while calling o1298.getParam.
    : java.util.NoSuchElementException: Param metric does not exist.
        at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
        at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.ml.param.Params$class.getParam(params.scala:728)
        at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42)
        at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

I think this error occurs because the mmlspark Python side cannot load the jar, which causes the Py4JJavaError. But I have no idea what else to try; I have done everything I know.

Answer 1

Score: 0

Finally I got around it.
The key is to pass the .jar files to pyFiles; it surprised me that Python can read a .jar.

bash:

    pyspark \
        --master yarn \
        --conf spark.submit.pyFiles=hdfs://pupuxdc/user/spark/mmlspark_jars/…… .jar \
        --conf spark.yarn.dist.jars=hdfs://pupuxdc/user/spark/mmlspark_jars/…… .jar

pyspark code:

    from pyspark.sql import SparkSession  # missing from the original snippet

    spark_builder = (
        SparkSession
        .builder
        .config("spark.port.maxRetries", 100)
        .appName(app_name))
    spark = spark_builder.getOrCreate()

    jar_files = [...]  # the mmlspark jar paths (elided in the original)
    for i in jar_files:
        # addPyFile ships each jar to the executors and adds it to the Python
        # import path; since a jar is just a zip, "import mmlspark" then works
        spark.sparkContext.addPyFile(i)

Notice that setting spark.submit.pyFiles via .config('spark.submit.pyFiles=hdfs://pupuxdc/user/spark/mmlspark_jars/…… .jar') does not take effect, presumably because spark.submit.* options are read by spark-submit at launch time, before the application code runs; addPyFile is what works.


Answer 2

Score: 0

I tried to comment, but I do not have enough reputation, so: for those using HDP, the same answer from Mithril applies.

Also, you are not required to upload the jar files to HDFS; I achieved the same result reading the jar files from local directories.

bash:

    pyspark \
        --master yarn \
        --py-files /<path>/.jar,/<path>/.jar,/<path>/.jar... \
        --jars /<path>/.jar,/<path>/.jar,/<path>/.jar...

It works in a Jupyter Notebook, too. Just include the following line right before starting the SparkSession.

    os.environ['PYSPARK_SUBMIT_ARGS'] = '--py-files /<path>/.jar,/<path>/.jar,/<path>/.jar... --jars /<path>/.jar,/<path>/.jar,/<path>/.jar...'
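A slightly fuller sketch of the notebook variant, with placeholder paths; note that when PySpark is started from a plain Python kernel, PYSPARK_SUBMIT_ARGS conventionally has to end with the pyspark-shell token:

    # Set the submit args BEFORE the JVM / SparkSession starts.
    import os

    jars = ",".join([
        "/path/to/com.microsoft.ml.spark_mmlspark_2.11-1.0.0-rc1.jar",  # placeholder paths
        "/path/to/com.microsoft.ml.lightgbm_lightgbmlib-2.3.100.jar",
    ])
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--py-files {0} --jars {0} pyspark-shell".format(jars)
    )

    from pyspark.sql import SparkSession  # import only after the env var is set
    spark = SparkSession.builder.master("yarn").appName("mmlspark-test").getOrCreate()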
