Installing PySpark on a Mac and Calling pyspark from Python in PyCharm

2017-10-13 19:17:14  Source: CSDN  Author: helloxiaozhe


Install PySpark on a Mac and call pyspark from Python in PyCharm. I work mostly in Python these days, so I wanted to install pySpark and call it from PyCharm.

With pyspark, Python can drive Spark directly on the local machine and run Spark programs.
This article goes through software download, installation, a first round of configuration, coding, a first run, a second round of configuration, and the final successful run. Without further ado, here is the process.

1. Download the packages:
jdk-8u131-macosx-x64.dmg
spark-2.1.0-bin-hadoop2.6.tgz
2. Install the Spark environment
(1) Install the JDK with the default settings.
(2) Unpack spark-2.1.0-bin-hadoop2.6.tgz and do the related configuration. Assume the target directory is /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6.
(3) Switch to /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/bin and run pyspark. If the installation succeeded, you will see a screen like the one below.
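A convenience this walkthrough implies but does not show: exporting SPARK_HOME in ~/.bash_profile so the pyspark launcher is always on PATH. A minimal sketch, assuming the unpack directory above and a JDK 8 install (adjust both to your machine):

```shell
# Append to ~/.bash_profile; paths are the ones assumed in this walkthrough.
export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"   # macOS helper that prints the JDK path
export SPARK_HOME="/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6"
export PATH="$SPARK_HOME/bin:$PATH"
```

After `source ~/.bash_profile`, pyspark can be started from any directory instead of only from the bin folder.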

3. Configure the package PyCharm needs to call pySpark
This step is about working around macOS permission problems to install a third-party Python package by hand. For PyCharm to call pySpark, the package must be importable: copy the pyspark folder under /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/ into /Library/Python/2.7/site-packages/ (note: that is my Python install path; yours may be C:/Anaconda2/Lib/site-packages or C:/Python27/Lib/site-packages).
(1) First locate the directory /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/, then copy the whole pyspark folder into /Library/Python/2.7/site-packages/:
localhost:python a6$ pwd
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python
localhost:python a6$ cd pyspark/
localhost:pyspark a6$ ls
__init__.py          __init__.pyc        accumulators.py     accumulators.pyc
broadcast.py         broadcast.pyc       cloudpickle.py      cloudpickle.pyc
conf.py              conf.pyc            context.py          context.pyc
daemon.py            files.py            files.pyc           find_spark_home.py
find_spark_home.pyc  heapq3.py           heapq3.pyc          java_gateway.py
java_gateway.pyc     join.py             join.pyc            ml
mllib                profiler.py         profiler.pyc        rdd.py
rdd.pyc              rddsampler.py       rddsampler.pyc      resultiterable.py
resultiterable.pyc   serializers.py      serializers.pyc     shell.py
shuffle.py           shuffle.pyc         sql                 statcounter.py
statcounter.pyc      status.py           status.pyc          storagelevel.py
storagelevel.pyc     streaming           tests.py            traceback_utils.py
traceback_utils.pyc  version.py          version.pyc         worker.py


(2) Locate the Python package directory, /Library/Python/2.7/site-packages/:
localhost:python a6$ python
Python 2.7.10 (default, Feb  7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.path
['', '/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg', '/Library/Python/2.7/site-packages/py4j-0.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/redis-2.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/MySQL_python-1.2.4-py2.7-macosx-10.12-intel.egg', '/Library/Python/2.7/site-packages/thrift-0.10.0-py2.7-macosx-10.12-intel.egg', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Library/Python/2.7/site-packages', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
>>> exit()

This confirms the package directory: /Library/Python/2.7/site-packages/.
(3) Copy pyspark over. The plain cp hits permission errors, so work around them with sudo (and -rf for the subdirectories):
localhost:site-packages a6$ pwd
/Library/Python/2.7/site-packages
localhost:site-packages a6$ mkdir pyspark
mkdir: pyspark: Permission denied
localhost:site-packages a6$ sudo mkdir pyspark
Password:
localhost:pyspark a6$ pwd
/Library/Python/2.7/site-packages/pyspark
localhost:pyspark a6$ cp /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
cp: ./__init__.py: Permission denied
cp: ./__init__.pyc: Permission denied
cp: ./accumulators.py: Permission denied
cp: ./accumulators.pyc: Permission denied
cp: ./broadcast.py: Permission denied
cp: ./broadcast.pyc: Permission denied
…………
cp: ./join.pyc: Permission denied
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/ml is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/mllib is a directory (not copied).
cp: ./profiler.py: Permission denied
cp: ./profiler.pyc: Permission denied
cp: ./rdd.py: Permission denied
cp: ./rdd.pyc: Permission denied
cp: ./rddsampler.py: Permission denied
cp: ./rddsampler.pyc: Permission denied
cp: ./resultiterable.py: Permission denied
cp: ./resultiterable.pyc: Permission denied
cp: ./serializers.py: Permission denied
cp: ./serializers.pyc: Permission denied
localhost:pyspark a6$ sudo cp /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/ml is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/mllib is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/streaming is a directory (not copied).
localhost:pyspark a6$ sudo cp -rf /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
localhost:pyspark a6$ ls
(same file listing as in step (1) above)
localhost:pyspark a6$
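Instead of copying files into site-packages with sudo, an alternative is to point sys.path at the Spark distribution at runtime. This is a sketch, not the article's method; the SPARK_HOME path is the one assumed throughout this walkthrough, and the py4j zip glob matches the layout of Spark 2.x's python/lib directory:

```python
import glob
import os
import sys

# Hypothetical location -- point this at your own Spark unpack directory.
SPARK_HOME = "/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6"

def add_pyspark_to_path(spark_home):
    """Prepend Spark's bundled python dir (and any py4j zip) to sys.path."""
    python_dir = os.path.join(spark_home, "python")
    paths = [python_dir] + glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip"))
    for p in paths:
        if p not in sys.path:
            sys.path.insert(0, p)
    return paths

add_pyspark_to_path(SPARK_HOME)
# After this, `import pyspark` resolves against the Spark distribution itself,
# so nothing needs to be copied into /Library/Python/2.7/site-packages/.
```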

4. The Python code that drives pyspark:
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile('words.txt')
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print "%s: %i" % (word, count)
    sc.stop()


The contents of words.txt used by the code:
good bad cool hadoop spark mlib good spark mlib cool spark bad
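To see what the flatMap / map / reduceByKey pipeline computes, here is a plain-Python equivalent that needs no Spark at all, run on the words.txt content above:

```python
from collections import Counter

# The words.txt content shown above
text = "good bad cool hadoop spark mlib good spark mlib cool spark bad"

# flatMap(split) -> map(x, 1) -> reduceByKey(add), done in plain Python
counts = Counter()
for word in text.split(' '):   # flatMap: one token per element
    counts[word] += 1          # reduceByKey(add): sum the 1s per key

for word, count in counts.items():
    print("%s: %i" % (word, count))
```

The printed counts match the Spark run's output below (Spark's ordering of the pairs may differ).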


5. First run
(1) The first run fails with an error; more configuration is needed.
/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/a6/Downloads/PycharmProjects/test_use_hbase_by_thrift/test_python_local_use_spark.py
Could not find valid SPARK_HOME while searching ['/Users/a6/Downloads/PycharmProjects', '/Library/Python/2.7/site-packages/pyspark', '/Library/Python/2.7/site-packages/pyspark', '/Library/Python/2.7']

Process finished with exit code 255


(2) One setting is still missing. In PyCharm, open Run => Edit Configurations, click the environment-variables field (marked in red below), and add a SPARK_HOME variable pointing at the Spark install directory, as shown in the red box:
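As an alternative to the PyCharm run-configuration setting, SPARK_HOME can be set from the script itself, as long as it happens before pyspark is imported. A sketch, assuming the Spark path used throughout this walkthrough:

```python
import os

# Hypothetical path from this walkthrough -- adjust to your Spark unpack dir.
os.environ["SPARK_HOME"] = "/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6"

# Import pyspark only AFTER SPARK_HOME is set; otherwise it fails with the
# "Could not find valid SPARK_HOME" error shown above.
# from pyspark import SparkContext
```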



(3) Running again produces the correct result:
/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/a6/Downloads/PycharmProjects/test_use_hbase_by_thrift/test_python_local_use_spark.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/10/13 16:30:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/13 16:30:48 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 10.2.32.96 instead (on interface en0)
17/10/13 16:30:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
bad: 2
spark: 3
mlib: 2
good: 2
hadoop: 1
cool: 2

Process finished with exit code 0


6. pySpark learning resources
(1) http://spark.apache.org/docs/latest/api/python/pyspark.html
(2) The folder spark-2.1.0-bin-hadoop2.6/examples/src/main/python inside the unpacked distribution contains many example programs worth studying; the wordCount in this article is a lightly modified version of the one there.
7. Finding the Python install directory and where third-party modules live
Everything is straightforward once you know the Python home path.

An example:
localhost:python a6$ python
Python 2.7.10 (default, Feb  7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.path
['', '/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg', '/Library/Python/2.7/site-packages/py4j-0.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/redis-2.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/MySQL_python-1.2.4-py2.7-macosx-10.12-intel.egg', '/Library/Python/2.7/site-packages/thrift-0.10.0-py2.7-macosx-10.12-intel.egg', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Library/Python/2.7/site-packages', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
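The sysconfig module (available since Python 2.7) can report the site-packages directory directly, without eyeballing sys.path:

```python
import sysconfig

# "purelib" is the directory where pure-Python third-party packages install
site_packages = sysconfig.get_paths()["purelib"]
print(site_packages)
```

On the system Python used above this should print a path under /Library/Python/2.7/; the exact value depends on your interpreter.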

8. Installing pySpark on Windows and calling it from PyCharm
For a worked example, see: http://blog.csdn.net/a819825294/article/details/51782773
