Portable Python Virtual Environment for Pyspark

pyspark Jan 19, 2021

On a clustered environment, we face a lot of issues with the Python version available on the nodes. If we are shipping our product, we have to run a lot of pre-deployment sanity tests to make sure the application behaves as expected, but we can't cover every scenario, so there is a high chance of hitting an issue.

So we looked for a better way and came up with the idea of shipping our own Python version with everything preinstalled in the package. You may already be familiar with virtual environments or Anaconda, but believe me, after reading this you will have learned something new.

Before we proceed, it's necessary to understand the basic structure of a Python environment:

├── bin
│ ├── activate
│ ├── activate.csh
│ ├── activate.fish
│ ├── activate_this.py
│ ├── easy_install
│ ├── easy_install-3.6
│ ├── pip
│ ├── pip3
│ ├── pip3.6
│ ├── python
│ ├── python-config
│ ├── python3 -> python
│ ├── python3.6 -> python
│ └── wheel
├── include
│ └── python3.6m -> /usr/include/python3.6m
├── lib
│ └── python3.6
│     ├── site-packages
│     └── lib-dynload -> /usr/lib/python3.6/lib-dynload [Dynamic Library]

Environment Variables:

PYSPARK_PYTHON : Points to the Python executable: bin/python
LD_LIBRARY_PATH : Points to the dynamic library path: lib/python3.6/lib-dynload [All .so* files]
PYTHONPATH : Points to the packages installed in the virtual environment as well as the dynamic library path: lib/python3.6/site-packages<CPS>lib/python3.6/lib-dynload [All .py and .so files]
PYTHONHOME : Points to the Python library path: lib/python3.6/site-packages
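
To make the mapping above concrete, here is a minimal sketch that derives the four values from an environment root and a Python version. The function name venv_env_vars and its root and version parameters are illustrative assumptions, not part of the original setup, and the <CPS> token is kept as a literal string exactly as it is written above.

```python
# Sketch only: builds the values listed above as plain strings.
# venv_env_vars, root and version are hypothetical names for illustration.
CPS = "<CPS>"  # separator token, kept verbatim from the text above


def venv_env_vars(root="", version="3.6"):
    """Return the four environment variable values as a dict."""
    prefix = f"{root}/" if root else ""
    lib = f"{prefix}lib/python{version}"
    site_packages = f"{lib}/site-packages"
    lib_dynload = f"{lib}/lib-dynload"
    return {
        "PYSPARK_PYTHON": f"{prefix}bin/python",
        "LD_LIBRARY_PATH": lib_dynload,
        "PYTHONPATH": CPS.join([site_packages, lib_dynload]),
        "PYTHONHOME": site_packages,
    }


if __name__ == "__main__":
    for name, value in venv_env_vars().items():
        print(f"{name}={value}")
```

Passing a root such as "venv" prefixes every path, which is convenient when the archive is unpacked under a directory of that name.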

Steps to build Virtual environment:

1. Install the desired version of Python on the machine.

2. Create Virtual Env

virtualenv env -p /usr/local/bin/python3

3. Activate Virtual Env

source env/bin/activate

4. Install requirements

pip install numpy

5. Now here is the trick: the lib-dynload -> /usr/lib/python3.6/lib-dynload entry in the tree above is a symbolic link that points to a path on the local machine, so if you simply zip the virtual environment folder, these dependencies will be missing on the cluster.

6. So you need to copy all the .so* files from /usr/lib/python3.6/lib-dynload, /usr/lib64/*.so.*, etc. to lib/python3.6/lib-dynload. [Be careful with /usr/lib64/*.so.*: it contains OS-specific libraries that may fail on different OS versions, so try to avoid .so files from this folder.]

7. Copy all the .py and .so files from /usr/lib/python3.6/lib-dynload, /usr/lib64/*.so.*, etc. to lib/python3.6/site-packages.
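
The symlink problem described in the steps above can be handled with a small script. The following is a minimal sketch, assuming the environment lives in env/ and that lib-dynload is a symlink to a directory. The helper name materialize_symlink is hypothetical. It covers only the lib-dynload copy, not the extra libraries from /usr/lib64.

```python
import os
import shutil


def materialize_symlink(link_path):
    """Replace a directory symlink with a real copy of its target.

    Returns True if a symlink was replaced, False if link_path is not a symlink.
    """
    if not os.path.islink(link_path):
        return False
    target = os.path.realpath(link_path)  # e.g. /usr/lib/python3.6/lib-dynload
    if not os.path.isdir(target):
        # Refuse to touch broken links or links to plain files.
        raise NotADirectoryError(target)
    os.unlink(link_path)                  # removes only the link, not the target
    shutil.copytree(target, link_path)    # copy the real files into the env
    return True


if __name__ == "__main__":
    # Assumed layout: run from the directory that contains env/.
    print(materialize_symlink("env/lib/python3.6/lib-dynload"))
```

Calling it a second time on the same path returns False, since the link has already been replaced by a real directory.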

Prepare zip

Run it from the home dir of the virtual environment; in our case it's env/

zip -rq ../venv.zip *
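
If the packaging step needs to be scripted, roughly the same archive can be produced from Python. This is a sketch using the standard library's shutil.make_archive, not the zip command itself. The names zip_env and top_level_entries are hypothetical helpers. Unlike the `*` glob, make_archive also picks up hidden files at the top level.

```python
import os
import shutil
import zipfile


def zip_env(env_dir, archive_base):
    """Zip the contents of env_dir so bin/, lib/, ... sit at the archive root."""
    # make_archive appends ".zip" to archive_base and returns the final path.
    # Keep archive_base outside env_dir so the archive does not include itself.
    return shutil.make_archive(os.path.abspath(archive_base), "zip", root_dir=env_dir)


def top_level_entries(archive_path):
    """Return the sorted top-level names stored in the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})
```

A quick check such as top_level_entries("venv.zip") should list bin, include and lib rather than a single env folder, which matches running zip from inside the environment directory.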

Test your package

cd venv
 
export PYTHONPATH=lib64/python3.6/site-packages:lib64/python3.6/lib-dynload/
 
export LD_LIBRARY_PATH=lib64/python3.6/lib-dynload
 
source bin/activate

Environmental variable setup

For driver: spark.yarn.appMasterEnv.[Environment variable]

For executor: spark.executorEnv.[Environment variable]

PYSPARK_PYTHON

  1. spark.yarn.appMasterEnv.PYSPARK_PYTHON = venv/bin/python
  2. spark.executorEnv.PYSPARK_PYTHON = venv/bin/python

PYTHONHOME

  1. spark.yarn.appMasterEnv.PYTHONHOME = venv/lib64/python3.6/site-packages
  2. spark.executorEnv.PYTHONHOME = venv/lib64/python3.6/site-packages

LD_LIBRARY_PATH

  1. spark.yarn.appMasterEnv.LD_LIBRARY_PATH = venv/lib64/python3.6/lib-dynload
  2. spark.executorEnv.LD_LIBRARY_PATH = venv/lib64/python3.6/lib-dynload
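
Since each variable above is set twice, once for the driver and once for the executors, the settings can be generated rather than typed by hand. The sketch below expands a dict of values into spark-submit --conf arguments. The function name spark_env_confs is hypothetical, and the values are the ones listed above.

```python
# Sketch: expand per-variable values into spark-submit --conf flags.
DRIVER_PREFIX = "spark.yarn.appMasterEnv."
EXECUTOR_PREFIX = "spark.executorEnv."


def spark_env_confs(env_vars):
    """Return ['--conf', 'key=value', ...] covering driver and executors."""
    args = []
    for name, value in env_vars.items():
        for prefix in (DRIVER_PREFIX, EXECUTOR_PREFIX):
            args += ["--conf", f"{prefix}{name}={value}"]
    return args


if __name__ == "__main__":
    settings = {
        "PYSPARK_PYTHON": "venv/bin/python",
        "PYTHONHOME": "venv/lib64/python3.6/site-packages",
        "LD_LIBRARY_PATH": "venv/lib64/python3.6/lib-dynload",
    }
    print(" ".join(spark_env_confs(settings)))
```

The resulting list can be appended to a spark-submit command line, or the same key/value pairs can be placed in the Spark configuration file.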

PYTHONPATH

This needs to be included in YARN-ENV-ENTRIES; it does not get set from the Spark configs.

PYTHONPATH = {{PWD}}/__venv__.zip<CPS>{{PWD}}/__py4j-0.10.7-src__.zip<CPS>venv/lib64/python3.6/site-packages<CPS>venv/lib64/python3.6/lib-dynload<CPS>


Kshitij

Lead Engineer at Tookitaki