Starting AXS

After installing AXS, you can run it in three ways:

  • using pyspark shell
  • from a Python program
  • from a Jupyter notebook

The three options differ only in how the Spark session, which AXS requires, is started. In each case you must choose a Spark “master”, that is, the type of Spark cluster to use. The easiest choice is to run locally with the local[n] master, where n is the number of tasks you wish to run in parallel. Alternatively, you can start a Spark Standalone cluster and connect to it, connect to an existing YARN or Mesos cluster, or run Spark on Kubernetes.
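Each of the cluster types above corresponds to a different form of master URL. As a rough illustration, the helper below assembles the URL forms that Spark accepts. The function name master_url and its parameters are hypothetical (they are not part of AXS or Spark), and the host and port values are placeholders.

```python
def master_url(kind, host=None, port=None, n=None):
    """Return a Spark master URL string for the given cluster type.

    Hypothetical helper for illustration only; not part of AXS or Spark.
    """
    if kind == "local":
        # local[n] runs n tasks in parallel; local[*] uses all available cores
        return "local[{}]".format("*" if n is None else n)
    if kind == "standalone":
        return "spark://{}:{}".format(host, port)
    if kind == "mesos":
        return "mesos://{}:{}".format(host, port)
    if kind == "kubernetes":
        # host and port identify the Kubernetes API server
        return "k8s://https://{}:{}".format(host, port)
    if kind == "yarn":
        # the cluster location is taken from the Hadoop configuration
        return "yarn"
    raise ValueError("unknown cluster type: {}".format(kind))


print(master_url("local", n=10))                              # local[10]
print(master_url("standalone", host="localhost", port=7078))  # spark://localhost:7078
```

The resulting string is what you pass to the --master option of pyspark or to the master() method of the SparkSession builder.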

Starting AXS using pyspark shell

Start pyspark, specifying the master and any other options you need. For example:

pyspark --master=local[10] --driver-memory=64g

Then, inside the shell, create an instance of AxsCatalog, passing it the SparkSession instance that the shell has already initialized:

from axs import AxsCatalog
db = AxsCatalog(spark)

Starting AXS using a Python program

To use AXS from a Python program, first set the SPARK_HOME environment variable to the directory where you installed AXS’ version of Spark. For example:

export SPARK_HOME=/opt/axs-dist
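Depending on your Python environment, setting SPARK_HOME alone may not make the pyspark package importable. The sketch below is one possible workaround, assuming the usual Spark layout in which $SPARK_HOME/python holds pyspark and $SPARK_HOME/python/lib holds a py4j-*-src.zip archive. The function name add_pyspark_to_path is hypothetical and not part of AXS.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home=None):
    """Put the pyspark sources found under SPARK_HOME on sys.path.

    Hypothetical helper; assumes the standard Spark directory layout.
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home or not os.path.isdir(spark_home):
        raise RuntimeError("SPARK_HOME is not set or is not a directory")
    paths = [os.path.join(spark_home, "python")]
    # The py4j version differs between Spark releases, so match by pattern
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    paths += sorted(glob.glob(pattern))
    for p in paths:
        if p not in sys.path:
            sys.path.insert(0, p)
    return paths
```

If you need it, call add_pyspark_to_path() before importing pyspark. It is unnecessary when PYTHONPATH already contains these entries.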

To connect to a Spark Standalone cluster, first start it from the AXS installation directory:

sbin/start-all.sh

Then start an ordinary Python shell, create an instance of SparkSession connected to the Spark cluster, and pass it to AxsCatalog. For example:

from pyspark.sql import SparkSession
from axs import AxsCatalog

# Connect to the standalone master and limit the resources the application uses
spark = (
    SparkSession.builder
    .master("spark://localhost:7078")
    .config("spark.cores.max", "10")
    .config("spark.executor.cores", "1")
    .config("spark.driver.memory", "12g")
    .config("spark.executor.memory", "12g")
    .appName("AXS application")
    .getOrCreate()
)
db = AxsCatalog(spark)

In this example the Spark session points to a Spark Standalone master listening on port 7078 on localhost. You can also set the master to local[10], as in the previous example.
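To see what the resource settings in this example add up to, the sketch below computes an upper bound on the number of executors and their combined memory. The function executor_footprint is a hypothetical helper, and it assumes memory values given in whole gigabytes with a "g" suffix. The number of executors actually launched also depends on the cores and memory available on the workers.

```python
def executor_footprint(conf):
    """Return (max executors, total executor memory in GB) for a Spark config.

    Hypothetical helper; assumes memory is written as whole gigabytes, e.g. "12g".
    """
    cores_max = int(conf["spark.cores.max"])            # cores for the whole application
    executor_cores = int(conf["spark.executor.cores"])  # cores per executor
    memory_gb = int(conf["spark.executor.memory"].rstrip("g"))
    executors = cores_max // executor_cores
    return executors, executors * memory_gb


conf = {
    "spark.cores.max": "10",
    "spark.executor.cores": "1",
    "spark.executor.memory": "12g",
}
print(executor_footprint(conf))  # (10, 120)
```

With these settings the application can use up to 10 single-core executors and up to 120 GB of executor memory in total, in addition to the 12 GB requested for the driver.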

Starting AXS from a Jupyter notebook

To set up a Jupyter environment correctly, it is best to create a new Jupyter kernel and then edit the corresponding kernel.json file. You can list the current kernels with the following command:

jupyter kernelspec list

Then edit the kernel.json file in the displayed folder (e.g. /usr/local/share/jupyter/kernels/mypythonkernel/kernel.json) and change the variables in the env section. Here is a working example for AXS installed in /opt/axs-dist:

{
 "argv": [
  "/opt/anaconda/bin/python",
  "-m",
  "ipykernel_launcher",
  "-f",
  "{connection_file}"
 ],
 "display_name": "AXS",
 "language": "python",
 "env": {
  "SPARK_HOME": "/opt/axs-dist",
  "PYLIB": "/opt/axs-dist/python/lib:/opt/spark-axs/python/lib/py4j-0.10.7-src.zip",
  "PYTHONPATH": "/opt/axs-dist/python:/opt/anaconda/lib/python36.zip:/opt/anaconda/lib/python3.6:/opt/anaconda/lib/python3.6/lib-dynload:/opt/anaconda/lib/python3.6/site-packages:/opt/anaconda/lib/python3.6/site-packages/IPython/extensions",
  "JUPYTERPATH": "/opt/axs-dist/python:/epyc/opt/anaconda/lib/python36.zip:/opt/anaconda/lib/python3.6:/opt/anaconda/lib/python3.6/lib-dynload:/opt/anaconda/lib/python3.6/site-packages:/opt/anaconda/lib/python3.6/site-packages/IPython/extensions",
  "PATH": "/opt/axs-dist/python/lib/py4j-0.10.7-src.zip:/opt/spark-axs/bin:/epyc/opt/jdk1.8.0_161/bin:/opt/anaconda/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/var/lib/jupyterhub/miniconda/bin",
  "JAVA_HOME": "/opt/jdk1.8.0_161"
 }
}
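If you prefer not to edit kernel.json by hand, the env section can also be updated with a short script. The following is a minimal sketch using only the Python standard library. The function name set_kernel_env is hypothetical, and the path and values in the usage comment are the ones from the example above, so adjust them to your installation.

```python
import json


def set_kernel_env(kernel_json_path, updates):
    """Merge the given variables into the "env" section of a kernel.json file."""
    with open(kernel_json_path) as f:
        spec = json.load(f)
    # Create the env section if the kernel does not have one yet
    spec.setdefault("env", {}).update(updates)
    with open(kernel_json_path, "w") as f:
        json.dump(spec, f, indent=1)
    return spec


# Example usage (paths are assumptions taken from the example above):
# set_kernel_env(
#     "/usr/local/share/jupyter/kernels/mypythonkernel/kernel.json",
#     {"SPARK_HOME": "/opt/axs-dist", "JAVA_HOME": "/opt/jdk1.8.0_161"},
# )
```

The script leaves all other entries in kernel.json unchanged. It overwrites the file in place, so keep a backup copy if the kernel is shared with other users.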

Then start a new notebook with the kernel you just edited. After that, you can start a Spark session and create an instance of AxsCatalog just as you did in the Python case above.