Introduction to AXS

AXS is a framework for processing data from astronomical catalogs based on Apache Spark. AXS stands for Astronomy eXtensions for Spark.

AXS was designed to make life easier for astronomers when they have to crunch data from astronomical catalogs (which is close to “all the time”). Typically, an astronomer selects some data from a public or private catalog using an SQL query, exports that data to some location, and then writes custom scripts (most likely in Python) to analyze it further. Relying on the underlying Spark engine, AXS simplifies this process by letting astronomers query the data and run arbitrary processing on the results within the same framework. There is usually no need for data export or separate programs.

Astronomers also often need to cross-match several catalogs, that is, to find the same objects in different catalogs based on distance calculations. This is a computationally very expensive operation. AXS partitions the data in a way that makes cross-matching extremely efficient.
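To make the cost concrete, here is a minimal sketch of a naive cross-match in plain Python. It is not AXS code: the function names, the (id, ra, dec) tuple layout and the sample coordinates are made up for illustration. Every object of one catalog is compared with every object of the other using the haversine formula for angular distance, so the work grows with the product of the catalog sizes.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance in degrees between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def naive_crossmatch(cat_a, cat_b, radius_deg):
    """Compare all pairs: len(cat_a) * len(cat_b) distance calculations."""
    matches = []
    for id_a, ra_a, dec_a in cat_a:
        for id_b, ra_b, dec_b in cat_b:
            if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                matches.append((id_a, id_b))
    return matches

# Hypothetical catalogs: (id, ra, dec) in degrees.
cat_a = [("a1", 10.0000, 20.0000), ("a2", 150.0, -30.0)]
cat_b = [("b1", 10.0002, 20.0001), ("b2", 200.0, 45.0)]

matches = naive_crossmatch(cat_a, cat_b, radius_deg=2.0 / 3600.0)  # 2 arcseconds
print(matches)
```

With two catalogs of a billion rows each, this all-pairs approach would need about 10^18 distance calculations, which is why the data layout matters.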

Spark specifics

If you wish to use AXS and you have never worked with Spark, you will have some adjusting to do because AXS is a wrapper around Spark, or Spark SQL to be exact. The AxsFrame class is a wrapper around Spark’s DataFrame, so all operations available on DataFrames are also available on AxsFrames.

There are a few concepts that a new Spark user might find confusing.

First of all, DataFrames (and AxsFrames) are lazy: they are evaluated and executed only when an action (such as show or count) is called.

Secondly, DataFrames are immutable. When you call a transformation (such as select or where) you obtain a new DataFrame. Stacking up transformations builds the DataFrame “program” to be executed later.

Because of all this, if the program results in an error, you will have to examine all the transformations that went into building the DataFrame and not just the last one (which would be the natural first reaction).
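The toy model below illustrates these points in plain Python. It is not Spark or AXS code: LazyFrame is a hypothetical class that only borrows the names where, select and count. It models errors that occur while the data is being processed; real Spark reports some problems, such as a misspelled column name, as soon as the transformation is called.

```python
class LazyFrame:
    """Toy stand-in for a lazy, immutable DataFrame (not Spark code)."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # source data: a list of dicts
        self._steps = tuple(steps)   # recorded transformations, not yet run

    # Transformations: record a step and return a NEW frame.
    def where(self, predicate):
        return LazyFrame(self._rows, self._steps + (("where", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._steps + (("select", columns),))

    # Action: runs the whole recorded program.
    def count(self):
        rows = self._rows
        for kind, arg in self._steps:
            if kind == "where":
                rows = [row for row in rows if arg(row)]
            else:
                rows = [{c: row[c] for c in arg} for row in rows]
        return len(rows)

stars = LazyFrame([
    {"ra": 10.0, "dec": 20.0, "mag": 15.2},
    {"ra": 11.0, "dec": -5.0, "mag": 19.8},
    {"ra": 12.0, "dec": 40.0, "mag": None},   # missing magnitude
])

north = stars.where(lambda row: row["dec"] > 0)    # new frame, `stars` unchanged
faint = stars.where(lambda row: row["mag"] > 18)   # breaks on None, but not yet
positions = faint.select("ra", "dec")              # still nothing has run

north_count = north.count()   # the action triggers execution
print(stars.count(), north_count)

error = None
try:
    positions.count()         # fails here, although the faulty step is `where`
except TypeError as exc:
    error = exc
print("failed at action time:", error)
```

The failure is reported when count is called on positions, but its cause is the where step recorded two lines earlier.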

Spark is a complex framework and execution engine and you would do well to invest some time into learning Spark basics if you intend to use AXS.

AXS data partitioning

AXS partitions the sky into horizontal strips of fixed height, called zones. Data is physically partitioned into buckets, which are ordinary (Parquet) files containing part of the data of the full dataset. Zones are sequentially placed into buckets. Since there are many more zones than buckets, a single bucket contains data from various non-neighboring zones. Within a bucket, data is sorted by zone (derived from the Dec coordinate) and then by the RA coordinate.
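The layout can be sketched in a few lines of plain Python. This is only an illustration, not the AXS implementation. It makes three assumptions: the zone height (one arcminute here), the number of buckets (120 here), and the reading of "sequentially placed" as a round-robin (modulo) assignment. All names are hypothetical.

```python
import math
from collections import defaultdict

ZONES_PER_DEGREE = 60   # assumption: zones one arcminute high
NUM_BUCKETS = 120       # assumption: illustrative bucket count

def zone_of(dec_deg):
    """Index of the horizontal strip containing dec_deg, counted from Dec = -90."""
    return math.floor((dec_deg + 90.0) * ZONES_PER_DEGREE)

def bucket_of(zone):
    """Assumed round-robin placement of consecutive zones into buckets."""
    return zone % NUM_BUCKETS

def partition(rows):
    """Group (id, ra, dec) rows by bucket; sort each bucket by zone, then RA."""
    buckets = defaultdict(list)
    for obj_id, ra, dec in rows:
        zone = zone_of(dec)
        buckets[bucket_of(zone)].append((zone, ra, obj_id))
    return {bucket: sorted(content) for bucket, content in buckets.items()}

# Hypothetical objects: (id, ra, dec) in degrees.
rows = [("s1", 10.0, 0.5), ("s2", 5.0, 0.5), ("s3", 7.0, 2.5), ("s4", 1.0, 0.75)]
buckets = partition(rows)
for bucket, content in sorted(buckets.items()):
    print(bucket, content)
```

In this sketch s1 and s2 fall into the same zone, while s3 lies two degrees further north yet lands in the same bucket, because its zone index differs by exactly the number of buckets.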

This setup enables fast spatial searches and fast cross-matching of different catalogs.
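A minimal sketch of why sorted zones help, again in plain Python and not taken from AXS: to find objects near a position, only the zones overlapping the search radius are visited, and within each zone a binary search on RA narrows the candidates before the exact distance is computed. The zone height and all names are assumptions, and the sketch ignores RA wrap-around at 0/360 degrees and the poles.

```python
import bisect
import math
from collections import defaultdict

ZONES_PER_DEGREE = 60   # assumption: zones one arcminute high

def zone_of(dec_deg):
    return math.floor((dec_deg + 90.0) * ZONES_PER_DEGREE)

def separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def build_index(rows):
    """Map zone -> list of (ra, dec, id), sorted by RA."""
    index = defaultdict(list)
    for obj_id, ra, dec in rows:
        index[zone_of(dec)].append((ra, dec, obj_id))
    for objects in index.values():
        objects.sort()
    return index

def cone_search(index, ra, dec, radius_deg):
    """Ids of objects within radius_deg of (ra, dec)."""
    # Widen the RA window, since circles of constant Dec shrink towards the poles.
    max_abs_dec = min(abs(dec) + radius_deg, 89.0)
    ra_window = radius_deg / math.cos(math.radians(max_abs_dec))
    found = []
    for zone in range(zone_of(dec - radius_deg), zone_of(dec + radius_deg) + 1):
        objects = index.get(zone, [])
        lo = bisect.bisect_left(objects, (ra - ra_window,))
        hi = bisect.bisect_right(objects, (ra + ra_window, math.inf))
        for cand_ra, cand_dec, obj_id in objects[lo:hi]:   # few candidates left
            if separation_deg(ra, dec, cand_ra, cand_dec) <= radius_deg:
                found.append(obj_id)
    return found

# Hypothetical catalog: (id, ra, dec) in degrees.
index = build_index([
    ("near", 10.0002, 20.0001),
    ("same_zone_far_ra", 50.0, 20.0001),
    ("far_zone", 10.0, 25.0),
])
result = cone_search(index, 10.0, 20.0, radius_deg=2.0 / 3600.0)  # 2 arcseconds
print(result)
```

Running such a search for every object of one catalog against the index of another gives a cross-match that never compares objects in distant zones.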