Using AXS

With AXS you can cross-match, query, and analyze data from astronomical catalogs. This page provides a brief description of these features.

Cross-matching catalogs

After you have installed and started AXS and imported data from at least two catalogs, you can cross-match them.

First, you can get the list of tables in the AXS metastore using list_tables. The following snippet, for example, will print out the names of known tables:

from axs import AxsCatalog

db = AxsCatalog(spark)
for tname in db.list_tables():
    print(tname)

To load tables, you use AxsCatalog’s load method:

sdss = db.load('sdss')
gaia = db.load('gaia')

Both sdss and gaia are objects of type AxsFrame. You can cross-match the two using the crossmatch method (the next example uses a 2 arc-second match radius):

from axs import Constants

gaia_sdss_cm = gaia.crossmatch(sdss, 2*Constants.ONE_ASEC,
    return_min=False, include_dist_col=True)

If you set return_min to True, only the nearest match will be returned for each row. include_dist_col determines whether the calculated distance is returned as an additional column (named axsdist).
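
For example, to keep only the nearest match for each row:

gaia_sdss_best = gaia.crossmatch(sdss, 2*Constants.ONE_ASEC, return_min=True)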

Now that you have cross-matching results in an AxsFrame, you can do further processing on the results using Spark and AXS APIs.
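
For example, assuming the result behaves like a regular Spark DataFrame (as described in the querying section below), a quick inspection might look like this:

n_matches = gaia_sdss_cm.count()          # number of matched pairs
gaia_sdss_cm.orderBy("axsdist").show(10)  # the ten closest matches first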

Querying catalog data

To query and process AXS catalog data you can use the standard Spark API. However, one important concept in AXS is data duplication: some of the rows (those in the border areas of each zone) are duplicated and have the dup column set to 1. This duplication enables quick cross-matching, but during normal data processing it is unwanted. You can exclude the duplicated rows using AxsFrame’s exclude_duplicates function:

gaia_clean = gaia.exclude_duplicates()

AXS provides several additional astronomy-specific methods (a brief usage sketch follows the list):

  • region returns all objects from an AxsFrame inside a rectangular region of the sky specified by two corner points (their ra and dec coordinates).
  • cone returns all objects from an AxsFrame falling inside a cone specified by its central point and a radius.
  • histogram bins the rows by a value you calculate with a supplied function (a new column definition); you also need to set the number of bins to use.
  • histogram2d does the same thing in two dimensions.
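
As a rough sketch of how these methods might be called (the argument order, the column access, and the magnitude column name are assumptions; check the API reference for the exact signatures):

# Objects in a rectangular ra/dec region defined by two corner points (degrees).
region_df = gaia.region(10.0, 20.0, 12.0, 22.0)

# Objects within 5 arc-seconds of a point given by its ra and dec.
cone_df = gaia.cone(11.0, 21.0, 5*Constants.ONE_ASEC)

# Histogram of an assumed magnitude column, binned into 100 bins.
hist = gaia.histogram(gaia['phot_g_mean_mag'], 100)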

Running custom functions

Two methods deserve special mention: add_primitive_column(colname, coltype, func, *col_names) and add_column(colname, coltype, func, *col_names). They serve the same purpose: making it easier to run custom, user-defined data processing functions. The user-defined function should run on a single row at a time and produce one value as a result. add_primitive_column supports only output columns of primitive types (integer, float, string, and so on), but should be faster to run (it uses Spark’s @pandas_udf mechanism). add_column also supports complex columns (map, array, struct) and uses Spark’s @udf mechanism.

You have to provide the resulting column name and its type (colname and coltype arguments, respectively), the names of the columns whose values should serve as an input to the function (col_names argument), and the function itself (func).

The function for add_primitive_column needs to operate on pandas.Series objects (one for each column listed in col_names) and return a single pandas.Series object of the type defined by coltype.
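
For example, a sketch of computing a total proper motion from two catalog columns (the input column names and the 'double' type string are assumptions):

import numpy as np

def pm_total(pmra, pmdec):
    # pmra and pmdec arrive as pandas.Series; numpy operates on them
    # element-wise, so the result is also a pandas.Series.
    return np.sqrt(pmra**2 + pmdec**2)

gaia_with_pm = gaia.add_primitive_column('pm_total', 'double', pm_total,
    'pmra', 'pmdec')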

The function for add_column should be an ordinary Python function, returning a single object of the type defined by coltype.
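
Similarly, a sketch for add_column producing a complex (array) column; the column names are assumptions, and add_column may expect a Spark DataType object rather than the type string shown here:

def mag_pair(g_mag, r_mag):
    # An ordinary Python function called once per row, returning an array value.
    return [g_mag, r_mag]

gaia_with_pair = gaia.add_column('mag_pair', 'array<double>', mag_pair,
    'phot_g_mean_mag', 'phot_rp_mean_mag')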

Working with light-curve data

AXS stores light-curve data in array columns. The intention is to have one row per object and store the individual measurements as elements of arrays. For example, a star would have a row with its ra and dec coordinates, possibly also an ID column, and all of its measurements stored in these columns:

  • mjd - containing times at which measurements were taken
  • filter - containing the filters used for each measurement (g, r, etc.)
  • mag - containing the actual measurements

Other columns (such as mag_err) could also be used, depending on the actual catalog.
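
With this layout, Spark’s built-in array functions are already useful. As a sketch (lc stands for a hypothetical light-curve AxsFrame loaded with db.load(), with an assumed id column and the array columns listed above), the parallel arrays can be flattened into one row per measurement:

from pyspark.sql import functions as F

# Zip the parallel arrays and explode them, producing one row per measurement.
exploded = (lc.select('id', F.explode(F.arrays_zip('mjd', 'filter', 'mag')).alias('m'))
              .select('id', 'm.mjd', 'm.filter', 'm.mag'))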

To query these data, AXS relies on Spark’s built-in functions and adds two SQL functions of its own: array_allpositions and array_select. array_allpositions takes an array column and a value and returns the positions of all occurrences of the value in that array column, as an array of longs. array_select takes an array column and an array of indices (which can be another column or a value) and returns the elements of the array column at the provided indices.
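
For example, to pull out only the r-band measurements, the two functions might be combined in SQL expressions (a sketch; it assumes the light-curve columns above and that the AXS SQL functions are registered in the Spark session):

from pyspark.sql import functions as F

# Positions of all 'r' entries in the filter array, then the mjd and mag
# values at those positions.
r_band = lc.select(
    'id',
    F.expr("array_select(mjd, array_allpositions(`filter`, 'r'))").alias('mjd_r'),
    F.expr("array_select(mag, array_allpositions(`filter`, 'r'))").alias('mag_r'))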

These functions enable filtering of array columns and, together with Spark’s built-in functions, should be adequate for most use cases.