AxsCatalog

class axs.catalog.AxsCatalog(spark_session=None)[source]

Implements high-level operations on AXS tables, such as loading, saving, renaming and dropping. AXS tables are Spark Parquet tables, but are bucketed and sorted in a special way. AxsCatalog relies on Spark Catalog for loading table data and it needs an active SparkSession object (with Hive support enabled) for initialization. This is also necessary because information about available AXS tables is persisted in a special AXS table in Spark metastore DB.

DEC_COLNAME = 'dec'
DUP_COLNAME = 'dup'
NGBR_BORDER_HEIGHT = 0.002777777777777778
NUM_BUCKETS = 500
RA_COLNAME = 'ra'
ZONE_COLNAME = 'zone'
ZONE_HEIGHT = 0.016666666666666666
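For intuition, the default constants correspond to familiar angular sizes. The quick check below only restates the values documented above:

```python
# Default AxsCatalog constants, copied from the class attributes above
ZONE_HEIGHT = 0.016666666666666666         # degrees
NGBR_BORDER_HEIGHT = 0.002777777777777778  # degrees

# Zones are 1 arcminute tall; the duplicated border strip is 10 arcseconds
print(round(ZONE_HEIGHT * 60, 6))           # 1.0 arcminute
print(round(NGBR_BORDER_HEIGHT * 3600, 6))  # 10.0 arcseconds
```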
add_increment(table_name, increment_df, rename_to=None, temp_tbl_name=None)[source]

Adds a new batch of data contained in the increment_df DataFrame (or AxsFrame) to the persisted AXS table table_name. The old table will be renamed to rename_to, if set, or to “table_name + _YYYYMMDDhhmm” otherwise. The data will be first saved into temp_tbl_name before renaming the main table.

increment_df needs to have the same schema as the table table_name.

Parameters:
  • table_name – The table to which to add the new data.
  • increment_df – The data to add to the existing table. Needs to have the appropriate schema.
  • rename_to – New table name for the original data.
  • temp_tbl_name – Temporary table name to use (table_name + “_temp” is the default) before rename operation.
Returns:

The table name to which the original table has been renamed
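The fallback naming described above (“table_name + _YYYYMMDDhhmm”) can be sketched as follows; `default_rename_to` is a hypothetical helper written for illustration, not part of the AXS API:

```python
from datetime import datetime

def default_rename_to(table_name, now=None):
    # Hypothetical helper mirroring the documented default archive name:
    # table_name + "_YYYYMMDDhhmm"
    now = now or datetime.now()
    return table_name + now.strftime("_%Y%m%d%H%M")

print(default_rename_to("sdss", datetime(2019, 3, 5, 14, 7)))  # sdss_201903051407
```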

calculate_zone(df, zone_height=0.016666666666666666)[source]

Adds zone and dup columns to the df DataFrame. df needs to have a dec column for calculating zones and must not already have zone or dup columns. Data in the lower border strip of each zone is duplicated into the zone below it; the dup column contains 1 for those duplicated rows and 0 otherwise.

Parameters:
  • df – The input DataFrame for which to calculate zones.
  • zone_height – Zone height to be used for data partitioning
Returns:

The new AxsFrame
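A minimal pure-Python sketch of the zoning and duplication rule described above. The zone formula (counting zones up from dec = -90°) is an assumption inferred from the class constants, not spelled out on this page, and `with_zones` is an illustrative stand-in for the Spark implementation:

```python
import math

ZONE_HEIGHT = 1 / 60          # AxsCatalog.ZONE_HEIGHT, in degrees
NGBR_BORDER_HEIGHT = 1 / 360  # AxsCatalog.NGBR_BORDER_HEIGHT (10 arcsec)

def with_zones(rows):
    """rows: dicts with a 'dec' key. Returns rows with added 'zone' and
    'dup' columns; rows in a zone's lower border strip are duplicated
    into the zone below with dup=1 (assumed zone formula)."""
    out = []
    for r in rows:
        zone = int(math.floor((r["dec"] + 90.0) / ZONE_HEIGHT))
        out.append({**r, "zone": zone, "dup": 0})
        if (r["dec"] + 90.0) - zone * ZONE_HEIGHT < NGBR_BORDER_HEIGHT:
            # close to the lower zone edge: copy into the zone below
            out.append({**r, "zone": zone - 1, "dup": 1})
    return out

rows = with_zones([{"dec": 0.0001}, {"dec": 0.009}])
print(len(rows))  # 3: the first row was also copied into the zone below
```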

drop_table(table_name, drop_spark=True)[source]

Drops a table from both AXS and Spark catalogs.

Parameters:
  • table_name – Table to drop.
  • drop_spark – Whether to drop the table in Spark catalog, too. Default is True.
import_existing_table(table_name, path, num_buckets=500, zone_height=0.016666666666666666, import_into_spark=True, update_spark_bucketing=True, bucket_col='zone', ra_col='ra', dec_col='dec', lightcurves=False, lc_cols=None)[source]

Imports an existing, properly bucketed and sorted Parquet file into AXS catalog. If import_into_spark is True, the table will also be imported into Spark catalog.

bucket_col, ra_col and dec_col allow for changing the default bucketing and sorting columns.

If lightcurves is True, some columns are expected to be array columns; those are specified in lc_cols.

Parameters:
  • table_name – The table name into which to import the data.
  • path – The path to the bucketed Parquet file to be imported (needed only for Spark import).
  • num_buckets – Number of buckets in the input Parquet file.
  • zone_height – Zone height used for data partitioning.
  • import_into_spark – Whether to also import the table into Spark. If False, the table should already exist in Spark metastore.
  • update_spark_bucketing – Whether to also update bucketing info in Spark metastore.
  • bucket_col – The column used for data bucketing.
  • ra_col – The name of column containing RA coordinates.
  • dec_col – The name of column containing DEC coordinates.
  • lightcurves – Whether the table contains lightcurve data as array columns.
  • lc_cols – Comma-separated list of names of array columns containing lightcurve data.
list_tables()[source]

Returns a list of known AxsFrame tables as a dictionary where the keys are table names and the values are again dictionaries with these fields:
  • table_id – Internal table ID
  • table_name – Name of the table
  • num_buckets – Number of buckets used for partitioning the table data
  • zone_height – Zone height used for data partitioning
  • bucket_col – The column name used for bucketing
  • ra_col – The column containing RA coordinates
  • dec_col – The column containing DEC coordinates
  • has_lightcurves – Whether the table contains lightcurve data as array columns
  • lc_columns – A list of array columns containing lightcurve data
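The returned dictionary can be consumed like any mapping. The entry below is made-up sample data in the documented shape, not output from a real catalog:

```python
# Illustrative list_tables() result (sample values, documented shape)
tables = {
    "gaia": {
        "table_id": 1, "table_name": "gaia",
        "num_buckets": 500, "zone_height": 1 / 60,
        "bucket_col": "zone", "ra_col": "ra", "dec_col": "dec",
        "has_lightcurves": False, "lc_columns": None,
    },
}
for name, info in tables.items():
    print(name, info["num_buckets"], info["has_lightcurves"])
```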

load(table_name)[source]

Loads a known AXS table from a Spark catalog and returns it as an AxsFrame.

rename_table(table_name, new_name)[source]

Renames an AXS table table_name to new_name. Also renames the table in the Spark catalog.

Parameters:
  • table_name – An existing table to rename.
  • new_name – The new name for the table.
save_axs_table(df, tblname, repartition=True, calculate_zone=False, num_buckets=500, zone_height=0.016666666666666666)[source]

Saves a Spark DataFrame as an AxsFrame under the name tblname. Also saves the table as a Spark table in the Spark catalog. The table will be bucketed into num_buckets buckets (AxsCatalog.NUM_BUCKETS by default), each bucket sorted by the zone and ra columns.

Note: If saving intermediate results from cross-matching two AxsFrames, the DataFrame should already be partitioned appropriately; repartition should then be set to False to speed things up.

To obtain the saved table, use the load() method.

Parameters:
  • df – Spark DataFrame or AxsFrame to save as AXS table.
  • tblname – Table name to use for saving.
  • repartition – Whether to repartition the data by zone before saving.
  • calculate_zone – Whether to first add zone and dup columns to df.
  • num_buckets – Number of buckets to use for data partitioning.
  • zone_height – Zone height to be used for data partitioning.
table_info(table_name)[source]

Returns info about a known AxsFrame table as a dictionary with these keys:
  • table_id – Internal table ID
  • table_name – Name of the table
  • num_buckets – Number of buckets used for partitioning the table data
  • zone_height – Zone height used for data partitioning
  • bucket_col – The column name used for bucketing
  • ra_col – The column containing RA coordinates
  • dec_col – The column containing DEC coordinates
  • has_lightcurves – Whether the table contains lightcurve data as array columns
  • lc_columns – A list of array columns containing lightcurve data

Parameters:
  • table_name – Table name for which to fetch info.