AxsFrame

class axs.axsframe.AxsFrame(dataframe, table_info)[source]

AxsFrame is an extension of Spark’s DataFrame with additional methods useful for querying astronomical catalogs. It typically represents a Parquet Spark table containing properly bucketed (and sorted) data from an astronomical catalog, or a subset of such a table.

All Spark DataFrame methods which normally return a DataFrame (such as select or withColumn) will return an AxsFrame when called on an AxsFrame instance.

Note

One caveat, resulting from AXS’ partitioning and indexing scheme, is that you always need to keep the zone and ra columns in AXS frames that you wish to later save or cross-match with other frames.
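
For example, a minimal sketch (the table name gaia and the column parallax are placeholders for whatever is in your metastore):

    from axs import AxsCatalog
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    catalog = AxsCatalog(spark)
    gaia = catalog.load('gaia')

    # Keep 'zone' and 'ra' (plus 'dec' and 'dup') so the result can still be
    # saved or cross-matched later.
    subset = gaia.select('zone', 'ra', 'dec', 'dup', 'parallax')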

add_column(colname, coltype, func, *col_names)[source]

Add a column colname of type coltype to this AxsFrame object by executing function func on columns col_names. Function func has to accept exactly as many arguments as there are columns specified. The function will be applied to the dataframe data row by row. The arguments passed to the function will match the types of the specified columns, and the function needs to return an object of the type represented by the string coltype.

This method will use @udf in the background (see pyspark.sql.functions.udf).

This function is slower than add_primitive_column but can produce array or map columns.

Parameters:
  • colname – The name of the new column.
  • coltype – The name of the new column’s type. (See pyspark.sql.types)
  • func – The function which will produce data for the new column. Needs to return a single object of the type defined by coltype.
  • col_names – The columns whose data will be supplied to the function func.
Returns:

The new AxsFrame with the new column added.
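
For example, a sketch building an array column (the frame df and the magnitude columns g and r are placeholders):

    # Combine two magnitude columns into a single array column.
    def make_pair(g, r):
        return [g, r]

    with_pair = df.add_column('g_r_pair', 'array<double>', make_pair, 'g', 'r')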

add_primitive_column(colname, coltype, func, *col_names)[source]

Add a column colname of type coltype to this AxsFrame object by executing function func on columns col_names. Function func has to accept exactly as many arguments as there are columns specified. The arguments will be of type Pandas.Series, and the function needs to return a Pandas.Series object of the type represented by the string coltype.

This method will use @pandas_udf in the background (see pyspark.sql.functions.pandas_udf).

The type of the new column needs to be a primitive type (int, double, etc.). The input columns can be of complex types.

This function is faster than add_column but cannot produce array or map columns.

Parameters:
  • colname – The name of the new column.
  • coltype – The name of the new column’s type. (See pyspark.sql.types)
  • func – The function which will produce data for the new column. Needs to operate on Pandas.Series objects (as many as there are col_names) and return a single Pandas.Series object of the type defined by coltype.
  • col_names – The columns whose data will be supplied to the function func.
Returns:

The new AxsFrame with the new column added.
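
For example, a vectorized sketch (the frame df and the columns g and r are placeholders):

    # g and r arrive as Pandas.Series; Series arithmetic returns a Series of doubles.
    def g_minus_r(g, r):
        return g - r

    with_color = df.add_primitive_column('g_minus_r', 'double', g_minus_r, 'g', 'r')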

agg(*exprs)[source]
alias(alias)[source]
approxQuantile(col, probabilities, relativeError)[source]
cache()[source]
checkpoint(eager=True)[source]
coalesce(numPartitions)[source]
colRegex(colName)[source]
collect()[source]
columns
cone(ra, dec, r, remove_duplicates=False)[source]

Selects the rows containing objects falling into the ‘cone’ defined by the center point (ra, dec) and the radius r.

Parameters:
  • ra – Ra of the center point. Has to be inside [0, 360).
  • dec – Dec of the center point. Has to be inside [-90, 90].
  • r – Radius of the search cone in degrees. Has to be inside (0, 90].
  • remove_duplicates – If True, duplicated rows will be removed from the resulting AxsFrame. If you plan to cross-match the results with another AxsFrame, leave this at False.
Returns:

The AxsFrame containing the rows found inside the search cone.
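
For example (the frame df and the coordinates are placeholders):

    # A 2-arcminute cone search; remove_duplicates is left at False because
    # the result might be cross-matched later.
    matches = df.cone(280.5, -15.2, 2 / 60.0)
    matches.count()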

corr(col1, col2, method=None)[source]
count()[source]
cov(col1, col2)[source]
createGlobalTempView(name)[source]
createOrReplaceGlobalTempView(name)[source]
createOrReplaceTempView(name)[source]
createTempView(name)[source]
crossmatch(axsframe, r=0.0002777777777777778, return_min=False, include_dist_col=True)[source]

Performs the cross-match operation between this AxsFrame and axsframe, which can be either an AxsFrame or a Spark DataFrame, using r as the cross-matching radius (one arc-second by default).

Both frames need to have zone, ra, dec, and dup columns.

Note that if axsframe is a Spark frame, the cross-match operation will not be optimized and might take a much longer time to complete.

The best performance can be expected when both tables are read directly as AxsFrames. In that scenario cross-matching will be done on bucket pairs in parallel without data movement between executors. If, however, one of the two AxsFrames being cross-matched is the result of a groupBy operation, for example, data movement cannot be avoided. In those cases, it might prove faster to first save the “grouped by” table using save_axs_table and then do the cross-match.

Warning

The two AXS tables being cross-matched need to have the same zone height and the same number of buckets. The results are otherwise unreliable.

The resulting table will contain all columns from both input tables.

Parameters:
  • axsframe – The AxsFrame or Spark DataFrame to be cross-matched with this one
  • r – The search radius used for cross-matching. One arc-second by default.
  • return_min – Whether to return only a single cross-match result per row, or all the matching objects within the specified cross-match radius.
  • include_dist_col – Whether to include the calculated distance column (“axsdist”) in the results. Default is True.
Returns:

AxsFrame containing cross-matching results with all columns from both frames.
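
For example, a sketch under the assumptions above (gaia and sdss are placeholder AxsFrames loaded from AXS tables with matching zone heights and bucket counts):

    # Keep only the closest match per row; r defaults to one arc-second.
    cm = gaia.crossmatch(sdss, return_min=True)
    cm.show()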

crosstab(col1, col2)[source]
cube(*cols)[source]
describe(*cols)[source]
distinct()[source]
drop(*cols)[source]
dropDuplicates(subset=None)[source]
dropna(how='any', thresh=None, subset=None)[source]
dtypes
exclude_duplicates()[source]

Removes the duplicated data (where dup is equal to 1) from the AxsFrame.

The AxsFrame needs to contain the dup column.

Returns:

The AxsFrame with duplicated data removed.
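
For example (the frame df is a placeholder):

    # Drop rows duplicated into neighboring zones before counting.
    unique = df.exclude_duplicates()
    unique.count()
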
explain(extended=False)[source]
fillna(value, subset=None)[source]
filter(condition)[source]
first()[source]
foreach(f)[source]
foreachPartition(f)[source]
freqItems(cols, support=None)[source]
get_table_info()[source]
groupBy(*cols)[source]
groupby(*cols)[source]
head(n=None)[source]
histogram(cond, numbins)[source]

Uses the cond column expression to obtain the data for the histogram calculation. The data will be binned into numbins bins.

Parameters:
  • cond – Column expression determining the data.
  • numbins – Number of bins.
Returns:

AxsFrame containing row counts per bin.
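
For example (the frame df and the magnitude column are placeholders):

    from pyspark.sql import functions as F

    # A 100-bin histogram of g-band magnitudes.
    hist = df.histogram(F.col('phot_g_mean_mag'), 100)
    hist.show()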

histogram2d(cond1, cond2, numbins1, numbins2, min1=None, max1=None, min2=None, max2=None)[source]

Uses cond1 and cond2 column expressions to obtain data for the 2D histogram calculation. The data on the x axis will be binned into numbins1 bins, and the data on the y axis into numbins2 bins. If min1, max1, min2 or max2 are not specified, they will be calculated using an additional pass through the data. The method returns x, y and z 2-D numpy arrays (see numpy.mgrid) which can be used as input to matplotlib.pcolormesh.

Parameters:
  • cond1 – Column expression determining the data on x axis.
  • cond2 – Column expression determining the data on y axis.
  • numbins1 – Number of bins for x axis.
  • numbins2 – Number of bins for y axis.
  • min1 – Optional minimum value for x axis data.
  • max1 – Optional maximum value for x axis data.
  • min2 – Optional minimum value for y axis data.
  • max2 – Optional maximum value for y axis data.
Returns:

x, y, z 2-D numpy “meshgrid” arrays (see numpy.mgrid)
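
For example (the frame df, the column names, and the axis limits are placeholders):

    import matplotlib.pyplot as plt
    from pyspark.sql import functions as F

    # A 2-D histogram of color vs. magnitude.
    x, y, z = df.histogram2d(F.col('bp_rp'), F.col('phot_g_mean_mag'),
                             200, 200, min1=-1.0, max1=5.0, min2=5.0, max2=21.0)
    plt.pcolormesh(x, y, z)
    plt.show()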

intersect(other)[source]
isLocal()[source]
isStreaming
join(other, on=None, how=None)[source]
limit(num)[source]
na
persist(storageLevel=StorageLevel.MEMORY_AND_DISK)[source]
printSchema()[source]
randomSplit(weights, seed=None)[source]
rdd
region(ra1=0, dec1=-90, ra2=360, dec2=90, spans_prime_mer=False)[source]

Selects only the rows containing objects with RA between ra1 and ra2 and Dec between dec1 and dec2.

If spans_prime_mer is set to True, RA will be selected as ( >= ra1 OR <= ra2).

Parameters:
  • ra1 – Lower RA bound.
  • dec1 – Lower Dec bound.
  • ra2 – Upper RA bound.
  • dec2 – Upper Dec bound.
  • spans_prime_mer – Whether the RA band spans the prime meridian (0 deg).
Returns:

The AxsFrame containing the resulting data.
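
For example, selecting a band that crosses RA = 0 (the frame df is a placeholder):

    # Selects rows with (RA >= 350 OR RA <= 10) AND -10 <= Dec <= 10.
    strip = df.region(ra1=350, dec1=-10, ra2=10, dec2=10, spans_prime_mer=True)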

registerTempTable(name)[source]
replace(to_replace, value=<no value>, subset=None)[source]
rollup(*cols)[source]
sample(withReplacement=None, fraction=None, seed=None)[source]
sampleBy(col, fractions, seed=None)[source]
save_axs_table(tblname, calculate_zone=False)[source]

Persists the AxsFrame under the name tblname to make it available for loading in the future. The table will be available under this name to all Spark applications using the same metastore.

If calculate_zone is set to True, the zone column used for bucketing the data must not exist as it will be added before saving. If calculate_zone is True, the dup column will also be added and the data from border stripes in neighboring zones will be duplicated to the zone below. If calculate_zone is False, the zone and dup columns need to already be present.

The AxsFrame also needs to have the ra column because it will be used for data indexing along with the zone column.

See also catalog.save_axs_table().

Parameters:
  • tblname – Table name to use for persisting.
  • calculate_zone – Whether to calculate the zone column.
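
For example (the table and column names are placeholders; df already has the zone and dup columns, so calculate_zone stays False):

    # Persist a filtered subset as a new AXS table.
    bright = df.where('phot_g_mean_mag < 15')
    bright.save_axs_table('gaia_bright')
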
schema
select(*cols)[source]
selectExpr(*expr)[source]
show(n=20, truncate=True, vertical=False)[source]
sort(*cols, **kwargs)[source]
sortWithinPartitions(*cols, **kwargs)[source]
sql_ctx
stat
subtract(other)[source]
take(num)[source]
toDF(*cols)[source]
toJSON(use_unicode=True)[source]
toPandas()[source]
union(other)[source]
unionAll(other)[source]
unionByName(other)[source]
unpersist(blocking=False)[source]
where(condition)[source]
withColumn(colName, col)[source]
withColumnRenamed(existing, new)[source]
write
writeStream