spacekit.preprocessor.scrub

class spacekit.preprocessor.scrub.HstCalScrubber(data=None, output_path=None, output_file='batch.csv', dropnans=True, save_raw=True, **log_kws)[source]

Class for invoking initial preprocessing on HST Pipeline calibration metadata for training compute resource estimation models.

Parameters:
  • data (pandas.DataFrame or dict, optional) – dataset to be scrubbed, by default None

  • output_path (str or Path, optional) – location to save preprocessed output files, by default None

  • output_file (str, optional) – file basename to assign preprocessed dataset, by default “batch.csv”

  • dropnans (bool, optional) – find and remove any NaNs, by default True

  • save_raw (bool, optional) – save data as csv on local disk before any encoding is performed, by default True

class spacekit.preprocessor.scrub.HstSvmScrubber(input_path, data=None, output_path=None, output_file='svm_data', dropnans=True, save_raw=True, make_pos_list=True, crpt=0, make_subsamples=False, **log_kws)[source]

Class for applying standard preprocessing steps of Single Visit Mosaic regression test data.

Parameters:
  • input_path (str or Path) – path to directory containing data input files

  • data (dataframe, optional) – dataframe containing raw inputs scraped from json (QA) files, by default None

  • output_path (str or Path, optional) – location to save preprocessed output files, by default None

  • output_file (str, optional) – file basename to assign preprocessed dataset, by default “svm_data”

  • dropnans (bool, optional) – find and remove any NaNs, by default True

  • save_raw (bool, optional) – save data as csv before any encoding is performed, by default True

  • make_pos_list (bool, optional) – create a text file listing misaligned (label=1) datasets, by default True

  • crpt (int, optional) – dataset contains synthetically corrupted data, by default 0

  • make_subsamples (bool, optional) – save a random selection of aligned (label=0) datasets to text file, by default False

add_crpt_labels()[source]

For new synthetic datasets, adds “label” target column and assigns value of 1 to all rows.

Returns:

self.df updated with label column (all values set to 1)

Return type:

dataframe

find_subsamples()[source]

Gets a varied sampling of dataframe observations and saves to local text file. This is one way of identifying a small subset for synthetic data generation.

make_pos_label_list()[source]

Looks for target class labels in dataframe and saves a text file listing index names of positive class. Originally this was to automate moving images into class labeled directories.
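The step described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper name `write_positive_list`, the mapping of index names to labels, and the one-name-per-line file layout are all assumptions made for the example.

```python
from pathlib import Path


def write_positive_list(labels, out_file):
    """Write the index names of positive-class (label == 1) datasets to a text file.

    `labels` is a hypothetical mapping of dataset index name -> class label.
    The one-name-per-line layout is an assumption, not the documented format.
    """
    positives = sorted(name for name, label in labels.items() if label == 1)
    Path(out_file).write_text("".join(f"{name}\n" for name in positives))
    return positives
```

Calling `write_positive_list({"visit_a": 1, "visit_b": 0}, "pos.txt")` would write a single line containing `visit_a`.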

preprocess_data()[source]

Main calling function to run each preprocessing step for SVM regression data.

scrub_columns()[source]

Initial dataframe scrubbing to extract and rename columns, drop NaNs, and set the index.

scrub_qa_summary(csvfile='single_visit_mosaics*.csv', idx=0)[source]

Alternative scrubbing method for when no .json files are available (i.e. the QA step was not run during processing). Uses the QA summary csv file matching csvfile instead.

class spacekit.preprocessor.scrub.JwstCalScrubber(input_path, data=None, pfx='', sfx='_uncal.fits', dropnans=False, save_raw=True, encoding_pairs=None, mode='fits', **log_kws)[source]

Class for invoking initial preprocessing of JWST calibration input data.

Parameters:
  • input_path (str or Path) – path on local disk where L1 input exposures are located

  • data (pd.DataFrame, optional) – dataframe of exposures to be preprocessed, by default None

  • pfx (str, optional) – limit scrape search to files starting with a given prefix such as ‘jw01018’, by default “”

  • sfx (str, optional) – limit scrape search to files ending with a given suffix, by default “_uncal.fits”

  • dropnans (bool, optional) – drop null value columns, by default False

  • save_raw (bool, optional) – save a copy of the dataframe before encoding, by default True

  • encoding_pairs (dict, optional) – preset key-value pairs for encoding categorical data, by default None

  • mode (str, optional) – determines how data is scraped and handled (‘fits’ for files or ‘df’ for dataframe), by default ‘fits’

  • miri_ifu_opts (dict, optional) – Optionally ignore channel and/or subchannel for MIRI IFU exposures. Setting both to False will consider exposures from all channels and subchannels of a given observation to be inputs for a single L3 product.

fake_target_ids()[source]

Assigns a fake target ID using TARGNAME, TARG_RA or GS_MAG. These IDs are fake in that they’re unlikely to match actual target IDs assigned later in the pipeline. For source-based exposures, the id defaults to “s000000001” except in the case of NRC_WFSS parallel_pure which uses “t0”.

Grouping logic:
  • TARG_RA (rounded to 6 decimals) – VISITYPE = PRIME_TARGETED_FIXED, TARGNAME = NaN

  • TARGNAME – VISITYPE != PRIME_TARGETED_FIXED, TARGNAME != NaN

  • GS_MAG – TARGNAME = NaN, GS_MAG != NaN, VISITYPE is neither PRIME_TARGETED_FIXED nor PARALLEL_PURE

Remaining groups not matching above parameters default to ‘t0’ (typically ‘parallel_pure’ visitypes).
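The grouping rules above can be read as a simple decision function. The following is a minimal sketch of that reading, not the method's actual code: the function names, the use of a plain dict for the exposure header, and the extra guard that TARG_RA must be present are assumptions made for the example.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def grouping_value(exp):
    """Return (keyword, value) used to group one exposure, or ("t0", None).

    `exp` is a hypothetical dict of header keywords for a single exposure.
    """
    visitype = exp.get("VISITYPE")
    targname = exp.get("TARGNAME")
    targ_ra = exp.get("TARG_RA")
    gs_mag = exp.get("GS_MAG")
    fixed = visitype == "PRIME_TARGETED_FIXED"

    # Rule 1: fixed targets without a name are grouped by rounded TARG_RA.
    # (The check that TARG_RA is present is an added guard for this sketch.)
    if fixed and is_missing(targname) and not is_missing(targ_ra):
        return ("TARG_RA", round(targ_ra, 6))
    # Rule 2: non-fixed visits with a target name are grouped by TARGNAME.
    if not fixed and not is_missing(targname):
        return ("TARGNAME", targname)
    # Rule 3: unnamed targets with a guide star magnitude, excluding
    # PRIME_TARGETED_FIXED and PARALLEL_PURE visits, are grouped by GS_MAG.
    if (
        is_missing(targname)
        and not is_missing(gs_mag)
        and visitype not in ("PRIME_TARGETED_FIXED", "PARALLEL_PURE")
    ):
        return ("GS_MAG", gs_mag)
    # Everything else falls back to the default 't0' group.
    return ("t0", None)
```

Under this literal reading, a PRIME_TARGETED_FIXED exposure that does have a TARGNAME also falls through to 't0'; how the library treats that case is not specified by the rules listed here.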

get_dtype_keys()[source]

Group input metadata into pre-set data types before applying NaNdlers.

Returns:

key-value pairs of data type and exposure header / column name

Return type:

dict

get_level3_products()[source]

Determines potential L3 products based on groups of input exposures with matching FITS keywords prog+obs+optelem+fxd_slit+subarray. These groups are further subdivided and assigned a fake target ID by TARGNAME, GS_MAG or TARG_RA.
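The first grouping stage described above (matching keyword values) can be illustrated with a short sketch. It assumes each exposure header is a plain dict and that the keys are literally named `prog`, `obs`, `optelem`, `fxd_slit` and `subarray`; the function name and data layout are hypothetical and not taken from the library.

```python
from collections import defaultdict

# Keyword names taken from the description above; the exact column names
# used internally by the library are an assumption.
GROUP_KEYS = ("prog", "obs", "optelem", "fxd_slit", "subarray")


def group_exposures(exposures):
    """Group L1 exposure names whose headers share the same GROUP_KEYS values.

    `exposures` maps exposure name -> header dict. Missing keys are treated
    as None so that exposures lacking a keyword still group together.
    """
    groups = defaultdict(list)
    for name, header in exposures.items():
        key = tuple(header.get(k) for k in GROUP_KEYS)
        groups[key].append(name)
    return dict(groups)
```

Each resulting group is a candidate L3 product; the further subdivision by target is a separate step.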

property input_data

Preprocessed input data grouped by exposure type

Returns:

input data grouped by exp_type (IMAGE, SPEC, FGS, TAC)

Return type:

dict

make_image_product_name(k, v, tnum)[source]

Parse through exposure metadata to create expected L3 image products.

Parameters:
  • k (str) – exposure header key (L1 exposure name)

  • v (dict) – exposure header data

  • tnum (str) – number assigned to each unique target (targ_ra) within a program

make_spec_product_name(k, v, tnum)[source]

Parse through exposure metadata to create expected L3 spectroscopy products. NOTE: Although the pipeline would create multiple products for either source-based exposures or (channel-based) MIRI IFU exposures, only one product name will be created since the model is concerned with RAM, i.e. how large the memory footprint is to calibrate a set of input exposures. Source-based products use “s000000001” for the source; MIR_MRS exposures default to “ch1” or “ch3” for channel. Subchannel (“BAND”) is ignored other than for determining if exposures are MIRI IFU.

Parameters:
  • k (str) – exposure header key (L1 exposure name)

  • v (dict) – exposure header data

  • tnum (str) – number assigned to each unique target (targ_ra) within a program

make_tac_product_name(k, v, p)[source]

If an image or spec product meets the required conditions, it is added instead to the TAC products dictionary (Time-series, AMI, Coronagraph).

Parameters:
  • k (str) – exposure header key (L1 exposure name)

  • v (dict) – exposure header data

  • p (str) – product name

pixel_offsets()[source]

Generate the pixel offset between exposure reference pixels and the estimated L3 fiducial.

rename_miri_mrs()[source]

DEPRECATED: The default behavior of JWST Pipeline >=1.17.0 is to generate a separate L3 product for each sub-channel (band). This class method will be removed in the next release.

scrape_inputs()[source]

Scrape input exposure header metadata from fits files on local disk located at self.input_path.

scrub_inputs(exp_type='IMAGE')[source]

Main calling function for preprocessing input exposures of a given exposure type.

Parameters:

exp_type (str, optional) – Exposure type, by default “IMAGE”

Returns:

preprocessed data with renamed columns, NaNs scrubbed and categorical data encoded

Return type:

pd.DataFrame

verify_target_groups()[source]

Certain L3 products need to be further defined by their L1 input TARG_RA values in addition to all other parameters. This only affects PRIME_TARGETED_FIXED visit types where TARGNAME != NaN. If multiple unique TARG_RA/DEC values (rounded to 6 digits) are identified within the group of exposures, we can assume each TARG grouping is a unique L3 product.
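A minimal sketch of the splitting idea described above: exposures in one group are bucketed by their rounded coordinates, and each bucket is treated as its own product. The function name, the dict-based header layout, and the use of `TARG_DEC` as the declination keyword are assumptions for illustration only.

```python
from collections import defaultdict


def split_by_target(exposures, ndigits=6):
    """Bucket exposure names by (TARG_RA, TARG_DEC) rounded to `ndigits`.

    `exposures` maps exposure name -> header dict containing float
    TARG_RA and TARG_DEC values (hypothetical layout for this sketch).
    """
    buckets = defaultdict(list)
    for name, header in exposures.items():
        key = (round(header["TARG_RA"], ndigits), round(header["TARG_DEC"], ndigits))
        buckets[key].append(name)
    return dict(buckets)
```

If the returned dict holds more than one key, the group contains multiple distinct pointings and would be split into that many L3 products under the assumption stated in the description.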

class spacekit.preprocessor.scrub.Scrubber(data=None, col_order=None, output_path=None, output_file=None, dropnans=True, save_raw=True, name='Scrubber', **log_kws)[source]

Base parent class for preprocessing data. Includes some basic column scrubbing methods for pandas dataframes. The heavy lifting is done via subclasses.

Parameters:
  • data (pandas.DataFrame or dict, optional) – dataset to be scrubbed, by default None

  • col_order (list, optional) – order input feature columns, by default None

  • output_path (str or Path, optional) – path on local disk to save scrubbed dataset, by default None

  • output_file (str, optional) – name to give scrubbed dataset file, by default None

  • dropnans (bool, optional) – find and remove any NaNs, by default True

  • save_raw (bool, optional) – save data as csv on local disk before any encoding is performed, by default True

  • name (str, optional) – logger name (mutable for subclasses), by default “Scrubber”

save_csv_file(df=None, pfx='', index_col='index')[source]

Saves dataframe to csv file on local disk.

Parameters:

pfx (str, optional) – Insert a prefix at start of filename, by default “”

Returns:

self.data_path where file is saved on disk.

Return type:

str