spacekit.preprocessor.scrub
- class spacekit.preprocessor.scrub.HstCalScrubber(data=None, output_path=None, output_file='batch.csv', dropnans=True, save_raw=True, **log_kws)
Class for invoking initial preprocessing on HST Pipeline calibration metadata for training compute resource estimation models.
- Parameters:
data (pandas.DataFrame or dict, optional) – dataset to be scrubbed, by default None
output_path (str or Path, optional) – location to save preprocessed output files, by default None
output_file (str, optional) – file basename to assign preprocessed dataset, by default “batch.csv”
dropnans (bool, optional) – find and remove any NaNs, by default True
save_raw (bool, optional) – save data as csv on local disk before any encoding is performed, by default True
- class spacekit.preprocessor.scrub.HstSvmScrubber(input_path, data=None, output_path=None, output_file='svm_data', dropnans=True, save_raw=True, make_pos_list=True, crpt=0, make_subsamples=False, **log_kws)
Class for applying standard preprocessing steps to Single Visit Mosaic (SVM) regression test data.
- Parameters:
input_path (str or Path) – path to directory containing data input files
data (dataframe, optional) – dataframe containing raw inputs scraped from json (QA) files, by default None
output_path (str or Path, optional) – location to save preprocessed output files, by default None
output_file (str, optional) – file basename to assign preprocessed dataset, by default “svm_data”
dropnans (bool, optional) – find and remove any NaNs, by default True
save_raw (bool, optional) – save data as csv before any encoding is performed, by default True
make_pos_list (bool, optional) – create a text file listing misaligned (label=1) datasets, by default True
crpt (int, optional) – dataset contains synthetically corrupted data, by default 0
make_subsamples (bool, optional) – save a random selection of aligned (label=0) datasets to text file, by default False
- add_crpt_labels()
For new synthetic datasets, adds “label” target column and assigns value of 1 to all rows.
- Returns:
self.df updated with label column (all values set = 1)
- Return type:
dataframe
- find_subsamples()
Gets a varied sampling of dataframe observations and saves to local text file. This is one way of identifying a small subset for synthetic data generation.
- make_pos_label_list()
Looks for target class labels in dataframe and saves a text file listing index names of positive class. Originally this was to automate moving images into class labeled directories.
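The behavior described above can be sketched in plain Python. This is only an illustration under stated assumptions, not spacekit's implementation: `write_positive_list`, the dict layout, and the file name `pos.txt` are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_positive_list(labels, output_file):
    """Write the index names whose label equals 1, one per line.

    `labels` maps an index name to its class label (hypothetical layout,
    standing in for a dataframe with a "label" column).
    """
    positives = sorted(name for name, label in labels.items() if label == 1)
    Path(output_file).write_text("\n".join(positives) + "\n")
    return positives


# Hypothetical index names and labels
labels = {"visit_a": 0, "visit_b": 1, "visit_c": 1}
out_dir = Path(tempfile.mkdtemp())
pos_file = out_dir / "pos.txt"
positives = write_positive_list(labels, pos_file)
```

Only the positive-class (label=1) names reach the text file; aligned (label=0) rows are skipped.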
- preprocess_data()
Main calling function to run each preprocessing step for SVM regression data.
- class spacekit.preprocessor.scrub.JwstCalScrubber(input_path, data=None, pfx='', sfx='_uncal.fits', dropnans=False, save_raw=True, encoding_pairs=None, mode='fits', **log_kws)
Class for invoking initial preprocessing of JWST calibration input data.
- Parameters:
input_path (str or Path) – path on local disk where L1 input exposures are located
data (pd.DataFrame, optional) – dataframe of exposures to be preprocessed, by default None
pfx (str, optional) – limit scrape search to files starting with a given prefix such as ‘jw01018’, by default “”
sfx (str, optional) – limit scrape search to files ending with a given suffix, by default “_uncal.fits”
dropnans (bool, optional) – drop null value columns, by default False
save_raw (bool, optional) – save a copy of the dataframe before encoding, by default True
encoding_pairs (dict, optional) – preset key-value pairs for encoding categorical data, by default None
mode (str, optional) – determines how data is scraped and handled (‘fits’ for files or ‘df’ for dataframe), by default ‘fits’
miri_ifu_opts (dict, optional) – Optionally ignore channel and/or subchannel for MIRI IFU exposures. Setting both to False will consider exposures from all channels and subchannels of a given observation to be inputs for a single L3 product.
- fake_target_ids()
Assigns a fake target ID using TARGNAME, TARG_RA or GS_MAG. These IDs are fake in that they’re unlikely to match actual target IDs assigned later in the pipeline. For source-based exposures, the ID defaults to “s000000001” except in the case of NRC_WFSS parallel_pure, which uses “t0”.
Grouping logic:
- TARG_RA (rounded to 6 decimals): VISITYPE = PRIME_TARGETED_FIXED, TARGNAME = NaN
- TARGNAME: VISITYPE != PRIME_TARGETED_FIXED, TARGNAME != NaN
- GS_MAG: TARGNAME = NaN, GS_MAG != NaN, VISITYPE is neither “PRIME_TARGETED_FIXED” nor “PARALLEL_PURE”
Remaining groups not matching above parameters default to ‘t0’ (typically ‘parallel_pure’ visitypes).
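A minimal sketch of the grouping rules above, read literally. It is not spacekit's code: `is_nan` and `target_group_key` are hypothetical helpers, headers are modeled as plain dicts, and a missing value is assumed to be either `None` or a float NaN.

```python
import math


def is_nan(value):
    """True for None or a float NaN (stand-in for a missing header value)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def target_group_key(header):
    """Pick the (keyword, value) pair used to group one exposure."""
    visitype = header.get("VISITYPE")
    targname = header.get("TARGNAME")
    gs_mag = header.get("GS_MAG")
    if visitype == "PRIME_TARGETED_FIXED" and is_nan(targname):
        return ("TARG_RA", round(header["TARG_RA"], 6))
    if visitype != "PRIME_TARGETED_FIXED" and not is_nan(targname):
        return ("TARGNAME", targname)
    if (is_nan(targname) and not is_nan(gs_mag)
            and visitype not in ("PRIME_TARGETED_FIXED", "PARALLEL_PURE")):
        return ("GS_MAG", gs_mag)
    # Anything else falls through to the default fake target
    return ("DEFAULT", "t0")


# Illustrative headers; keyword values are made up for the example
fixed = {"VISITYPE": "PRIME_TARGETED_FIXED", "TARGNAME": None, "TARG_RA": 80.12345678}
named = {"VISITYPE": "PRIME_UNTARGETED", "TARGNAME": "NGC-1234"}
guided = {"VISITYPE": "PRIME_UNTARGETED", "TARGNAME": float("nan"), "GS_MAG": 12.5}
parallel = {"VISITYPE": "PARALLEL_PURE", "TARGNAME": None, "GS_MAG": 12.5}
keys = [target_group_key(h) for h in (fixed, named, guided, parallel)]
```

Exposures that share a key would then receive the same fake target ID; how the IDs themselves are numbered is not covered by this sketch.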
- get_dtype_keys()
Group input metadata into pre-set data types before applying NaNdlers.
- Returns:
key-value pairs of data type and exposure header / column name
- Return type:
dict
- get_level3_products()
Determines potential L3 products based on groups of input exposures with matching FITS keywords prog+obs+optelem+fxd_slit+subarray. These groups are further subdivided and assigned a fake target ID by TARGNAME, GS_MAG or TARG_RA.
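The first grouping step can be illustrated as follows. This is a sketch assuming headers are plain dicts keyed by the lowercase names listed above; `GROUP_KEYS`, `group_exposures`, and the exposure names are hypothetical and not taken from spacekit.

```python
from collections import defaultdict

# Stand-ins for the keywords named above (prog+obs+optelem+fxd_slit+subarray)
GROUP_KEYS = ("prog", "obs", "optelem", "fxd_slit", "subarray")


def group_exposures(exposures):
    """Map each tuple of keyword values to the exposure names sharing it."""
    groups = defaultdict(list)
    for name, header in exposures.items():
        key = tuple(header.get(k) for k in GROUP_KEYS)
        groups[key].append(name)
    return dict(groups)


# Illustrative exposures: exp_a and exp_b differ from exp_c only in optelem
base = {"prog": "01018", "obs": "006", "fxd_slit": None, "subarray": "FULL"}
exposures = {
    "exp_a": {**base, "optelem": "F200W"},
    "exp_b": {**base, "optelem": "F200W"},
    "exp_c": {**base, "optelem": "F444W"},
}
groups = group_exposures(exposures)
```

Each resulting group is a candidate L3 product; the further subdivision by target is a separate step.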
- property input_data
Preprocessed input data grouped by exposure type
- Returns:
input data grouped by exp_type (IMAGE, SPEC, FGS, TAC)
- Return type:
dict
- make_image_product_name(k, v, tnum)
Parse through exposure metadata to create expected L3 image products.
- Parameters:
k (str) – exposure header key (L1 exposure name)
v (dict) – exposure header data
tnum (str) – number assigned to each unique target (targ_ra) within a program
- make_spec_product_name(k, v, tnum)
Parse through exposure metadata to create expected L3 spectroscopy products. NOTE: Although the pipeline would create multiple products for either source-based exposures or (channel-based) MIRI IFU exposures, only one product name will be created since the model is concerned with RAM, i.e. how large the memory footprint is to calibrate a set of input exposures. Source-based products use “s000000001” for the source; MIR_MRS exposures default to “ch1” or “ch3” for channel. Subchannel (“BAND”) is ignored other than for determining if exposures are MIRI IFU.
- make_tac_product_name(k, v, p)
If an image or spec product meets the required conditions, it is added instead to the TAC products dictionary (Time-series, AMI, Coronagraph).
- Parameters:
k (str) – exposure header key (L1 exposure name)
v (dict) – exposure header data
p (str) – product name
- pixel_offsets()
Generate the pixel offset between exposure reference pixels and the estimated L3 fiducial.
- rename_miri_mrs()
DEPRECATED: The default behavior of JWST Pipeline >=1.17.0 is to generate a separate L3 product for each sub-channel (band). This method will be removed in the next release.
- scrape_inputs()
Scrape input exposure header metadata from FITS files on local disk located at self.input_path.
- scrub_inputs(exp_type='IMAGE')
Main calling function for preprocessing input exposures of a given exposure type.
- Parameters:
exp_type (str, optional) – Exposure type, by default “IMAGE”
- Returns:
preprocessed data with renamed columns, NaNs scrubbed and categorical data encoded
- Return type:
pd.DataFrame
- verify_target_groups()
Certain L3 products need to be further defined by their L1 input TARG_RA values in addition to all other parameters. This only affects PRIME_TARGETED_FIXED visit types where TARGNAME != NaN. If multiple unique TARG_RA/DEC values (rounded to 6 digits) are identified within the group of exposures, we can assume each TARG grouping is a unique L3 product.
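The subdivision described above can be sketched as below. This is an illustration only: `split_by_pointing` is a hypothetical helper, headers are modeled as dicts with `TARG_RA` and `TARG_DEC` floats, and the coordinates are invented.

```python
from collections import defaultdict


def split_by_pointing(exposures, ndigits=6):
    """Subdivide one exposure group by (TARG_RA, TARG_DEC) rounded to `ndigits`."""
    subgroups = defaultdict(list)
    for name, header in exposures.items():
        pointing = (
            round(header["TARG_RA"], ndigits),
            round(header["TARG_DEC"], ndigits),
        )
        subgroups[pointing].append(name)
    return list(subgroups.values())


# exp_a and exp_b agree to 6 decimals; exp_c points elsewhere
group = {
    "exp_a": {"TARG_RA": 10.0000001, "TARG_DEC": -5.0},
    "exp_b": {"TARG_RA": 10.0000002, "TARG_DEC": -5.0},
    "exp_c": {"TARG_RA": 10.5, "TARG_DEC": -5.0},
}
subgroups = split_by_pointing(group)
```

When more than one subgroup comes back, each would be treated as its own L3 product, as the description states.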
- class spacekit.preprocessor.scrub.Scrubber(data=None, col_order=None, output_path=None, output_file=None, dropnans=True, save_raw=True, name='Scrubber', **log_kws)
Base parent class for preprocessing data. Includes some basic column scrubbing methods for pandas dataframes. The heavy lifting is done via subclasses.
- Parameters:
data (pandas.DataFrame or dict, optional) – dataset to be scrubbed, by default None
col_order (list, optional) – order input feature columns, by default None
output_path (str or Path, optional) – path on local disk to save scrubbed dataset, by default None
output_file (str, optional) – name to give scrubbed dataset file, by default None
dropnans (bool, optional) – find and remove any NaNs, by default True
save_raw (bool, optional) – save data as csv on local disk before any encoding is performed, by default True
name (str, optional) – logger name (mutable for subclasses), by default “Scrubber”