spacekit.preprocessor.ingest

JWST Calibration Data Ingest

At STScI, additional model training data is acquired daily from the telescope’s calibration pipeline. Because data collection runs on an automated 24-hour cycle, some Level 3 products may still be processing when the data is collected. As a result, a given input file can contain groups of L1 exposures with no matching L3 product. JwstCalIngest runs preprocessing on all L1 inputs and attempts to match each group with an L3 product in the same file.

Any complete datasets (where a match is identified) are inserted into the “database”, a file called training.csv.
Any remaining L1 exposures that did not find a match are stored into a separate “table” called ingest.csv.

The next time this ingest process is run, the script loads both the new data and the prior (unmatched) data. The assumption is that the missing L3 product(s), and sometimes additional L1 exposures for the same association, will eventually complete the pipeline and show up in subsequent files.
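The accumulate-and-retry cycle described above can be sketched with pandas as follows. This is a minimal illustration, not the JwstCalIngest implementation: the `level` column, the `split_matched` helper, and the toy dataset names are hypothetical, and matching is reduced to equality on a single `params` column.

```python
import pandas as pd


def split_matched(new_rows, prior_unmatched=None):
    """Pool new rows with prior unmatched L1 rows, then split by match status.

    Returns (matched L3 products, L1 exposures still waiting for a product).
    """
    frames = [new_rows]
    if prior_unmatched is not None and not prior_unmatched.empty:
        frames.insert(0, prior_unmatched)
    pool = pd.concat(frames, ignore_index=True)

    l1 = pool[pool["level"] == 1]
    l3 = pool[pool["level"] == 3]
    # Simplification: an L1 group and its L3 product share the same params value
    matched_l3 = l3[l3["params"].isin(l1["params"])]
    pending_l1 = l1[~l1["params"].isin(l3["params"])]
    return matched_l3, pending_l1


# Day 1: group "B" has an L1 exposure but its L3 product is still processing
day1 = pd.DataFrame(
    {
        "dataset": ["a_exp1", "b_exp1", "a_prod"],
        "level": [1, 1, 3],
        "params": ["A", "B", "A"],
    }
)
training, pending = split_matched(day1)

# Day 2: the missing L3 product for group "B" shows up in a later file
day2 = pd.DataFrame({"dataset": ["b_prod"], "level": [3], "params": ["B"]})
training2, pending2 = split_matched(day2, prior_unmatched=pending)
```

In this toy run, `training` and `training2` stand in for rows that would be added to training.csv, while `pending` stands in for the rows held in ingest.csv until a later run.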

Additional output files are model-specific encoded subsets of preprocessed and ingest. Data is inserted into these in the same manner, as appropriate. The files actually used for model training are named train-{modelname}.csv, while training.csv contains all the original columns with unencoded values and is intended primarily for data analysis and debugging.

Database: {outpath}

Tables: {.csv files}

Accumulated data storing unencoded values:
preprocessed: complete L1-L3 groupings
ingest: unmatched L1 exposures
mosaics: c1XXX association candidate L3 products (currently not supported)

Encoded datasets finalized and ready for model training (input features + y-targets):
train-image: L3 image model
train-spec: L3 spectroscopy model
train-tac: L3 TSO/AMI/CORON model

Encoded input features of remaining L1 exposures (y-targets pending):
rem-image.csv
rem-spec.csv
rem-tac.csv

class spacekit.preprocessor.ingest.JwstCalIngest(input_path=None, pfx='', outpath=None, save_l1=False, **log_kws)

Loads raw JWST Calibration Pipeline metadata from local disk (input_path) and runs the initial ML preprocessing steps required before model training. The resulting dataframes are “ingested” into any pre-existing training sets located in outpath. This outpath acts as the primary database containing several “tables” (dataframes stored in .csv files). This class is designed to run on a single file or multiple files at a time (use pfx to restrict which files are loaded).

Input file naming convention: YYYY-MM-DD_%d.csv (%d = day of year), e.g. 2024-02-21_052.csv. Alternate formats are currently not supported because filenames are used to store date info. Examples:

To ingest multiple files from November 2023, set pfx="2023-11".
To ingest only one file from January 3, 2024, set pfx="2024-01-03".
You can also pass in a wildcard: pfx="*_3" would search for all data collected on days 300-365 of any year, while pfx="2023*_3" would do the same but only for the year 2023.
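The pfx examples above can be tried out on a list of file names without touching the disk. The sketch below assumes the search pattern is simply pfx followed by `*.csv` (an assumption based on the description of read_files, not a copy of its code); the `select` helper and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical file names following the YYYY-MM-DD_DOY.csv convention
names = [
    "2023-11-05_309.csv",
    "2023-11-30_334.csv",
    "2024-01-03_003.csv",
    "2024-02-21_052.csv",
    "2024-11-05_310.csv",
]


def select(pfx, names):
    """Return the names matching pfx, treating pfx as a glob-style start pattern."""
    pattern = f"{pfx}*.csv"  # assumption: pfx is followed by a wildcard and .csv
    return [n for n in names if fnmatchcase(n, pattern)]


november_2023 = select("2023-11", names)
single_day = select("2024-01-03", names)
late_any_year = select("*_3", names)
late_2023 = select("2023*_3", names)
```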

The contents of raw metadata files are expected to contain:

columns consistent with FITS header keyword values used in JWST Cal model training (see spacekit.skopes.jwst.cal.config)

rows of Level 1/1b exposures (inputs/features) along with Level 3 products

imagesize (memory footprint) for each L3 product (outputs/target)

Parameters:

input_path (str (path), optional) – directory path to csv files on local disk, by default None (current working directory)
pfx (str, optional) – filename start pattern (e.g. “2023” or “*-12-”), by default “”
outpath (str (path), optional) – directory path to save (and/or update) preprocessed files on local disk, by default None (current working directory)
save_l1 (bool, optional) – save matched level 1 input data to separate file, by default False

convert_imagesize_units(data=None)

Converts the imagesize (memory footprint) column to Gigabyte units and stores the values in a new column named imgsize_gb for each exp_type in the self.data attribute (image, spec, etc). If the data kwarg is None, this change is also applied to the raw (unencoded) versions (self.raw). Otherwise the conversion is made to the dataframe passed into the data kwarg.

Parameters: data (pandas.DataFrame, optional) – Apply the unit conversion to a particular dataframe instead of the default self.data, by default None
Returns: Dataframe with additional column ‘imgsize_gb’ containing the GB values converted from imagesize column.
Return type: pandas.DataFrame
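A minimal sketch of the unit conversion is shown below. The source does not state the unit of the imagesize column, so the kilobyte input unit (and therefore the 10**6 divisor) is an assumption made purely for illustration; `add_imgsize_gb` is a hypothetical helper, not the library method.

```python
import pandas as pd

KB_PER_GB = 10**6  # assumption: imagesize is recorded in kilobytes


def add_imgsize_gb(df):
    """Return a copy of df with an imgsize_gb column derived from imagesize."""
    out = df.copy()
    out["imgsize_gb"] = out["imagesize"] / KB_PER_GB
    return out


# Toy L3 products with made-up memory footprints
products = pd.DataFrame(
    {"imagesize": [2_500_000, 500_000]},
    index=pd.Index(["prod_a", "prod_b"], name="Dataset"),
)
converted = add_imgsize_gb(products)
```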

drop_level2(df)

Determines which dag column values relate to Level 1 and Level 3 according to their names, then drops any rows from the DataFrame that do not match these values. Note: starting on 6/13/2025, a change in the data collection process added a new dag value ‘ESTIMATE_LEVEL_3_MEMORY’ which is unrelated to the actual processing of a dataset on its designated server node and therefore rows matching this value are also removed.

Parameters: df (pandas.DataFrame) – dataframe to search and modify
Returns: dataframe with only L1 and L3 datasets
Return type: pandas.DataFrame
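The filtering step can be sketched as a boolean mask on the dag column. The dag values and the substring rule used here are invented for illustration (the real names and the rule for recognizing L1 and L3 values are not given in this documentation); only the ‘ESTIMATE_LEVEL_3_MEMORY’ exclusion comes from the description above.

```python
import pandas as pd

EXCLUDED_DAGS = {"ESTIMATE_LEVEL_3_MEMORY"}


def keep_l1_l3(df):
    """Keep rows whose dag value looks like Level 1 or Level 3 processing."""
    dags = df["dag"].astype(str)
    # Hypothetical naming rule: the level appears in the dag name as LEVEL_<n>
    is_l1_or_l3 = dags.str.contains("LEVEL_1|LEVEL_3", regex=True)
    return df[is_l1_or_l3 & ~dags.isin(EXCLUDED_DAGS)]


# Hypothetical dag values
raw = pd.DataFrame(
    {
        "dataset": ["exp1", "exp1_l2", "prod1", "prod1_est"],
        "dag": ["CAL_LEVEL_1", "CAL_LEVEL_2A", "CAL_LEVEL_3", "ESTIMATE_LEVEL_3_MEMORY"],
    }
)
kept = keep_l1_l3(raw)
```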

drop_mosaics(): Separate mosaic L3 products and save to mosaics.csv on local disk.

drop_unmatched(): Store any unmatched inputs into the self.raw attribute, then remove them from the training set. Logs the percentage of L3 products successfully matched during this ingest run (anything less than 100% indicates an error).

extrapolate(): Match each group of L1 input exposures to a single L3 product, then separate unmatched exposures from the dataframe and convert imagesize to gigabytes. If any L3 products remain unmatched, the preliminary assumption is that these datasets were reprocessed, and an attempt is made to update the relevant features for each such product in the existing training file on local disk (training.csv), if it exists. The log reports a warning if multiple L3 products match a particular group of L1 inputs, or if L3 products remain that could not be matched with any input exposures or with a previous L3 product in the existing training set. In both cases, these products are stored as a list in the self.l3 attribute for further analysis and debugging, since either occurrence indicates an error in the way data is being ingested (often as a result of unexpected changes made within the JWST pipeline after a given release).

get_unencoded()

Retrieve the raw (unencoded) L3 products generated by the JWST Scrubber using preprocessed L1 exposure groups.

Returns: Dictionary of each exp_type’s dataframe of raw (unencoded) L3 products generated based on groups of L1 input exposures run through the JWST Scrubber.
Return type: dict

ingest_data(): Loads all relevant files to be ingested into a single dataframe, adding columns for date, year and day of year (doy) based on the file names to mark the file from which each dataset originated. Only observations relating to JWST calibration levels 1 and 3 are kept; the rest are dropped.
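Because the date information lives in the file name, deriving the date, year and doy values comes down to parsing the YYYY-MM-DD_DOY stem. A minimal sketch using only the standard library is shown below; `parse_ingest_filename` is a hypothetical helper, and the consistency check between the date and the day-of-year suffix is an addition for illustration rather than documented behavior.

```python
from datetime import datetime
from pathlib import Path


def parse_ingest_filename(fname):
    """Extract date, year and day of year from a YYYY-MM-DD_DOY.csv file name."""
    stem = Path(fname).stem  # e.g. "2024-02-21_052"
    date_str, doy_str = stem.split("_")
    date = datetime.strptime(date_str, "%Y-%m-%d").date()
    doy = int(doy_str)
    # Illustrative sanity check: the suffix should agree with the calendar date
    if date.timetuple().tm_yday != doy:
        raise ValueError(f"day of year {doy} does not match date {date_str}")
    return {"date": date.isoformat(), "year": date.year, "doy": doy}


info = parse_ingest_filename("2024-02-21_052.csv")
```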

initial_scrub(): Initial preprocessing renames and adds several columns, sets the df index to Dataset, recasts datatypes, and drops the following:
- older duplicates and exposure types known to be unrelated to Level 3 processing
- redundant MIRI IFU products (only 1 channel per dataset is kept)
- mosaics (estimates for L3 datasets used to create a mosaic accurately reflect compute requirements)

load_and_recast(dpath, idxcol=None)

Loads in a dataframe from file on local disk generated by a prior ingest and recasts data types as needed for certain columns where that information is lost during a save.

Parameters:

dpath (str or Path) – path on local disk where file is stored
idxcol (str, optional) – custom index column name, by default None

Returns:

dataframe loaded with columns recast as necessary

Return type:

pandas.DataFrame

load_priors(): Loads previously ingested but unmatched datasets from the ‘ingest.csv’ file located in outpath on local disk. Checks the params column and extracts any rows that match the current ingest dataframe in order to attempt a new match. This is necessary for datasets that take multiple days to complete processing.

static mark_mosaics(x)

Identify mosaic L3 products based on the dataset’s name format.

Parameters: x (str) – Dataset name
Returns: True if the dataset name is a mosaic, otherwise False
Return type: bool
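A name-based check of this kind could look like the sketch below. It assumes that the "c1XXX" association candidate ID mentioned earlier appears in the dataset name as "-c1" followed by three digits and an underscore; that pattern, the `is_mosaic_name` helper, and the example names are illustrative assumptions rather than the library's actual rule.

```python
import re

# Assumption: mosaic (association candidate) products carry a -c1XXX_ token
MOSAIC_PATTERN = re.compile(r"-c1\d{3}_")


def is_mosaic_name(name):
    """Return True if the dataset name contains a c1XXX candidate token."""
    return bool(MOSAIC_PATTERN.search(name))


# Illustrative dataset names
mosaic_flag = is_mosaic_name("jw01234-c1001_t001_nircam_clear-f200w")
regular_flag = is_mosaic_name("jw01234-o001_t001_nircam_clear-f200w")
```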

match_product_groups(exp_type)

Matches an L3 product with its associated L1 input exposures:
1. If TARGNAME: match using params (PID-OBS-OPTELEM-SUBARRAY-EXP_TYPE) + TARGNAME
2. Elif fixed target: match using params + targra (TARG_RA rounded to 6 sig. digits)
3. Else: match params + gs_mag

Parameters: exp_type (str) – model-based ‘exp_type’ grouping: IMAGE, SPEC, TAC, or FGS
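The three-step order above amounts to choosing which extra attribute accompanies params in the match. The sketch below shows one way to express that choice; the dictionary keys (`targname`, `fixed_target`, `targ_ra`, `gs_mag`) and both helper functions are hypothetical stand-ins for whatever the class stores internally.

```python
def round_sig(value, digits=6):
    """Round a number to the given count of significant digits."""
    return float(f"{value:.{digits}g}")


def choose_match_key(info):
    """Return the (column, value) pair used alongside params for matching."""
    if info.get("targname"):  # 1. a target name is available
        return ("targname", info["targname"])
    if info.get("fixed_target"):  # 2. fixed target: use the rounded right ascension
        return ("targra", round_sig(info["targ_ra"], 6))
    return ("gs_mag", info["gs_mag"])  # 3. fall back to guide star magnitude


by_name = choose_match_key({"targname": "M31", "gs_mag": 12.5})
by_ra = choose_match_key(
    {"targname": None, "fixed_target": True, "targ_ra": 83.8221234, "gs_mag": 12.5}
)
by_mag = choose_match_key({"targname": "", "fixed_target": False, "gs_mag": 12.5})
```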

match_query(info, extra_param=None)

Queries the dataframe for L3 products matching the shared metadata attributes for a group of L1 input exposures. If a value is passed into the extra_param kwarg, the query is further restricted to include products with a value matching this additional parameter. If this initial query returns 0 results, a second broader query without the additional param is automatically run. By default, the query attempts to find L3 products within the dataframe whose params column value matches that of the L1 inputs’ params column.

Parameters:

info (dict) – Key-value pairs of metadata pertaining to all L1 input exposures associated with a single L3 product.
extra_param (str, optional) – Column name to match against an additional parameter value within the dataframe, by default None

Returns:

L3 products matching the specified metadata (and query parameters if requested).

Return type:

list
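The narrow-then-broaden behavior described for this method can be illustrated on plain dictionaries. This is a simplified sketch under stated assumptions, not the method's code: `match_query_sketch` is a hypothetical function, products are represented as dicts with a `dataset` key, and the extra parameter is compared by simple equality.

```python
def match_query_sketch(products, info, extra_param=None):
    """Return names of products whose params (and optional extra column) match info."""
    base = [p for p in products if p["params"] == info["params"]]
    if extra_param is not None:
        narrowed = [p for p in base if p.get(extra_param) == info.get(extra_param)]
        if narrowed:
            return [p["dataset"] for p in narrowed]
        # No hits with the extra parameter: fall through to the broader query
    return [p["dataset"] for p in base]


# Toy L3 products
products = [
    {"dataset": "p1", "params": "A", "targname": "M31"},
    {"dataset": "p2", "params": "A", "targname": "M33"},
    {"dataset": "p3", "params": "B", "targname": "M31"},
]

restricted = match_query_sketch(products, {"params": "A", "targname": "M31"}, "targname")
fallback = match_query_sketch(products, {"params": "A", "targname": "NGC1"}, "targname")
default = match_query_sketch(products, {"params": "A", "targname": "M31"})
```

A result with more than one name, as in `fallback` and `default`, corresponds to the ambiguous case that extrapolate reports as a warning.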

read_files(): Collects a list of filenames to be ingested from local disk according to the glob pattern combining input_path and pfx, ending with csv. A warning is issued if no files matching the pattern are found. The list of files is stored in the class attribute files.

recast_dtypes(df)

When loading a saved dataframe, some datatypes need to be recast appropriately in order to be able to edit existing / insert new values.

Parameters: df (pandas.DataFrame) – dataframe to be recast
Returns: recast dataframe
Return type: pandas.DataFrame
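The reason recasting is needed can be seen from a CSV round trip: type information is not stored in the file, so pandas re-infers it on load. The sketch below demonstrates this with two made-up columns; which columns the library actually recasts is not stated here, so the `date` and zero-padded `pid` columns and the `recast` helper are assumptions for illustration.

```python
import io

import pandas as pd

original = pd.DataFrame(
    {
        "Dataset": ["jw_a", "jw_b"],
        "date": pd.to_datetime(["2024-02-21", "2024-02-22"]),
        "pid": ["01234", "00567"],  # hypothetical zero-padded identifiers
    }
)

# Round trip through CSV text, as happens when a table is saved and reloaded
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)  # date is now text, pid is now an integer


def recast(df):
    """Restore the dtypes that were lost when the table was written to CSV."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out["pid"] = out["pid"].astype(str).str.zfill(5)
    return out


restored = recast(loaded)
```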

reduce_mirifu_channels(): Append channel info to the params string and drop MIRI IFU L3 products from channels 2 and 4 (keep only 1 and 3). Channels 1-2 use the same input exposures, and the same goes for channels 3-4.

NOTE: The memory footprint for each L3 product is the same regardless of channel or subchannel (‘band’), so the inclusion of L3 products from both channels 1 and 3 is likely to be redundant for ML training purposes. The resulting metadata features will show some variability between ch1/2 and ch3/4 L3 products because the input exposures are distinct. Further analysis is needed to determine whether such variability simply adds noise to the training set, and a decision should be made at training time whether or not to include both channels. Adjustments to inference preprocessing may be needed so that the model ignores channel/subchannel altogether and treats the entire group of inputs as pertaining to a single L3 product (jw_PID_OBS_TRG_miri_). The memory footprint estimates for each individual channel/subchannel combination can be inferred from a single inference output and applied to all relevant ‘subproducts’.

save_ingest_data(): Adds unmatched L1 inputs into ‘ingest.csv’, matched L3 products to ‘training.csv’. If save_l1 attribute is True, matched L1 input exposures are saved to a separate file ‘level1.csv’.

save_training_sets(): Adds preprocessed ML training data for each model type to its respective file on local disk: train-{exp_type}.csv. The raw (unencoded) versions are also saved to local disk as raw-{exp_type}.csv. Any remaining L1 inputs that did not have a matching L3 product are saved to rem-{exp_type}.csv, primarily for debugging purposes.

scrub_exposures(): Preprocess the L1 input exposures through the JWST Scrubber. See JwstCalScrubber for details.

set_outpath(value=None)

Initialize class variables relating to file paths on local disk where ingested data will be stored. If nothing is passed into the value kwarg, the default base path for outputs will be the same as inputs.

Parameters: value (str or Path, optional) – custom path to a directory where output files will be saved, by default None

set_params(): Creates a new dataframe column containing a concatenated string of keywords that uniquely identify a group of related L1 inputs and their L3 output. This is used (in combination with other columns such as targ_ra/dec) to match L1 exposures with their L3 product. WFSC params are generated separately.
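Based on the PID-OBS-OPTELEM-SUBARRAY-EXP_TYPE layout given under match_product_groups, building the params string can be sketched as a simple join. The lowercase key names, the `build_params` helper, and the sample values are hypothetical, and any normalization the library applies to individual fields (padding, casing) is not reproduced here.

```python
PARAM_KEYS = ("pid", "obs", "optelem", "subarray", "exp_type")  # assumed column names


def build_params(row):
    """Join the identifying keywords of one exposure into a single params string."""
    return "-".join(str(row[key]) for key in PARAM_KEYS)


# Two exposures from the same hypothetical observation share one params value
exposure_1 = {"pid": 1234, "obs": 1, "optelem": "F200W", "subarray": "FULL", "exp_type": "NRC_IMAGE"}
exposure_2 = dict(exposure_1)
params_1 = build_params(exposure_1)
params_2 = build_params(exposure_2)
```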

update_dags(): Update the lists of L1 and L3 dag values once a collection of datasets from multiple files is ingested (including priors loaded from ingest.csv).

update_repro(): Sometimes an L3 product is reprocessed and will not have any matching L1 inputs. Updates the imagesize and date attributes of the previous record (if found) with those of the new one.
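The update described above can be sketched as an index lookup against the existing training table. The helper name `update_reprocessed`, the toy product names, and the choice to return the products that were not found are illustrative assumptions; only the imagesize and date columns come from the description.

```python
import pandas as pd


def update_reprocessed(training, new_l3):
    """Copy imagesize and date from reprocessed products onto prior records."""
    updated = training.copy()
    not_found = []
    for name, row in new_l3.iterrows():
        if name in updated.index:
            updated.at[name, "imagesize"] = row["imagesize"]
            updated.at[name, "date"] = row["date"]
        else:
            not_found.append(name)  # no previous record to update
    return updated, not_found


training = pd.DataFrame(
    {"imagesize": [100, 200], "date": ["2024-01-03", "2024-01-03"]},
    index=pd.Index(["prod_a", "prod_b"], name="Dataset"),
)
reprocessed = pd.DataFrame(
    {"imagesize": [250, 400], "date": ["2024-02-21", "2024-02-21"]},
    index=pd.Index(["prod_b", "prod_c"], name="Dataset"),
)
updated, not_found = update_reprocessed(training, reprocessed)
```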

class spacekit.preprocessor.ingest.SvmAlignmentIngest(input_path=None, outpath=None)

Class for ingesting and preprocessing HST single visit mosaic alignment classifier datasets

Parameters:

input_path (str or Path, optional) – path on local disk to the input data, by default None
outpath (str or Path, optional) – path on local disk to save outputs, by default None

run_preprocessing(h5=None, fname='svm_data', visit=None)

Scrapes SVM data from raw files, preprocesses the dataframe for the MLP classifier, and generates png images for the image CNN. TODO: if no JSON files are found, look for a results_*.csv file instead and preprocess via an alternative method.

Parameters:

input_path (str) – path to SVM dataset directory
h5 (str, optional) – load from existing hdf5 file, by default None
fname (str, optional) – base filename to give the output files, by default “svm_data”
output_path (str, optional) – where to save output files. Defaults to current working directory, by default None
json_pattern (str, optional) – glob-based search pattern, by default “_total*_svm_.json”
visit (str, optional) – single visit name (e.g. “id8f34”) matching subdirectory of input_path; will search and preprocess this visit only (rather than all visits contained in the input_path), by default None
crpt (int, optional) – set to 1 if using synthetic corruption data, by default 0
draw (int, optional) – generate png images from dataset, by default 1

Returns:

preprocessed Pandas dataframe

Return type:

dataframe

spacekit.preprocessor.ingest.hst_svm_ingest(**kwargs): Main calling function for running HST SVM Alignment Data Ingest.

spacekit.preprocessor.ingest.jwst_cal_ingest(**kwargs): Main calling function for running JWST Calibration Data Ingest.