spacekit.preprocessor.ingest
JWST Calibration Data Ingest
At STScI, additional model training data is acquired daily from the telescope’s calibration pipeline. Because data collection runs on an automated 24-hour cycle, some Level 3 products may still be processing at the time data is collected. As a result, a given input file can contain groups of L1 exposures with no matching L3 product. JwstCalIngest runs preprocessing on all L1 inputs and attempts to match them with an L3 product in the same file.
Any complete datasets (where a match is identified) are inserted into the “database”, a file called training.csv. Any remaining L1 exposures that did not find a match are stored in a separate “table” called ingest.csv.
The next time this ingest process is run, the script will load both the new data as well as prior (unmatched) data. The assumption here is that the missing L3 product(s) (and sometimes even additional L1 exposures for this association) will eventually complete the pipeline and show up in subsequent files.
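The carry-over behaviour described above can be sketched with plain dictionaries. This is only an illustration under stated assumptions, not the JwstCalIngest implementation: the function name ingest_run, the single-letter grouping keys, and the product names are all hypothetical stand-ins for the real dataframes and matching logic.

```python
def ingest_run(new_l1, new_l3, priors, training):
    """One simplified ingest pass (hypothetical stand-in for the real class).

    new_l1 / priors: {exposure_name: group_key}
    new_l3:          {group_key: product_name}
    training:        list of (product_name, [exposure_names]), updated in place
    Returns the L1 exposures that are still waiting for an L3 product.
    """
    # Prior unmatched exposures are reloaded alongside the new data.
    pending = {**priors, **new_l1}
    groups = {}
    for exposure, key in pending.items():
        groups.setdefault(key, []).append(exposure)

    unmatched = {}
    for key, exposures in groups.items():
        if key in new_l3:
            # Complete dataset: belongs in the training table.
            training.append((new_l3[key], sorted(exposures)))
        else:
            # No L3 product yet: held over for the next run.
            unmatched.update({e: key for e in exposures})
    return unmatched


training = []
# Day 1: group "B" has an L1 exposure but its L3 product is still processing.
held = ingest_run({"a1": "A", "b1": "B"}, {"A": "prodA"}, {}, training)
held_after_day1 = dict(held)
# Day 2: the L3 product for "B" arrives together with one more L1 exposure.
held = ingest_run({"b2": "B"}, {"B": "prodB"}, held, training)
```

After the second pass, the exposure held over from day 1 is grouped with the late exposure and both are attached to the newly arrived product.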
Additional output files are model-specific encoded subsets of preprocessed and ingest. Data is inserted into these in the same manner as appropriate. The actual files to be used for model training are named as train-{modelname}.csv, while training.csv contains all the original columns with unencoded values and is intended to be used primarily for data analysis and debugging purposes.
Database: {outpath}
- Tables: {.csv files}
  - Accumulated data storing unencoded values
    - preprocessed: complete L1-L3 groupings
    - ingest: unmatched L1 exposures
    - mosaics: c1XXX association candidate L3 products (currently not supported)
  - Encoded datasets finalized and ready for model training (input features + y-targets)
    - train-image: L3 image model
    - train-spec: L3 spectroscopy model
    - train-tac: L3 TSO/AMI/CORON model
  - Encoded input features of remaining L1 exposures (y-targets pending)
    - rem-image.csv
    - rem-spec.csv
    - rem-tac.csv
- class spacekit.preprocessor.ingest.JwstCalIngest(input_path=None, pfx='', outpath=None, save_l1=False, **log_kws)
Loads raw JWST Calibration Pipeline metadata from local disk (input_path) and runs the initial ML preprocessing steps required before model training. The resulting dataframes are “ingested” into any pre-existing training sets located in outpath. This outpath acts as the primary database containing several “tables” (dataframes stored in .csv files). The class can run on a single file or on multiple files at a time (narrow the selection using pfx).
Input file naming convention: YYYY-MM-DD_%d.csv (%d = day of year), e.g. 2024-02-21_052.csv. Alternate formats are currently not supported because filenames are used to store date info.
Examples:
To ingest multiple files from November 2023, set pfx="2023-11".
To ingest only one file from January 3, 2024, set pfx="2024-01-03".
You can also pass in a wildcard: pfx="*_3" would search for all data collected on days 300-365 of any year, while pfx="2023*_3" would do the same but only for the year 2023.
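To see how the pfx examples above select files, here is a minimal sketch using the standard-library fnmatch module. It assumes the search pattern is the prefix followed by a wildcard and a csv extension (the exact pattern built by the class is not shown here), and the filenames and the select_files helper are hypothetical.

```python
from fnmatch import fnmatch


def select_files(filenames, pfx=""):
    # Assumed pattern: the prefix, then anything, ending in ".csv".
    pattern = f"{pfx}*.csv"
    return [name for name in filenames if fnmatch(name, pattern)]


# Hypothetical files following the YYYY-MM-DD_%d.csv naming convention.
files = [
    "2023-11-05_309.csv",
    "2023-11-06_310.csv",
    "2024-01-03_003.csv",
    "2024-12-01_336.csv",
]

november_2023 = select_files(files, "2023-11")   # every file from November 2023
single_day = select_files(files, "2024-01-03")   # one specific date
late_any_year = select_files(files, "*_3")       # day of year starting with 3
late_2023 = select_files(files, "2023*_3")       # same, restricted to 2023
```

Because the day of year is zero-padded to three digits, the "_3" fragment only matches days in the 300s and never a file such as 2024-01-03_003.csv.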
The raw metadata files are expected to contain:
- columns consistent with FITS header keyword-values used in JWST Cal model training (see spacekit.skopes.jwst.cal.config)
- rows of Level 1/1b exposures (inputs/features) along with Level 3 products
- imagesize (memory footprint) for each L3 product (outputs/target)
- Parameters:
input_path (str (path), optional) – directory path to csv files on local disk, by default None (current working directory)
pfx (str, optional) – filename start pattern (e.g. “2023” or “*-12-”), by default “”
outpath (str (path), optional) – directory path to save (and/or update) preprocessed files on local disk, by default None (current working directory)
save_l1 (bool, optional) – save matched level 1 input data to separate file, by default False
- convert_imagesize_units(data=None)
Converts the imagesize (memory footprint) column to gigabyte units and stores the values in a new column named imgsize_gb for each exp_type in the self.data attribute (image, spec, etc.). If the data kwarg is None, this change is also applied to the raw (unencoded) versions (self.raw). Otherwise the conversion is applied to the dataframe passed into the data kwarg.
- Parameters:
data (pandas.DataFrame, optional) – Apply the unit conversion to a particular dataframe instead of the default self.data, by default None
- Returns:
Dataframe with an additional column ‘imgsize_gb’ containing the GB values converted from the imagesize column.
- Return type:
pd.DataFrame
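As a rough illustration of the conversion, the sketch below adds an imgsize_gb value to plain dictionaries. The source unit of imagesize is not stated in this section, so the sketch assumes kilobytes and decimal gigabytes (10^6 kB per GB); both the assumed unit and the add_imgsize_gb helper are hypothetical, and the divisor is exposed as a parameter so a different unit can be substituted.

```python
# Assumption: imagesize is recorded in kilobytes and GB means 10**6 kB.
KB_PER_GB = 1e6


def add_imgsize_gb(rows, units_per_gb=KB_PER_GB):
    """Return copies of the rows with an added imgsize_gb value."""
    return [{**row, "imgsize_gb": row["imagesize"] / units_per_gb} for row in rows]


# Hypothetical L3 products with made-up memory footprints.
rows = [
    {"dataset": "prod1", "imagesize": 2_500_000},
    {"dataset": "prod2", "imagesize": 500_000},
]
converted = add_imgsize_gb(rows)
```

The original imagesize values are left in place, mirroring the description above in which the gigabyte values go into a new column rather than overwriting the old one.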
- drop_level2(df)
Determines which dag column values relate to Level 1 and Level 3 according to their names, then drops any rows from the DataFrame that do not match these values. Note: starting on 6/13/2025, a change in the data collection process added a new dag value, ‘ESTIMATE_LEVEL_3_MEMORY’, which is unrelated to the actual processing of a dataset on its designated server node; rows matching this value are therefore also removed.
- Parameters:
df (pandas.DataFrame) – dataframe to search and modify
- Returns:
dataframe with only L1 and L3 datasets
- Return type:
pandas.DataFrame
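A name-based filter of this kind might look like the following sketch. Only the ‘ESTIMATE_LEVEL_3_MEMORY’ value comes from the description above; the other dag names, the substring rule used to recognise Level 1 and Level 3, and the helper names are assumptions made for illustration. The example shows why the estimate value needs an explicit exclusion: it would otherwise pass a naive Level 3 name check.

```python
# Taken from the description above: not an actual processing step.
EXCLUDED_DAGS = {"ESTIMATE_LEVEL_3_MEMORY"}


def is_l1_or_l3(dag):
    # Assumed naming rule: the level number appears in the dag name.
    if dag in EXCLUDED_DAGS:
        return False
    return "LEVEL_1" in dag or "LEVEL_3" in dag


def keep_l1_l3(rows):
    return [row for row in rows if is_l1_or_l3(row["dag"])]


# Hypothetical rows; the CAL_LEVEL_* names are placeholders.
rows = [
    {"dataset": "exp_a", "dag": "CAL_LEVEL_1"},
    {"dataset": "exp_b", "dag": "CAL_LEVEL_2"},
    {"dataset": "prod_c", "dag": "CAL_LEVEL_3"},
    {"dataset": "prod_c", "dag": "ESTIMATE_LEVEL_3_MEMORY"},
]
kept = keep_l1_l3(rows)
```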
- drop_unmatched()
Stores any unmatched inputs in the self.raw attribute, then removes them from the training set. Logs the percentage of L3 products successfully matched during this ingest run (anything less than 100% indicates an error).
- extrapolate()
Match each group of L1 input exposures to a single L3 product, then separate unmatched exposures from the dataframe and convert imagesize to gigabytes. If any L3 products remain unmatched, the preliminary assumption is that these datasets were reprocessed, and an attempt is made to update the relevant features for this product within the existing training file stored on local disk at training.csv, if it exists. The log reports a warning if multiple L3 products match a particular group of L1 inputs, or if L3 products remain that could not be matched with any input exposures or with a previous L3 product in the existing training set. In both cases, these products are stored as a list in the self.l3 attribute for further analysis and debugging, since either occurrence indicates an error in the way data is being ingested (often as a result of unexpected changes made within the JWST pipeline after a given release).
- get_unencoded()
Retrieve the raw (unencoded) L3 products generated by the JWST Scrubber using preprocessed L1 exposure groups.
- Returns:
Dictionary of each exp_type’s dataframe of raw (unencoded) L3 products generated based on groups of L1 input exposures run through the JWST Scrubber.
- Return type:
dict
- ingest_data()
Loads all relevant files to be ingested into a single dataframe, adding columns for date, year and day of year (doy) based on the file names to mark the file from which each dataset originated. Only observations relating to JWST calibration levels 1 and 3 are kept; the rest are dropped.
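The date, year and doy values can be recovered from a filename that follows the YYYY-MM-DD_%d.csv convention. The sketch below is one way to do it with the standard library; the parse_ingest_filename helper and the consistency check are illustrative additions, not part of the documented class.

```python
from datetime import datetime
from pathlib import Path


def parse_ingest_filename(path):
    """Extract date, year and day of year from a YYYY-MM-DD_DOY.csv filename."""
    stem = Path(path).stem                      # e.g. "2024-02-21_052"
    date_str, doy_str = stem.split("_")
    date = datetime.strptime(date_str, "%Y-%m-%d").date()
    doy = int(doy_str)
    # Extra safety check (an assumption): the two halves of the name must agree.
    if date.timetuple().tm_yday != doy:
        raise ValueError(f"day of year {doy} does not match date {date_str}")
    return {"date": date.isoformat(), "year": date.year, "doy": doy}


info = parse_ingest_filename("data/2024-02-21_052.csv")
```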
- initial_scrub()
Initial preprocessing renames and adds several columns, sets the df index to Dataset, recasts datatypes, and drops the following:
  - older duplicates and exposure types known to be unrelated to Level 3 processing
  - redundant MIRI IFU products (only 1 channel per dataset is kept)
  - mosaics (estimates for L3 datasets used to create a mosaic accurately reflect compute requirements)
- load_and_recast(dpath, idxcol=None)
Loads in a dataframe from file on local disk generated by a prior ingest and recasts data types as needed for certain columns where that information is lost during a save.
- load_priors()
Loads previously ingested but unmatched datasets from the ‘ingest.csv’ file located in output_path on local disk. Checks the params column and extracts any rows that match the current ingest dataframe in order to attempt a new match. This is necessary for datasets that take multiple days to complete processing.
- match_product_groups(exp_type)
Matches each L3 product with its associated L1 input exposures:
  1. If TARGNAME: match using params (PID-OBS-OPTELEM-SUBARRAY-EXP_TYPE) + TARGNAME
  2. Elif fixed target: match using params + targra (TARG_RA rounded to 6 sig. digits)
  3. Else: match using params + gs_mag
- Parameters:
exp_type (str) – model-based ‘exp_type’ grouping: IMAGE, SPEC, TAC, or FGS
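The three-step priority above can be sketched as a function that builds a matching key for one row. The field names (targname, fixed_target, targ_ra, gs_mag), the way a fixed target is flagged, and the example values are assumptions made for illustration; only the order of the rules and the rounding to 6 significant digits come from the description.

```python
def round_sig6(value):
    # Round a float to 6 significant digits via the "g" format specifier.
    return float(f"{value:.6g}")


def match_key(row):
    """Build the key used to pair L1 exposures with an L3 product (sketch)."""
    params = row["params"]
    if row.get("targname"):
        # Rule 1: a target name is available.
        return (params, row["targname"])
    if row.get("fixed_target"):
        # Rule 2: fixed target, fall back to the rounded right ascension.
        return (params, round_sig6(row["targ_ra"]))
    # Rule 3: otherwise use the guide star magnitude.
    return (params, row["gs_mag"])


# Hypothetical rows sharing the same params string.
named = {"params": "1234-001-F200W-FULL-NRC_IMAGE", "targname": "M31"}
fixed = {"params": "1234-001-F200W-FULL-NRC_IMAGE", "targname": "",
         "fixed_target": True, "targ_ra": 83.8220792}
other = {"params": "1234-001-F200W-FULL-NRC_IMAGE", "targname": "",
         "fixed_target": False, "gs_mag": 14.2}

keys = [match_key(named), match_key(fixed), match_key(other)]
```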
- match_query(info, extra_param=None)
Queries the dataframe for L3 products matching the shared metadata attributes for a group of L1 input exposures. By default, the query looks for L3 products within the dataframe whose params column value matches that of the L1 inputs’ params column. If a value is passed into the extra_param kwarg, the query is further restricted to products with a value matching this additional parameter. If this restricted query returns 0 results, a second, broader query without the additional parameter is run automatically.
- Parameters:
- Returns:
L3 products matching the specified metadata (and query parameters if requested).
- Return type:
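The restrict-then-fall-back behaviour of this query can be sketched over a list of dictionaries. This is a simplified stand-in that assumes exact equality on the params value and on the extra field; the query_products helper and the sample products are hypothetical and do not reproduce the dataframe query used by the class.

```python
def query_products(products, info, extra_param=None):
    """Find L3 products sharing params with the L1 group, with a fallback."""
    # Broad query: params must match.
    broad = [p for p in products if p["params"] == info["params"]]
    if extra_param is not None:
        # Narrow query: the extra field must match as well.
        narrow = [p for p in broad if p.get(extra_param) == info.get(extra_param)]
        if narrow:
            return narrow
        # Zero results from the narrow query: fall back to the broad one.
    return broad


# Hypothetical L3 products and L1 group metadata.
products = [
    {"name": "p1", "params": "K1", "targname": "M31"},
    {"name": "p2", "params": "K1", "targname": "M33"},
    {"name": "p3", "params": "K2", "targname": "M31"},
]
exact = query_products(products, {"params": "K1", "targname": "M31"}, "targname")
fallback = query_products(products, {"params": "K1", "targname": "NGC1"}, "targname")
```

In the second call no product has the requested target name, so the result is the broader params-only match; more than one product in that result corresponds to the multiple-match situation that extrapolate reports as a warning.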
- read_files()
Collects a list of filenames to be ingested from local disk according to the glob pattern combining input_path and pfx, ending with csv. A warning is issued if no files matching the pattern are found. The list of files is stored in the class attribute files.
- recast_dtypes(df)
When loading a saved dataframe, some datatypes need to be recast appropriately in order to be able to edit existing / insert new values.
- Parameters:
df (pandas.DataFrame) – dataframe to be recast
- Returns:
recast dataframe
- Return type:
pandas.DataFrame
- reduce_mirifu_channels()
Append channel info to the params string; drop MIRI IFU L3 products from channels 2 and 4 (keep only 1 and 3). Channels 1-2 use the same input exposures, and the same goes for channels 3-4.
NOTE: The memory footprint for each L3 product is the same regardless of channel or subchannel (‘band’), so the inclusion of L3 products from both channels 1 and 3 is likely to be redundant for ML training purposes. The resulting metadata features will show some variability between ch1/2 and ch3/4 L3 products because the input exposures are distinct. Further analysis is needed to determine whether such variability simply adds noise to the training set, and a decision should be made at training time whether or not to include both channels. Adjustments to inference preprocessing may need to be made so that the model ignores channel/subchannel altogether and treats the entire group of inputs as pertaining to a single L3 product (jw_PID_OBS_TRG_miri_). The memory footprint estimates for each individual channel/subchannel combination can be inferred from a single inference output and applied to all relevant ‘subproducts’.
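A minimal sketch of the channel reduction follows. It assumes every input row is a MIRI IFU L3 product carrying an integer channel field, and it appends the channel to params with a "-ch" suffix; the field name, the suffix format, and the reduce_channels helper are all hypothetical, since the exact string format is not given here.

```python
DROPPED_CHANNELS = (2, 4)  # from the description: keep only channels 1 and 3


def reduce_channels(products):
    """Drop channel 2/4 products and tag the rest with their channel."""
    kept = []
    for product in products:
        if product["channel"] in DROPPED_CHANNELS:
            continue
        # Assumed suffix format for the appended channel info.
        tagged = f"{product['params']}-ch{product['channel']}"
        kept.append({**product, "params": tagged})
    return kept


# Hypothetical MIRI IFU products, one per channel, sharing a params string.
products = [{"params": "X", "channel": ch} for ch in (1, 2, 3, 4)]
reduced = reduce_channels(products)
```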
- save_ingest_data()
Adds unmatched L1 inputs to ‘ingest.csv’ and matched L3 products to ‘training.csv’. If the save_l1 attribute is True, matched L1 input exposures are saved to a separate file, ‘level1.csv’.
- save_training_sets()
Adds preprocessed ML training data for each model type to its respective file on local disk: train-{exp_type}.csv. The raw (unencoded) versions are also saved to local disk as raw-{exp_type}.csv. Any remaining L1 inputs that did not have a matching L3 product are saved to rem-{exp_type}.csv, primarily for debugging purposes.
- scrub_exposures()
Preprocess the L1 input exposures through the JWST Scrubber. See JwstCalScrubber for details.
- set_outpath(value=None)
Initialize class variables relating to file paths on local disk where ingested data will be stored. If nothing is passed into the value kwarg, the default base path for outputs will be the same as inputs.
- Parameters:
value (str or Path, optional) – custom path to a directory where output files will be saved, by default None
- set_params()
Creates a new dataframe column containing a concatenated string of keywords that uniquely identify a group of related L1 inputs and their L3 output. This is used (in combination with other columns such as targ_ra/dec) to match L1 exposures with their L3 product. WFSC params are generated separately.
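Based on the PID-OBS-OPTELEM-SUBARRAY-EXP_TYPE format given under match_product_groups, building the params string might look like the sketch below. The lowercase field names, the example values, and the make_params helper are assumptions; the separate handling of WFSC params is not shown.

```python
# Field order follows PID-OBS-OPTELEM-SUBARRAY-EXP_TYPE; names are assumed.
PARAM_FIELDS = ("pid", "obs", "optelem", "subarray", "exp_type")


def make_params(row):
    """Concatenate the identifying keywords into a single params string."""
    return "-".join(str(row[field]) for field in PARAM_FIELDS)


# Hypothetical L1 exposure and L3 product belonging to the same group.
l1_exposure = {"pid": 1234, "obs": "001", "optelem": "F200W",
               "subarray": "FULL", "exp_type": "NRC_IMAGE", "dataset": "exp_a"}
l3_product = {"pid": 1234, "obs": "001", "optelem": "F200W",
              "subarray": "FULL", "exp_type": "NRC_IMAGE", "dataset": "prod_a"}

l1_params = make_params(l1_exposure)
l3_params = make_params(l3_product)
```

Since both rows yield the same string, a simple equality test on params is enough to bring an L1 group and its candidate L3 product together before the finer target-based checks are applied.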
- class spacekit.preprocessor.ingest.SvmAlignmentIngest(input_path=None, outpath=None)
Class for ingesting and preprocessing HST single visit mosaic alignment classifier datasets
- Parameters:
- run_preprocessing(h5=None, fname='svm_data', visit=None)
Scrapes SVM data from raw files, preprocesses the dataframe for the MLP classifier and generates png images for the image CNN. TODO: if no JSON files are found, look for a results_*.csv file instead and preprocess via an alternative method.
- Parameters:
input_path (str) – path to SVM dataset directory
h5 (str, optional) – load from existing hdf5 file, by default None
fname (str, optional) – base filename to give the output files, by default “svm_data”
output_path (str, optional) – where to save output files. Defaults to current working directory, by default None
json_pattern (str, optional) – glob-based search pattern, by default “_total*_svm_.json”
visit (str, optional) – single visit name (e.g. “id8f34”) matching subdirectory of input_path; will search and preprocess this visit only (rather than all visits contained in the input_path), by default None
crpt (int, optional) – set to 1 if using synthetic corruption data, by default 0
draw (int, optional) – generate png images from dataset, by default 1
- Returns:
preprocessed Pandas dataframe
- Return type:
dataframe