gcpy.benchmark.modules.benchmark_utils

Utility functions specific to the benchmark plotting/tabling scripts.

Functions

add_lumped_species_to_dataset(dset[, ...])

Calculates lumped species concentrations and adds them to an xarray Dataset.

archive_lumped_species_definitions(dst)

Archives lumped species definitions to a YAML file.

archive_species_categories(dst)

Writes the list of benchmark categories to a YAML file for archival purposes.

gcc_vs_gcc_dirs(config, subdir)

Convenience function to return GCC vs. GCC file paths for use in the benchmarking modules.

gchp_vs_gcc_dirs(config, subdir)

Convenience function to return GCHP vs. GCC file paths for use in the benchmarking modules.

gchp_vs_gchp_dirs(config, subdir)

Convenience function to return GCHP vs. GCHP file paths for use in the benchmarking modules.

get_common_varnames(refdata, devdata, prefix)

Returns an alphabetically sorted list of the variables that are common to two xr.Dataset objects and match a given prefix.

get_datetimes_from_filenames(files)

Returns datetimes obtained from GEOS-Chem diagnostic or restart file names.

get_geoschem_level_metadata([filename, ...])

Reads a comma-separated values (.csv) file with GEOS-Chem vertical level metadata and returns it in a pandas.DataFrame object.

get_log_filepaths(logs_dir, template, timestamps)

Returns a list of paths for GEOS-Chem log files.

get_lumped_species_definitions()

Returns lumped species definitions from a YAML file.

get_species_categories([benchmark_type])

Returns the list of benchmark categories that each species belongs to.

get_species_database_files(config, ...)

Returns the paths to the species_database.yml files in the Ref and Dev benchmark run directories.

make_output_dir(dst, collection, subdst[, ...])

Creates a subdirectory for the given collection type in the destination directory.

pdf_filename(dst, collection, subdst, plot_type)

Creates the absolute path for a PDF file containing benchmark plots.

print_benchmark_info(config)

Prints which benchmark plots and tables will be generated.

print_sigdiffs(sigdiff_files, sigdiff_list, ...)

Appends a list of species showing significant differences in a benchmark plotting category to a file.

read_ref_and_dev(ref, dev[, time_mean, ...])

Reads files from the Ref and Dev models into xarray.Dataset objects.

rename_speciesconc_to_speciesconcvv(dset)

Renames netCDF variables starting with "SpeciesConc_" (which was used prior to GEOS-Chem 14.1.0) to start with "SpeciesConcVV_".

write_sigdiff(sigdiff_list, sigdiff_cat, ...)

Appends a list of species showing significant differences in a benchmark plotting category to a file.

gcpy.benchmark.modules.benchmark_utils.make_output_dir(dst, collection, subdst, overwrite=False)[source]

Creates a subdirectory for the given collection type in the destination directory.

Parameters:
  • dst (str) – Destination directory.

  • collection (str) – e.g. “Aerosols”, “DryDep”, “Oxidants”, …

  • subdst (str or None) – e.g. “AnnualMean”, “Apr2019”, …

  • overwrite (bool, optional) – Overwrite existing directory contents?

Returns:

dst – Path of the directory that was created.

Return type:

str
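The directory handling described above can be sketched with the standard library alone. This is a minimal illustration, not gcpy's implementation: the helper name make_output_dir_sketch, the dst/collection/subdst layout, and the choice to raise FileExistsError when overwrite is False are all assumptions.

```python
import os
import shutil
import tempfile


def make_output_dir_sketch(dst, collection, subdst=None, overwrite=False):
    """Hypothetical stand-in for make_output_dir (not gcpy's code)."""
    # Assumed layout: dst/collection, plus /subdst when one is given.
    target = os.path.join(dst, collection)
    if subdst is not None:
        target = os.path.join(target, subdst)

    if os.path.isdir(target):
        if not overwrite:
            # Assumed behavior: refuse to touch an existing directory.
            raise FileExistsError(f"{target} exists; pass overwrite=True")
        # Assumed meaning of "overwrite": discard the old contents.
        shutil.rmtree(target)

    os.makedirs(target)
    return target


# Demonstrate inside a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    path = make_output_dir_sketch(root, "Aerosols", "AnnualMean")
    print(os.path.relpath(path, root))
```

Passing subdst=None in this sketch yields the collection directory itself, which matches the "str or None" type listed for subdst.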

gcpy.benchmark.modules.benchmark_utils.read_ref_and_dev(ref, dev, time_mean=False, multi_file=False, verbose=False)[source]

Reads files from the Ref and Dev models into xarray.Dataset objects.

Parameters:
  • ref (str or list) – Ref data file(s).

  • dev (str or list) – Dev data file(s).

  • time_mean (bool, optional) – Return the average over the time dimension?

  • multi_file (bool, optional) – Read multiple files without averaging over time.

  • verbose (bool, optional) – Enable verbose output.

Returns:

  • ref_data (xr.Dataset) – Data from the Ref model.

  • dev_data (xr.Dataset) – Data from the Dev model.

gcpy.benchmark.modules.benchmark_utils.get_common_varnames(refdata, devdata, prefix, verbose=False)[source]

Returns an alphabetically sorted list of the variables that are common to two xr.Dataset objects and match a given prefix.

Parameters:
  • refdata (xr.Dataset) – Data from the Ref model.

  • devdata (xr.Dataset) – Data from the Dev model.

  • prefix (str) – Variable prefix to match.

  • verbose (bool, optional) – Toggle verbose printout on/off.

Returns:

varlist – Sorted list of common variable names.

Return type:

list
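The matching logic amounts to a set intersection followed by a prefix filter and a sort. The sketch below shows that idea on plain lists of names so it runs without xarray; the helper name common_varnames_sketch is hypothetical, and with real datasets the names could be taken from list(dset.data_vars).

```python
def common_varnames_sketch(ref_names, dev_names, prefix):
    """Hypothetical stand-in: sorted names present in both inputs."""
    # Keep only names that appear in both Ref and Dev.
    common = set(ref_names) & set(dev_names)
    # Filter by prefix; sorted() orders by code point (case-sensitive).
    return sorted(name for name in common if name.startswith(prefix))


ref_names = ["SpeciesConcVV_O3", "SpeciesConcVV_CO", "Met_T", "SpeciesConcVV_NO"]
dev_names = ["SpeciesConcVV_CO", "SpeciesConcVV_O3", "Met_T", "SpeciesConcVV_NO2"]

varlist = common_varnames_sketch(ref_names, dev_names, "SpeciesConcVV_")
print(varlist)  # ['SpeciesConcVV_CO', 'SpeciesConcVV_O3']
```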

gcpy.benchmark.modules.benchmark_utils.print_sigdiffs(sigdiff_files, sigdiff_list, sigdiff_type, sigdiff_cat)[source]

Appends a list of species showing significant differences in a benchmark plotting category to a file.

Parameters:
  • sigdiff_files (list or None) – List of files for significant diffs output.

  • sigdiff_list (list) – List of significant differences to print.

  • sigdiff_type (str) – e.g. “sfc”, “500hPa”, “zm”.

  • sigdiff_cat (str) – e.g. “Oxidants”, “Aerosols”, “DryDep”, etc.

gcpy.benchmark.modules.benchmark_utils.write_sigdiff(sigdiff_list, sigdiff_cat, sigdiff_file)[source]

Appends a list of species showing significant differences in a benchmark plotting category to a file.

Parameters:
  • sigdiff_list (list) – List of significant differences.

  • sigdiff_cat (str) – e.g. “Oxidants”, “Aerosols”, “DryDep”, etc.

  • sigdiff_file (str) – Filename to which the list will be appended.

gcpy.benchmark.modules.benchmark_utils.pdf_filename(dst, collection, subdst, plot_type)[source]

Creates the absolute path for a PDF file containing benchmark plots.

Parameters:
  • dst (str) – Root folder for benchmark output plots.

  • collection (str) – e.g. “Aerosols”, “DryDep”, etc.

  • subdst (str or None) – e.g. “AnnualMean”, “Apr2019”, etc.

  • plot_type (str) – e.g. “Surface”, “FullColumnZonalMean”, etc.

Returns:

pdf_path – Absolute path for the PDF file containing plots.

Return type:

str

gcpy.benchmark.modules.benchmark_utils.print_benchmark_info(config)[source]

Prints which benchmark plots and tables will be generated.

Parameters:

config (dict) – Inputs from the benchmark config YAML file.

gcpy.benchmark.modules.benchmark_utils.get_geoschem_level_metadata(filename=None, search_key=None, verbose=False)[source]

Reads a comma-separated values (.csv) file with GEOS-Chem vertical level metadata and returns it in a pandas.DataFrame object.

Parameters:
  • filename (str, optional) – Name of the comma-separated values file to read.

  • search_key (str or None, optional) – Returns metadata that matches this value.

  • verbose (bool, optional) – Toggles verbose printout on or off.

Returns:

metadata – Metadata for GEOS-Chem vertical levels.

Return type:

pd.DataFrame

gcpy.benchmark.modules.benchmark_utils.get_lumped_species_definitions()[source]

Returns lumped species definitions from a YAML file.

Returns:

lumped_spc_dict – Dictionary of lumped species.

Return type:

dict

gcpy.benchmark.modules.benchmark_utils.archive_lumped_species_definitions(dst)[source]

Archives lumped species definitions to a YAML file.

Parameters:

dst (str) – Destination folder for YAML file output.

gcpy.benchmark.modules.benchmark_utils.add_lumped_species_to_dataset(dset, lspc_dict=None, lspc_yaml='', verbose=False, overwrite=False, prefix='SpeciesConcVV_')[source]

Calculates lumped species concentrations and adds them to an xarray Dataset. Lumped species definitions may be passed as a dictionary or as a path to a YAML file. If neither is passed, the lumped species YAML file stored in gcpy is used. This file is customized for use with the SpeciesConc diagnostic collection output from benchmark simulations. The algorithm has been optimized by AI to improve performance.

Parameters:
  • dset (xr.Dataset) – Data prior to adding lumped species.

  • lspc_dict (dict, optional) – Species & scale factors for each lumped species.

  • lspc_yaml (str, optional) – YAML file with lumped species definitions.

  • verbose (bool, optional) – Toggles verbose printout on or off.

  • overwrite (bool, optional) – Overwrite lumped species that already exist in dset (True) or raise an error (False).

  • prefix (str, optional) – Prefix to prepend to lumped species names.

Returns:

dset – Original species plus added lumped species.

Return type:

xr.Dataset

Notes

Key Improvements:

  1. Vectorized summation: Uses sum(to_sum) instead of incremental +=

  2. Lazy evaluation: Operations remain lazy until actual computation

  3. Single merge: Uses .assign() instead of merging many DataArrays

  4. Cleaner logic: More Pythonic dictionary iteration

Performance Impact:

  • Original: O(n_lumped × n_constituents) individual array operations

  • Optimized: O(n_lumped) vectorized operations
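The pattern named in the notes (collect the scaled constituents, call sum() once, then attach all new variables in one step) can be illustrated without xarray. The sketch below uses a plain dict of floats in place of a Dataset. The helper name, the {lumped name: {constituent: scale factor}} layout, and the choice to skip constituents that are missing from the data are assumptions, not gcpy's confirmed behavior.

```python
def add_lumped_species_sketch(values, lspc_dict,
                              prefix="SpeciesConcVV_", overwrite=False):
    """Hypothetical stand-in operating on a dict instead of a Dataset."""
    new_vars = {}
    for lumped_name, constituents in lspc_dict.items():
        out_name = prefix + lumped_name
        if out_name in values and not overwrite:
            raise ValueError(f"{out_name} already exists")

        # Gather every scaled constituent first (assumed: skip missing ones).
        to_sum = [
            values[prefix + spc] * scale
            for spc, scale in constituents.items()
            if prefix + spc in values
        ]
        if to_sum:
            # One sum() call instead of repeated "+=" updates.
            new_vars[out_name] = sum(to_sum)

    # Single combined update, analogous to one .assign() call.
    return {**values, **new_vars}


values = {"SpeciesConcVV_Br": 0.5, "SpeciesConcVV_Br2": 0.25}
lspc_dict = {"Bry": {"Br": 1.0, "Br2": 2.0}}  # illustrative definition

result = add_lumped_species_sketch(values, lspc_dict)
print(result["SpeciesConcVV_Bry"])  # 1.0
```

The same arithmetic works on xarray DataArray objects, since they support scalar multiplication and addition; with dask-backed data the sum would stay lazy until computed.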

gcpy.benchmark.modules.benchmark_utils.get_species_categories(benchmark_type='FullChemBenchmark')[source]

Returns the list of benchmark categories that each species belongs to. This determines which PDF files will contain the plots for the various species.

Parameters:

benchmark_type (str, optional) – Specifies the type of the benchmark.

Returns:

spc_cat_dict – Dictionary of categories and sub-categories.

Return type:

dict

gcpy.benchmark.modules.benchmark_utils.archive_species_categories(dst)[source]

Writes the list of benchmark categories to a YAML file for archival purposes.

Parameters:

dst (str) – Destination folder for YAML file output.

gcpy.benchmark.modules.benchmark_utils.rename_speciesconc_to_speciesconcvv(dset)[source]

Renames netCDF variables starting with “SpeciesConc_” (which was used prior to GEOS-Chem 14.1.0) to start with “SpeciesConcVV_”. This is needed for backwards compatibility with older versions.

Parameters:

dset (xr.Dataset) – The input dataset.

Returns:

dset – The modified dataset.

Return type:

xr.Dataset
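The renaming reduces to building an old-name to new-name mapping for every variable that carries the old prefix. The sketch below builds such a mapping from a list of names; the helper name build_rename_map is hypothetical. A mapping of this shape is what xr.Dataset.rename accepts, although whether gcpy applies it that way is an assumption.

```python
OLD_PREFIX = "SpeciesConc_"
NEW_PREFIX = "SpeciesConcVV_"


def build_rename_map(names):
    """Hypothetical helper: map old-style names to new-style names."""
    # Names already starting with "SpeciesConcVV_" do not match the old
    # prefix (the character after "SpeciesConc" is "V", not "_").
    return {
        name: NEW_PREFIX + name[len(OLD_PREFIX):]
        for name in names
        if name.startswith(OLD_PREFIX)
    }


names = ["SpeciesConc_O3", "SpeciesConcVV_CO", "Met_T"]
rename_map = build_rename_map(names)
print(rename_map)  # {'SpeciesConc_O3': 'SpeciesConcVV_O3'}
```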

gcpy.benchmark.modules.benchmark_utils.gcc_vs_gcc_dirs(config, subdir)[source]

Convenience function to return GCC vs. GCC file paths for use in the benchmarking modules.

Parameters:
  • config (dict) – Info read from config file.

  • subdir (str) – Subdirectory.

Returns:

  • refdir (str) – File path for the Ref model.

  • devdir (str) – File path for the Dev model.

gcpy.benchmark.modules.benchmark_utils.gchp_vs_gcc_dirs(config, subdir)[source]

Convenience function to return GCHP vs. GCC file paths for use in the benchmarking modules.

Parameters:
  • config (dict) – Info read from config file.

  • subdir (str) – Subdirectory.

Returns:

  • refdir (str) – File path for the Ref model.

  • devdir (str) – File path for the Dev model.

gcpy.benchmark.modules.benchmark_utils.gchp_vs_gchp_dirs(config, subdir)[source]

Convenience function to return GCHP vs. GCHP file paths for use in the benchmarking modules.

Parameters:
  • config (dict) – Info read from config file.

  • subdir (str) – Subdirectory.

Returns:

  • refdir (str) – File path for the Ref model.

  • devdir (str) – File path for the Dev model.

gcpy.benchmark.modules.benchmark_utils.get_log_filepaths(logs_dir, template, timestamps)[source]

Returns a list of paths for GEOS-Chem log files. These are needed to compute the benchmark timing tables.

Parameters:
  • logs_dir (str) – Path to the directory containing the log files.

  • template (str) – Log file name template containing the “%DATE%” token.

  • timestamps (list) – List of datetimes.

Returns:

result – List of log file paths.

Return type:

list
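Token substitution of this kind can be sketched as follows. The helper name log_filepaths_sketch, the template "log.%DATE%", and the YYYYMMDD date format are assumptions made for illustration; the format gcpy actually writes into the token is not stated here.

```python
import os
from datetime import datetime


def log_filepaths_sketch(logs_dir, template, timestamps,
                         date_format="%Y%m%d"):
    """Hypothetical stand-in: one log path per timestamp."""
    paths = []
    for stamp in timestamps:
        # Replace the "%DATE%" token with the formatted date (assumed format).
        filename = template.replace("%DATE%", stamp.strftime(date_format))
        paths.append(os.path.join(logs_dir, filename))
    return paths


timestamps = [datetime(2019, 1, 1), datetime(2019, 2, 1)]
log_paths = log_filepaths_sketch("logs", "log.%DATE%", timestamps)
for path in log_paths:
    print(path)
```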

gcpy.benchmark.modules.benchmark_utils.get_datetimes_from_filenames(files)[source]

Returns datetimes obtained from GEOS-Chem diagnostic or restart file names.

Parameters:

files (list) – GEOS-Chem diagnostic/restart file names.

Returns:

datetimes – Array of np.datetime64 values.

Return type:

np.ndarray
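Extracting timestamps from file names can be sketched with a regular expression. The example assumes names that embed a YYYYMMDD_hhmmz stamp, such as GEOSChem.SpeciesConc.20190101_0000z.nc4. The helper name is hypothetical, and it returns datetime.datetime objects rather than the np.datetime64 array described above so that it runs with the standard library only.

```python
import re
from datetime import datetime

# Assumed stamp layout: 8-digit date, underscore, 4-digit time, then "z".
STAMP = re.compile(r"(\d{8})_(\d{4})z")


def datetimes_from_filenames_sketch(files):
    """Hypothetical stand-in: parse one datetime per file name."""
    result = []
    for name in files:
        match = STAMP.search(name)
        if match is None:
            raise ValueError(f"No timestamp found in {name!r}")
        result.append(
            datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M")
        )
    return result


files = [
    "GEOSChem.SpeciesConc.20190101_0000z.nc4",
    "GEOSChem.Restart.20190201_0000z.nc4",
]
parsed = datetimes_from_filenames_sketch(files)
print(parsed[0].isoformat())  # 2019-01-01T00:00:00
```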

gcpy.benchmark.modules.benchmark_utils.get_species_database_files(config, ref_model, dev_model)[source]

Returns the paths to the species_database.yml files in the Ref and Dev benchmark run directories.

Parameters:
  • config (dict) – Benchmark configuration information.

  • ref_model (str) – Either “gcc” or “gchp”.

  • dev_model (str) – Either “gcc” or “gchp”.

Returns:

spcdb_files – Paths to the species database files corresponding to Ref & Dev simulations.

Return type:

list