gcpy.util

Internal utilities for helping to manage xarray and numpy objects used throughout GCPy

Module Contents

Functions

convert_lon(data[, dim, format, neg_dateline])

Convert longitudes from -180..180 to 0..360, or vice-versa.

get_emissions_varnames(commonvars[, template])

Returns a list of emissions diagnostic variable names that contain a particular search string.

create_display_name(diagnostic_name)

Converts a diagnostic name to a more easily digestible name

print_totals(ref, dev, f[, masks])

Computes and prints Ref and Dev totals (as well as the difference Dev - Ref) for two xarray DataArray objects.

get_species_categories([benchmark_type])

Returns the list of benchmark categories that each species belongs to.

archive_species_categories(dst)

Writes the list of benchmark categories to a YAML file

add_bookmarks_to_pdf(pdfname, varlist[, ...])

Adds bookmarks to an existing PDF file.

add_nested_bookmarks_to_pdf(pdfname, category, ...[, ...])

Add nested bookmarks to PDF.

add_missing_variables(refdata, devdata[, verbose])

Compares two xarray Datasets, "Ref" and "Dev", and adds a DataArray of missing values (NaN) for each variable that is present in only one of them.

reshape_MAPL_CS(da)

Reshapes data if its dimensions indicate MAPL v1.0.0+ output.

get_diff_of_diffs(ref, dev)

Generate datasets containing differences between two datasets

slice_by_lev_and_time(ds, varname, itime, ilev, flip)

Slice a DataArray by desired time and level.

rename_and_flip_gchp_rst_vars(ds)

Transforms a GCHP restart dataset to match GCC names and level convention

dict_diff(dict0, dict1)

Function to take the difference of two dict objects.

compare_varnames(refdata, devdata[, refonly, devonly, ...])

Finds variables that are common to two xarray Dataset objects.

compare_stats(refdata, refstr, devdata, devstr, varname)

Prints out global statistics (array sizes, mean, min, max, sum)

convert_bpch_names_to_netcdf_names(ds[, verbose])

Function to convert the non-standard bpch diagnostic names to names used in the GEOS-Chem netCDF diagnostic outputs.

get_lumped_species_definitions()

Returns lumped species definitions from a YAML file.

archive_lumped_species_definitions(dst)

Archives lumped species definitions to a YAML file.

add_lumped_species_to_dataset(ds[, lspc_dict, ...])

Function to calculate lumped species concentrations and add them to an xarray Dataset.

filter_names(names[, text])

Returns elements in a list that match a given substring.

divide_dataset_by_dataarray(ds, dr[, varlist])

Divides variables in an xarray Dataset object by a single DataArray

get_shape_of_data(data[, vertical_dim, return_dims])

Convenience routine to return the shape (and dimensions, if requested) of an xarray Dataset or xarray DataArray.

get_area_from_dataset(ds)

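Convenience routine to return the area variable (which is usually called “AREA” for GEOS-Chem “Classic” or “Met_AREAM2” for GCHP) from an xarray Dataset object.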

get_variables_from_dataset(ds, varlist)

Convenience routine to return multiple selected DataArray variables from an xarray Dataset.

create_dataarray_of_nan(name, sizes, coords, attrs[, ...])

Returns a DataArray object with the given name, dimension sizes, coordinates, and attributes, but with its data set to missing values (NaN) everywhere.

check_for_area(ds[, gcc_area_name, gchp_area_name])

Makes sure that a dataset has a surface area variable contained within it.

get_filepath(datadir, col, date[, is_gchp, ...])

Routine to return file path for a given GEOS-Chem “Classic” (aka “GCC”) or GCHP diagnostic collection and date.

get_filepaths(datadir, collections, dates[, is_gchp, ...])

Routine to return filepaths for a given GEOS-Chem “Classic” (aka “GCC”) or GCHP diagnostic collection.

extract_pathnames_from_log(filename[, prefix_filter])

Returns a list of pathnames from a GEOS-Chem log file.

get_gcc_filepath(outputdir, collection, day, time)

Routine for getting filepath of GEOS-Chem Classic output

get_gchp_filepath(outputdir, collection, day, time)

Routine for getting filepath of GCHP output

get_nan_mask(data)

Create a mask with NaN values removed from an input array

all_zero_or_nan(ds)

Return whether ds is all zeros, or all nans

dataset_mean(ds[, dim, skipna])

Convenience wrapper for taking the mean of an xarray Dataset.

dataset_reader(multi_files)

Returns a function to read an xarray Dataset.

read_config_file(config_file)

Reads configuration information from a YAML file.

gcpy.util.convert_lon(data, dim='lon', format='atlantic', neg_dateline=True)

Convert longitudes from -180..180 to 0..360, or vice-versa.

Args:
data: DataArray or Dataset

The container holding the data to be converted; the dimension indicated by ‘dim’ must be associated with this container

Keyword Args (optional):
dim: str

Name of the dimension holding the longitude coordinates. Default value: ‘lon’

format: str

Controls whether to shift from -180..180 to 0..360 (‘pacific’) or from 0..360 to -180..180 (‘atlantic’). Default value: ‘atlantic’

neg_dateline: bool

If True, then the international dateline is set to -180 instead of 180. Default value: True

Returns:

data, with dimension ‘dim’ altered according to conversion rule
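The conversion rule can be illustrated on a plain list of longitudes. This is a minimal sketch of the arithmetic only, assuming the wrap points described above; shift_longitudes is a hypothetical name, and the real function operates on the ‘dim’ coordinate of a DataArray or Dataset rather than on a list.

```python
def shift_longitudes(lons, fmt="atlantic", neg_dateline=True):
    """Shift longitudes between the 0..360 and -180..180 conventions (sketch)."""
    shifted = []
    for lon in lons:
        if fmt == "atlantic":
            # 0..360 -> -180..180: wrap the eastern half to negative values.
            # With neg_dateline=True the dateline itself (180) becomes -180.
            wrap = lon >= 180 if neg_dateline else lon > 180
            shifted.append(lon - 360 if wrap else lon)
        elif fmt == "pacific":
            # -180..180 -> 0..360: wrap negative longitudes into 180..360.
            shifted.append(lon + 360 if lon < 0 else lon)
        else:
            raise ValueError(f"unknown format: {fmt!r}")
    return shifted


print(shift_longitudes([0, 90, 180, 270]))                       # [0, 90, -180, -90]
print(shift_longitudes([0, 90, 180, 270], neg_dateline=False))   # [0, 90, 180, -90]
print(shift_longitudes([-180, -90, 0, 90], fmt="pacific"))       # [180, 270, 0, 90]
```

The sketch does not reorder the values after shifting; whether and how the data are re-sorted along ‘dim’ is not covered here.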

gcpy.util.get_emissions_varnames(commonvars, template=None)

Will return a list of emissions diagnostic variable names that contain a particular search string.

Args:
commonvars: list of strs

A list of common variable names from two data sets. (This can be obtained with method gcpy.util.compare_varnames.)

template: str

String template for matching variable names corresponding to emission diagnostics by sector. Default value: None

Returns:
varnames: list of strs

A list of variable names corresponding to emission diagnostics for a given species and sector

gcpy.util.create_display_name(diagnostic_name)

Converts a diagnostic name to a more easily digestible name that can be used as a plot title or in a table of totals.

Args:
diagnostic_name: str

Name of the diagnostic to be formatted

Returns:
display_name: str

Formatted name that can be used as plot titles or in tables of emissions totals.

Remarks:

Assumes that diagnostic names will start with either “Emis” (for emissions by category) or “Inv” (for emissions by inventory). This should be an OK assumption to make since this routine is specifically geared towards model benchmarking.
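The prefix assumption in the remark can be sketched as follows. The exact formatting rules are not given in this reference, so the sketch assumes the simplest reading: drop a leading “Emis” or “Inv” and turn underscores into spaces. display_name_sketch and the sample diagnostic names are hypothetical.

```python
def display_name_sketch(diagnostic_name):
    """Turn a diagnostic name into a title-friendly string (assumed rules)."""
    name = diagnostic_name
    # Assumption: only a leading "Emis" or "Inv" is stripped.
    for prefix in ("Emis", "Inv"):
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    # Assumption: underscores separate words in the displayed name.
    return name.replace("_", " ").strip()


print(display_name_sketch("EmisCO_Anthro"))  # CO Anthro
print(display_name_sketch("InvCEDS_NO"))     # CEDS NO
```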

gcpy.util.print_totals(ref, dev, f, masks=None)

Computes and prints Ref and Dev totals (as well as the difference Dev - Ref) for two xarray DataArray objects.

Args:
ref: xarray DataArray

The first DataArray to be compared (aka “Reference”)

dev: xarray DataArray

The second DataArray to be compared (aka “Development”)

f: file

File object denoting a text file where output will be directed.

Keyword Args (optional):
masks: dict of xarray DataArray

Dictionary containing the tropospheric mask arrays for Ref and Dev. If this keyword argument is passed, then print_totals will print tropospheric totals. Default value: None (i.e. print whole-atmosphere totals)

Remarks:

This is an internal method. It is meant to be called from method create_total_emissions_table or create_global_mass_table instead of being called directly.

gcpy.util.get_species_categories(benchmark_type='FullChemBenchmark')

Returns the list of benchmark categories that each species belongs to. This determines which PDF files will contain the plots for the various species.

Args:
benchmark_type: str

Specifies the type of the benchmark (either FullChemBenchmark (default) or TransportTracersBenchmark).

Returns:
spc_cat_dict: dict

A nested dictionary of categories (and sub-categories) and the species belonging to each.

NOTE: The benchmark categories are specified in YAML file benchmark_species.yml.

gcpy.util.archive_species_categories(dst)

Writes the list of benchmark categories to a YAML file named “benchmark_species.yml”.

Args:
dst: str

Name of the folder where the YAML file containing benchmark categories (“benchmark_species.yml”) will be written.

gcpy.util.add_bookmarks_to_pdf(pdfname, varlist, remove_prefix='', verbose=False)

Adds bookmarks to an existing PDF file.

Args:
pdfname: str

Name of an existing PDF file of species or emission plots to which bookmarks will be attached.

varlist: list

List of variables, which will be used to create the PDF bookmark names.

Keyword Args (optional):
remove_prefix: str

Specifies a prefix to remove from each entry in varlist when creating bookmarks. For example, if varlist has a variable name “SpeciesConc_NO”, and you specify remove_prefix=“SpeciesConc_”, then the bookmark for that variable will be just “NO”. Default value: ‘’

verbose: bool

Set this flag to True to print extra informational output. Default value: False
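The effect of remove_prefix on bookmark names can be sketched without touching a PDF. This is an illustration of the naming rule only; bookmark_labels is a hypothetical helper, not part of gcpy.

```python
def bookmark_labels(varlist, remove_prefix=""):
    """Return the bookmark label for each variable name (sketch)."""
    labels = []
    for var in varlist:
        if remove_prefix and var.startswith(remove_prefix):
            # Strip the prefix so that only the species name remains.
            labels.append(var[len(remove_prefix):])
        else:
            labels.append(var)
    return labels


print(bookmark_labels(["SpeciesConc_NO", "SpeciesConc_O3"],
                      remove_prefix="SpeciesConc_"))  # ['NO', 'O3']
```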

gcpy.util.add_nested_bookmarks_to_pdf(pdfname, category, catdict, warninglist, remove_prefix='')

Add nested bookmarks to PDF.

Args:
pdfname: str

Path of PDF to add bookmarks to

category: str

Top-level key name in catdict that maps to contents of PDF

catdict: dictionary

Dictionary containing key-value pairs where one top-level key matches category and has value fully describing pages in PDF. The value is a dictionary where keys are level 1 bookmark names, and values are lists of level 2 bookmark names, with one level 2 name per PDF page. Level 2 names must appear in catdict in the same order as in the PDF.

warninglist: list of strings

Level 2 bookmark names to skip since not present in PDF.

Keyword Args (optional):
remove_prefix: str

Prefix to be removed from warninglist names before comparing with level 2 bookmark names in catdict. Default value: empty string (warninglist names match names in catdict)
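The expected shape of catdict, and how warninglist affects the page-to-bookmark mapping, can be sketched as below. The category and species names are made up for illustration, plan_nested_bookmarks is a hypothetical helper, and the sketch assumes that a skipped level 2 name consumes no PDF page.

```python
def plan_nested_bookmarks(category, catdict, warninglist, remove_prefix=""):
    """Return (level1, level2, page_index) tuples in PDF page order (sketch)."""
    # Normalize warninglist names so they can be compared with catdict names.
    skip = set()
    for name in warninglist:
        if remove_prefix and name.startswith(remove_prefix):
            name = name[len(remove_prefix):]
        skip.add(name)

    plan = []
    page = 0
    for level1, level2_names in catdict[category].items():
        for level2 in level2_names:
            if level2 in skip:
                continue  # assumed: not in the PDF, so no page is consumed
            plan.append((level1, level2, page))
            page += 1
    return plan


# Hypothetical category dictionary: one top-level key per PDF file.
catdict = {"Oxidants": {"Ox": ["O3", "NO2"], "HOx": ["OH", "HO2"]}}
plan = plan_nested_bookmarks("Oxidants", catdict,
                             warninglist=["SpeciesConc_NO2"],
                             remove_prefix="SpeciesConc_")
print(plan)  # [('Ox', 'O3', 0), ('HOx', 'OH', 1), ('HOx', 'HO2', 2)]
```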

gcpy.util.add_missing_variables(refdata, devdata, verbose=False, **kwargs)

Compares two xarray Datasets, “Ref” and “Dev”. For each variable that is present in “Ref” but not in “Dev”, a DataArray of missing values (i.e. NaN) will be added to “Dev”. Similarly, for each variable that is present in “Dev” but not in “Ref”, a DataArray of missing values will be added to “Ref”. This routine is mostly intended for benchmark purposes, so that we can represent variables that were removed from a new GEOS-Chem version by missing values in the benchmark plots.

NOTE: This function assumes that the incoming datasets have the same sizes and dimensions, which is not true when comparing datasets with different grid resolutions or types.

Args:
refdata: xarray Dataset

The “Reference” (aka “Ref”) dataset

devdata: xarray Dataset

The “Development” (aka “Dev”) dataset

Keyword Args (optional):
verbose: bool

Toggles extra debug print output. Default value: False

Returns:
refdata, devdata: xarray Datasets

The returned “Ref” and “Dev” datasets, with placeholder missing value variables added
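The symmetric fill can be sketched with plain dictionaries standing in for the two Datasets. This is an illustration of the logic only, assuming (as the note above says) that all variables share the same size; add_missing_sketch and the variable names are hypothetical.

```python
import math


def add_missing_sketch(ref, dev):
    """Add NaN placeholders so that ref and dev hold the same variables (sketch)."""
    ref_out, dev_out = dict(ref), dict(dev)
    for name, values in ref.items():
        if name not in dev_out:
            # Present in Ref only: add an all-NaN stand-in to Dev.
            dev_out[name] = [math.nan] * len(values)
    for name, values in dev.items():
        if name not in ref_out:
            # Present in Dev only: add an all-NaN stand-in to Ref.
            ref_out[name] = [math.nan] * len(values)
    return ref_out, dev_out


ref = {"SpeciesConc_O3": [1.0, 2.0], "SpeciesConc_OLD": [0.1, 0.2]}
dev = {"SpeciesConc_O3": [1.1, 2.1], "SpeciesConc_NEW": [0.3, 0.4]}
ref_filled, dev_filled = add_missing_sketch(ref, dev)
print(sorted(ref_filled) == sorted(dev_filled))  # True
```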

gcpy.util.reshape_MAPL_CS(da)

Reshapes data if its dimensions indicate MAPL v1.0.0+ output.

Args:
da: xarray DataArray

Data array variable

Returns:
data: xarray DataArray

Data with dimensions renamed and transposed to match old MAPL format

gcpy.util.get_diff_of_diffs(ref, dev)

Generate datasets containing differences between two datasets

Args:
ref: xarray Dataset

The “Reference” (aka “Ref”) dataset.

dev: xarray Dataset

The “Development” (aka “Dev”) dataset

Returns:
absdiffs: xarray Dataset

Dataset containing dev-ref values

fracdiffs: xarray Dataset

Dataset containing dev/ref values

gcpy.util.slice_by_lev_and_time(ds, varname, itime, ilev, flip)

Slice a DataArray by desired time and level.

Args:
ds: xarray Dataset

Dataset containing GEOS-Chem data.

varname: str

Variable name for data variable to be sliced

itime: int

Index of time by which to slice

ilev: int

Index of level by which to slice

flip: bool

Whether to flip ilev to be indexed from ground or top of atmosphere

Returns:
ds[varname]: xarray DataArray

DataArray of data variable sliced according to ilev and itime
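The role of flip can be sketched with a nested list indexed as data[time][lev]. This is a minimal illustration that assumes flipping means counting ilev from the opposite end of the level axis; slice_sketch is a hypothetical name and the real function selects from an xarray Dataset.

```python
def slice_sketch(data, itime, ilev, flip):
    """Pick one (time, level) entry from data[time][lev] (sketch)."""
    levels = data[itime]
    if flip:
        # Count the level index from the other end of the vertical axis.
        ilev = len(levels) - 1 - ilev
    return levels[ilev]


# One time step, three levels stored top-of-atmosphere first.
data = [["top", "middle", "surface"]]
print(slice_sketch(data, itime=0, ilev=0, flip=False))  # top
print(slice_sketch(data, itime=0, ilev=0, flip=True))   # surface
```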

gcpy.util.rename_and_flip_gchp_rst_vars(ds)

Transforms a GCHP restart dataset to match GCC names and level convention

Args:
ds: xarray Dataset

Dataset containing GCHP restart file data, such as variables SPC_{species}, BXHEIGHT, DELP_DRY, and TropLev, with level convention down (level 0 is top-of-atmosphere).

Returns:
ds: xarray Dataset

Dataset containing GCHP restart file data with names and level convention matching GCC restart. Variables include SpeciesRst_{species}, Met_BXHEIGHT, Met_DELPDRY, and Met_TropLev, with level convention up (level 0 is surface).

gcpy.util.dict_diff(dict0, dict1)

Function to take the difference of two dict objects. Assumes that both objects have the same keys.

Args:
dict0, dict1: dict

Dictionaries to be subtracted (dict1 - dict0)

Returns:
result: dict

Key-by-key difference of dict1 - dict0
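A minimal sketch of the key-by-key subtraction described above, assuming both dictionaries have the same keys. dict_diff_sketch is a hypothetical stand-in, and the totals are made-up numbers.

```python
def dict_diff_sketch(dict0, dict1):
    """Return dict1 - dict0 for every key of dict0 (sketch)."""
    return {key: dict1[key] - dict0[key] for key in dict0}


ref_totals = {"CO": 10, "NO": 4}
dev_totals = {"CO": 12, "NO": 3}
print(dict_diff_sketch(ref_totals, dev_totals))  # {'CO': 2, 'NO': -1}
```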

gcpy.util.compare_varnames(refdata, devdata, refonly=None, devonly=None, quiet=False)

Finds variables that are common to two xarray Dataset objects.

Args:
refdata: xarray Dataset

The first Dataset to be compared. (This is often referred to as the “Reference” Dataset.)

devdata: xarray Dataset

The second Dataset to be compared. (This is often referred to as the “Development” Dataset.)

Keyword Args (optional):
quiet: bool

Set this flag to True if you wish to suppress printing informational output to stdout. Default value: False

Returns:
vardict: dict of lists of str

Dictionary containing several lists of variable names:

Key               Value
---               -----
commonvars        List of variables that are common to both refdata and devdata.
commonvarsOther   List of variables that are common to both refdata and devdata, but do not have lat, lon, and/or level dimensions (e.g. index variables).
commonvars2D      List of variables that are common to refdata and devdata, and that have lat and lon dimensions, but not level.
commonvars3D      List of variables that are common to refdata and devdata, and that have lat, lon, and level dimensions.
refonly           List of 2D or 3D variables that are only present in refdata.
devonly           List of 2D or 3D variables that are only present in devdata.
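The classification behind these keys can be sketched with dictionaries that map each variable name to its tuple of dimension names. The sketch assumes GEOS-Chem “Classic” dimension names (lat, lon, lev) and uses hypothetical variable names; classify_varnames is not part of gcpy.

```python
def has_latlon(dims):
    """True if a variable has both horizontal dimensions."""
    return "lat" in dims and "lon" in dims


def classify_varnames(ref_dims, dev_dims):
    """Group variable names the way vardict is described (sketch)."""
    common = sorted(set(ref_dims) & set(dev_dims))
    return {
        "commonvars": common,
        "commonvarsOther": [v for v in common if not has_latlon(ref_dims[v])],
        "commonvars2D": [v for v in common
                         if has_latlon(ref_dims[v]) and "lev" not in ref_dims[v]],
        "commonvars3D": [v for v in common
                         if has_latlon(ref_dims[v]) and "lev" in ref_dims[v]],
        # Only 2D or 3D variables are reported as refonly / devonly.
        "refonly": sorted(v for v in set(ref_dims) - set(dev_dims)
                          if has_latlon(ref_dims[v])),
        "devonly": sorted(v for v in set(dev_dims) - set(ref_dims)
                          if has_latlon(dev_dims[v])),
    }


ref_dims = {
    "SpeciesConc_O3": ("time", "lev", "lat", "lon"),
    "AREA": ("lat", "lon"),
    "hyam": ("lev",),
    "SpeciesConc_OLD": ("time", "lev", "lat", "lon"),
}
dev_dims = {
    "SpeciesConc_O3": ("time", "lev", "lat", "lon"),
    "AREA": ("lat", "lon"),
    "hyam": ("lev",),
    "SpeciesConc_NEW": ("time", "lev", "lat", "lon"),
}
vardict = classify_varnames(ref_dims, dev_dims)
print(vardict["commonvars3D"])  # ['SpeciesConc_O3']
```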

gcpy.util.compare_stats(refdata, refstr, devdata, devstr, varname)

Prints out global statistics (array sizes, mean, min, max, sum) from two xarray Dataset objects.

Args:
refdata: xarray Dataset

The first Dataset to be compared. (This is often referred to as the “Reference” Dataset.)

refstr: str

Label for refdata to be used in the printout

devdata: xarray Dataset

The second Dataset to be compared. (This is often referred to as the “Development” Dataset.)

devstr: str

Label for devdata to be used in the printout

varname: str

Variable name for which global statistics will be printed out.

gcpy.util.convert_bpch_names_to_netcdf_names(ds, verbose=False)

Function to convert the non-standard bpch diagnostic names to names used in the GEOS-Chem netCDF diagnostic outputs.

Args:
ds: xarray Dataset

The xarray Dataset object whose names are to be replaced.

Keyword Args (optional):
verbose: bool

Set this flag to True to print informational output. Default value: False

Returns:
ds_new: xarray Dataset

A new xarray Dataset object with all of the bpch-style diagnostic names replaced by GEOS-Chem netCDF names.

Remarks:

To add more diagnostic names, edit the dictionary contained in the file bpch_to_nc_names.yml.

gcpy.util.get_lumped_species_definitions()

Returns lumped species definitions from a YAML file.

Returns:
lumped_spc_dict: dict of str

Dictionary of lumped species

gcpy.util.archive_lumped_species_definitions(dst)

Archives lumped species definitions to a YAML file.

Args:
dst: str

Name of the folder where the YAML file containing lumped species definitions will be written.

gcpy.util.add_lumped_species_to_dataset(ds, lspc_dict={}, lspc_yaml='', verbose=False, overwrite=False, prefix='SpeciesConc_')

Function to calculate lumped species concentrations and add them to an xarray Dataset. Lumped species definitions may be passed as a dictionary or as a path to a YAML file. If neither is passed, then the lumped species YAML file stored in gcpy is used. This file is customized for use with benchmark simulation SpeciesConc diagnostic collection output.

Args:
ds: xarray Dataset

An xarray Dataset object prior to adding lumped species.

Keyword Args (optional):
lspc_dict: dictionary

Dictionary containing, for each lumped species, its constituent species and their integer scale factors. Default value: {}

lspc_yaml: str

Path to a YAML file containing lumped species definitions. Default value: ‘’

verbose: bool

Whether to print informational output. Default value: False

overwrite: bool

Whether to overwrite an existing species dataarray in a dataset if it has the same name as a new lumped species. If False and overlapping names are found then the function will raise an error. Default value: False

prefix: str

Prefix to prepend to new lumped species names. This argument is also used to extract an existing dataarray in the dataset with the correct size and dimensions to use during initialization of new lumped species dataarrays. Default value: “SpeciesConc_”

Returns:
ds_new: xarray Dataset

A new xarray Dataset object containing all of the original species plus new lumped species.
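The lumping arithmetic can be sketched with scalar concentrations in a plain dictionary. The composition of “Bry” below is a made-up example rather than the definition shipped with gcpy, add_lumped_sketch is a hypothetical name, and the sketch assumes every constituent species is present in the input.

```python
def add_lumped_sketch(conc, lspc_dict, prefix="SpeciesConc_", overwrite=False):
    """Add scaled sums of constituent species under new names (sketch)."""
    out = dict(conc)
    for lumped, parts in lspc_dict.items():
        name = prefix + lumped
        if name in out and not overwrite:
            raise ValueError(f"{name} already exists; pass overwrite=True")
        # Each constituent is weighted by its integer scale factor.
        out[name] = sum(scale * conc[prefix + spc] for spc, scale in parts.items())
    return out


# Hypothetical definition: lumped species -> {constituent: scale factor}.
lspc_dict = {"Bry": {"Br": 1, "Br2": 2}}
conc = {"SpeciesConc_Br": 1.0, "SpeciesConc_Br2": 0.5}
lumped = add_lumped_sketch(conc, lspc_dict)
print(lumped["SpeciesConc_Bry"])  # 2.0
```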

gcpy.util.filter_names(names, text='')

Returns elements in a list that match a given substring. Can be used in conjunction with compare_varnames to return a subset of variable names pertaining to a given diagnostic type or species.

Args:
names: list of str

Input list of names.

text: str

Target text string for restricting the search.

Returns:
filtered_names: list of str

Returns all elements of names that contain the substring specified by the “text” argument. If “text” is omitted, then the original contents of names will be returned.
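A minimal sketch of the substring filter, assuming plain containment matching. filter_names_sketch is a hypothetical stand-in and the variable names are examples only.

```python
def filter_names_sketch(names, text=""):
    """Keep the names that contain text; an empty text keeps everything."""
    return [name for name in names if text in name]


names = ["SpeciesConc_O3", "SpeciesConc_CO", "EmisCO_Anthro"]
print(filter_names_sketch(names, text="SpeciesConc_"))
# ['SpeciesConc_O3', 'SpeciesConc_CO']
print(filter_names_sketch(names) == names)  # True
```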

gcpy.util.divide_dataset_by_dataarray(ds, dr, varlist=None)

Divides variables in an xarray Dataset object by a single DataArray object. Will also make sure that the Dataset variable attributes are preserved. This method can be useful for certain types of model diagnostics that have to be divided by a counter array. For example, local noontime J-value variables in a Dataset can be divided by the fraction of time it was local noon in each grid box, etc.

Args:
ds: xarray Dataset

The Dataset object containing variables to be divided.

dr: xarray DataArray

The DataArray object that will be used to divide the variables of ds.

Keyword Args (optional):
varlist: list of str

If passed, then only those variables of ds that are listed in varlist will be divided by dr. Otherwise, all variables of ds will be divided by dr. Default value: None

Returns:
ds_new: xarray Dataset

A new xarray Dataset object with its variables divided by dr.

gcpy.util.get_shape_of_data(data, vertical_dim='lev', return_dims=False)

Convenience routine to return the shape (and dimensions, if requested) of an xarray Dataset or xarray DataArray. Can also take as input a dictionary of sizes (i.e. {‘time’: 1, ‘lev’: 72, …}) from an xarray Dataset or xarray DataArray object.

Args:
data: xarray Dataset, xarray DataArray, or dict

The data for which the size is requested.

Keyword Args (optional):
vertical_dim: str

Specify the vertical dimension that you wish to return: lev or ilev. Default value: ‘lev’

return_dims: bool

Set this switch to True if you also wish to return a list of dimensions in the same order as the tuple of dimension sizes. Default value: False

Returns:
shape: tuple of int

Tuple containing the sizes of each dimension of data in order: (time, lev|ilev, nf, lat|Ydim, lon|Xdim).

dims: list of str

If return_dims is True, then dims will contain a list of dimension names in the same order as shape: [‘time’, ‘lev’, ‘lat’, ‘lon’] for GEOS-Chem “Classic”, or [‘time’, ‘lev’, ‘nf’, ‘Ydim’, ‘Xdim’] for GCHP.
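The ordering rule can be sketched for the dictionary-of-sizes input. This is an illustration under the ordering stated above; shape_from_sizes is a hypothetical name and the grid sizes are examples.

```python
def shape_from_sizes(sizes, vertical_dim="lev", return_dims=False):
    """Order a {dim: size} mapping as (time, lev|ilev, nf, lat|Ydim, lon|Xdim)."""
    order = ["time", vertical_dim, "nf", "lat", "Ydim", "lon", "Xdim"]
    # Keep only the dimensions that are actually present, in canonical order.
    dims = [dim for dim in order if dim in sizes]
    shape = tuple(sizes[dim] for dim in dims)
    return (shape, dims) if return_dims else shape


gcc_sizes = {"lon": 72, "lat": 46, "lev": 72, "time": 1}
gchp_sizes = {"time": 1, "lev": 72, "nf": 6, "Ydim": 24, "Xdim": 24}
print(shape_from_sizes(gcc_sizes))                      # (1, 72, 46, 72)
print(shape_from_sizes(gchp_sizes, return_dims=True))
# ((1, 72, 6, 24, 24), ['time', 'lev', 'nf', 'Ydim', 'Xdim'])
```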

gcpy.util.get_area_from_dataset(ds)

Convenience routine to return the area variable (which is usually called “AREA” for GEOS-Chem “Classic” or “Met_AREAM2” for GCHP) from an xarray Dataset object.

Args:
ds: xarray Dataset

The input dataset.

Returns:
area_m2: xarray DataArray

The surface area in m2, as found in ds.

gcpy.util.get_variables_from_dataset(ds, varlist)

Convenience routine to return multiple selected DataArray variables from an xarray Dataset. All variables must be found in the Dataset, or else an error will be raised.

Args:
ds: xarray Dataset

The input dataset.

varlist: list of str

List of DataArray variables to extract from ds.

Returns:
ds_subset: xarray Dataset

A new data set containing only the variables that were requested.

Remarks: Use this routine if you absolutely need all of the requested variables to be returned. Otherwise

gcpy.util.create_dataarray_of_nan(name, sizes, coords, attrs, vertical_dim='lev')

Returns an xarray DataArray object with the given name, dimension sizes, coordinates, and attributes, but with its data set to missing values (NaN) everywhere. This is useful if you need to plot or compare two DataArray variables, and need to represent one as missing or undefined.

Args:
name: str

The name for the DataArray object that will contain NaNs.

sizes: dict of int

Dictionary of the dimension names and their sizes (e.g. {‘time’: 1, ‘lev’: 72, …}) that will be used to create the DataArray of NaNs. This can be obtained from an xarray Dataset as ds.sizes.

coords: dict of lists of float

Dictionary containing the coordinate variables that will be used to create the DataArray of NaNs. This can be obtained from an xarray Dataset with ds.coords.

attrs: dict of str

Dictionary containing the DataArray variable attributes (such as “units”, “long_name”, etc.). This can be obtained from an xarray DataArray with dr.attrs.

Returns:
dr: xarray DataArray

The output DataArray object, which will contain NaN values everywhere. This will denote missing data.

gcpy.util.check_for_area(ds, gcc_area_name='AREA', gchp_area_name='Met_AREAM2')

Makes sure that a dataset has a surface area variable contained within it. GEOS-Chem Classic files all contain surface area as variable AREA. GCHP files do not and area must be retrieved from the met-field collection from variable Met_AREAM2. To simplify comparisons, the GCHP area name will be appended to the dataset under the GEOS-Chem “Classic” area name if it is present.

Args:
ds: xarray Dataset

The Dataset object that will be checked.

Keyword Args (optional):
gcc_area_name: str

Specifies the name of the GEOS-Chem “Classic” surface area variable. Default value: “AREA”

gchp_area_name: str

Specifies the name of the GCHP surface area variable. Default value: “Met_AREAM2”

Returns:
ds: xarray Dataset

The modified Dataset object
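The aliasing step can be sketched with a dictionary standing in for the Dataset. check_for_area_sketch is a hypothetical name, and raising an error when neither area variable is present is an assumption of this sketch, not documented behavior.

```python
def check_for_area_sketch(ds, gcc_area_name="AREA", gchp_area_name="Met_AREAM2"):
    """Ensure the area is available under the GEOS-Chem Classic name (sketch)."""
    out = dict(ds)
    if gcc_area_name not in out:
        if gchp_area_name not in out:
            # Assumption: a dataset without any area variable is an error.
            raise ValueError("no surface area variable found")
        # Expose the GCHP area under the Classic name as well.
        out[gcc_area_name] = out[gchp_area_name]
    return out


gchp_like = {"Met_AREAM2": [1.0e9, 2.0e9], "SpeciesConc_O3": [3.0, 4.0]}
checked = check_for_area_sketch(gchp_like)
print(checked["AREA"] == checked["Met_AREAM2"])  # True
```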

gcpy.util.get_filepath(datadir, col, date, is_gchp=False, gchp_format_is_legacy=False)

Routine to return file path for a given GEOS-Chem “Classic” (aka “GCC”) or GCHP diagnostic collection and date.

Args:
datadir: str

Path name of the directory containing GCC or GCHP data files.

col: str

Name of collection (e.g. Emissions, SpeciesConc, etc.) for which file path will be returned.

date: numpy.datetime64

Date for which file paths are requested.

Keyword Args (optional):
is_gchp: bool

Set this switch to True to obtain file pathnames to GCHP diagnostic data files. If False, assumes GEOS-Chem “Classic”. Default value: False

gchp_format_is_legacy: bool

Set this switch to True to obtain GCHP file pathnames of the legacy format for diagnostics, which do not match GC-Classic filenames. Set to False to use same format as GC-Classic.

Returns:
path: str

Pathname for the specified collection and date.

gcpy.util.get_filepaths(datadir, collections, dates, is_gchp=False, gchp_format_is_legacy=False)

Routine to return filepaths for a given GEOS-Chem “Classic” (aka “GCC”) or GCHP diagnostic collection.

Args:
datadir: str

Path name of the directory containing GCC or GCHP data files.

collections: list of str

Names of collections (e.g. Emissions, SpeciesConc, etc.) for which file paths will be returned.

dates: array of numpy.datetime64

Array of dates for which file paths are requested.

Keyword Args (optional):
is_gchp: bool

Set this switch to True to obtain file pathnames to GCHP diagnostic data files. If False, assumes GEOS-Chem “Classic”. Default value: False

gchp_format_is_legacy: bool

Set this switch to True to obtain GCHP file pathnames of the legacy format for diagnostics, which do not match GC-Classic filenames. Set to False to use same format as GC-Classic.

Returns:
paths: 2D list of str

A list of pathnames for each specified collection and date. First dimension is collection, and second is date.
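The layout of the returned 2D list can be sketched as below. The file-name pattern is a placeholder chosen for this illustration, not the actual GEOS-Chem or GCHP naming convention; filepaths_sketch is a hypothetical name, and standard-library datetime objects stand in for numpy.datetime64.

```python
import os
from datetime import datetime


def filepaths_sketch(datadir, collections, dates,
                     pattern="{col}.{date:%Y%m%d_%H%M}z.nc4"):
    """Build paths[collection_index][date_index] (sketch, placeholder pattern)."""
    return [
        [os.path.join(datadir, pattern.format(col=col, date=date)) for date in dates]
        for col in collections
    ]


dates = [datetime(2019, 7, 1), datetime(2019, 8, 1)]
paths = filepaths_sketch("OutputDir", ["Emissions", "SpeciesConc"], dates)
print(len(paths), len(paths[0]))  # 2 2
print(os.path.basename(paths[1][0]))  # SpeciesConc.20190701_0000z.nc4
```

Indexing with paths[i][j] then gives the file for collections[i] at dates[j].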

gcpy.util.extract_pathnames_from_log(filename, prefix_filter='')

Returns a list of pathnames from a GEOS-Chem log file. This can be used to get a list of files that should be downloaded from gcgrid or from Amazon S3.

Args:
filename: str

GEOS-Chem standard log file

prefix_filter (optional): str

Restricts the output to file paths starting with this prefix (e.g. “/home/ubuntu/ExtData/HEMCO/”). Default value: ‘’

Returns:
data list: list of str

List of full pathnames of data files found in the log file.

Author:

Jiawei Zhuang (jiaweizhuang@g.harvard.edu)

gcpy.util.get_gcc_filepath(outputdir, collection, day, time)

Routine for getting filepath of GEOS-Chem Classic output

Args:
outputdir: str

Path of the OutputDir directory

collection: str

Name of output collection, e.g. Emissions or SpeciesConc

day: str

Day number of output, e.g. 31

time: str

Z time of output, e.g. 1200z

Returns:
filepath: str

Path of requested file

gcpy.util.get_gchp_filepath(outputdir, collection, day, time)

Routine for getting filepath of GCHP output

Args:
outputdir: str

Path of the OutputDir directory

collection: str

Name of output collection, e.g. Emissions or SpeciesConc

day: str

Day number of output, e.g. 31

time: str

Z time of output, e.g. 1200z

Returns:
filepath: str

Path of requested file

gcpy.util.get_nan_mask(data)

Create a mask with NaN values removed from an input array

Args:
data: numpy array

Input array possibly containing NaNs

Returns:
new_data: numpy array

Original array with NaN values removed

gcpy.util.all_zero_or_nan(ds)

Return whether ds is all zeros, or all nans

Args:
ds: numpy array

Input GEOS-Chem data

Returns:
all_zero, all_nan: bool, bool

all_zero is whether ds is all zeros; all_nan is whether ds is all NaNs.
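The two flags can be sketched on a flat list of values. This is a plain-Python illustration of the checks, not the NumPy-based implementation; all_zero_or_nan_sketch is a hypothetical name.

```python
import math


def all_zero_or_nan_sketch(values):
    """Return (all_zero, all_nan) for a flat sequence of floats (sketch)."""
    all_zero = all(value == 0 for value in values)
    all_nan = all(math.isnan(value) for value in values)
    return all_zero, all_nan


print(all_zero_or_nan_sketch([0.0, 0.0]))            # (True, False)
print(all_zero_or_nan_sketch([math.nan, math.nan]))  # (False, True)
print(all_zero_or_nan_sketch([0.0, 1.5]))            # (False, False)
```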

gcpy.util.dataset_mean(ds, dim='time', skipna=True)

Convenience wrapper for taking the mean of an xarray Dataset.

Args:
ds: xarray Dataset

Input data

Keyword Args:
dim: str

Dimension over which the mean will be taken. Default: “time”

skipna: bool

Flag to omit missing values from the mean. Default: True

Returns:
ds_mean: xarray Dataset or None

Dataset containing mean values. Will return None if ds is not defined.

gcpy.util.dataset_reader(multi_files)

Returns a function to read an xarray Dataset.

Args:
multi_files: bool

Denotes whether we will be reading multiple files into an xarray Dataset. Default value: False

Returns:

reader : either xr.open_mfdataset or xr.open_dataset

gcpy.util.read_config_file(config_file)

Reads configuration information from a YAML file.