mxtaltools.dataset_utils.dataset_manager

class mxtaltools.dataset_utils.dataset_manager.DataManager(datasets_path, device='cpu', mode='standard', chunks_path=None, seed=0, config=None, dataset_type=None, do_crystal_indexing=True)[source]

Bases: object

assign_regression_targets()[source]
compute_edges(conv_cutoff: float, buffer: float | None = 0.5)[source]
compute_filter(condition)[source]

Apply the given filters to atoms and molecules. Crystals may have Z' > 1, so the formatting has to be adjusted slightly.

dataset_filtration(filter_conditions, filter_duplicate_molecules, filter_polymorphs)[source]
extract_misc_stats_and_indices(dataset)[source]
filter_duplicate_molecules()[source]

Find duplicate examples and pick one representative per molecule.

filter_polymorphs()[source]

Find duplicate examples and pick one representative per molecule.

filter_protons()[source]
generate_mol2crystal_mapping()[source]

Some crystals contain multiple molecules, and batch analysis of molecules uses a separate indexing scheme. The following dicts connect the crystal-identifier-wise and molecule-wise indexing.
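The two-way indexing described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the function name, the dict names, and the input format (molecule count per crystal identifier) are all assumptions made for the example.

```python
def build_crystal_molecule_maps(molecules_per_crystal):
    """Build two dicts linking crystal identifiers and flat molecule indices.

    molecules_per_crystal: dict mapping identifier -> number of molecules
    (hypothetical input format, chosen only for illustration).
    """
    crystal_to_mols = {}
    mol_to_crystal = {}
    mol_index = 0
    for identifier, n_mols in molecules_per_crystal.items():
        # molecules are numbered consecutively across the whole dataset
        mol_inds = list(range(mol_index, mol_index + n_mols))
        crystal_to_mols[identifier] = mol_inds
        for m in mol_inds:
            mol_to_crystal[m] = identifier
        mol_index += n_mols
    return crystal_to_mols, mol_to_crystal


# illustrative identifiers; the second crystal holds two molecules
crystal_to_mols, mol_to_crystal = build_crystal_molecule_maps(
    {"ACSALA": 1, "BENZEN": 2}
)
```

With both dicts in hand, a per-molecule result can be traced back to its crystal, and a per-crystal filter can be expanded to all of its molecules.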

get_condition_values(condition_key)[source]
get_data_dimensions()[source]
get_dataset_filter_inds(filter_conditions)[source]

Identify indices that do not pass the given filter conditions. Each condition has the format [column_name, condition_type, [min, max] or [set]], where condition_type is one of 'range', 'in', 'not_in'.
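To make the condition format concrete, here is a small sketch of how such conditions could be evaluated. It is not the method's actual code: the helper name, the column names, and the choice to treat 'range' bounds as inclusive are assumptions.

```python
import numpy as np


def failing_indices(columns, filter_conditions):
    """Return indices of rows that fail at least one condition.

    columns: dict of column_name -> sequence of values (hypothetical layout).
    filter_conditions: list of [column_name, condition_type, values].
    """
    n_rows = len(next(iter(columns.values())))
    bad = np.zeros(n_rows, dtype=bool)
    for column_name, condition_type, values in filter_conditions:
        col = np.asarray(columns[column_name])
        if condition_type == "range":
            low, high = values
            # assumption: both bounds are inclusive
            bad |= (col < low) | (col > high)
        elif condition_type == "in":
            bad |= ~np.isin(col, list(values))
        elif condition_type == "not_in":
            bad |= np.isin(col, list(values))
        else:
            raise ValueError(f"unknown condition_type: {condition_type}")
    return np.flatnonzero(bad)


columns = {
    "num_atoms": [12, 45, 80, 30],
    "space_group": [14, 2, 19, 14],
}
conditions = [
    ["num_atoms", "range", [10, 50]],
    ["space_group", "in", [2, 14]],
]
bad_inds = failing_indices(columns, conditions)  # only row 2 fails
```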

get_identifier_duplicates()[source]

Group entries by CSD identifier. CSD entries with numbers on the end are subsequent additions to the same crystal, often polymorphs or repeat measurements.

There is also an option for grouping identifiers by blind test sample submission.
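The identifier grouping can be illustrated with a short sketch. It assumes that identifiers consist of a letter code optionally followed by trailing digits, and that stripping those digits recovers the shared base; the function name and sample identifiers are for illustration only, and the blind test grouping is not covered.

```python
import re
from collections import defaultdict


def group_by_base_identifier(identifiers):
    """Map each base identifier (trailing digits removed) to dataset indices."""
    groups = defaultdict(list)
    for index, identifier in enumerate(identifiers):
        base = re.sub(r"\d+$", "", identifier)
        groups[base].append(index)
    return dict(groups)


identifiers = ["ACSALA", "ACSALA01", "BENZEN", "UREAXX", "UREAXX12"]
groups = group_by_base_identifier(identifiers)

# bases that occur more than once are candidate polymorphs or repeat measurements
duplicates = {base: inds for base, inds in groups.items() if len(inds) > 1}
```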

get_reduced_volume_fraction()[source]
get_target()[source]
identify_unique_molecules_in_crystals()[source]

Identify all exactly unique molecules (up to molecular fingerprint) and list their dataset indices in a dict.

At train time, this can be used to repeat sampling of identical molecules.
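A rough sketch of the fingerprint-to-indices dict follows. The fingerprint representation (a short tuple of bits) and the function name are assumptions for the example; the library may compute and store fingerprints differently.

```python
from collections import defaultdict


def index_by_fingerprint(fingerprints):
    """Map each distinct fingerprint to the dataset indices that share it."""
    groups = defaultdict(list)
    for index, fingerprint in enumerate(fingerprints):
        # tuples are hashable, so they can serve as dict keys
        groups[tuple(fingerprint)].append(index)
    return dict(groups)


# toy fingerprints; entries 0 and 2 describe the same molecule
fingerprints = [(1, 0, 1), (0, 1, 1), (1, 0, 1), (0, 0, 1)]
unique_molecules = index_by_fingerprint(fingerprints)

# one representative per molecule: the first index in each group
representatives = sorted(inds[0] for inds in unique_molecules.values())
```

Keeping the full index lists, rather than only the representatives, is what allows identical molecules to be sampled repeatedly later.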

init_atom_properties()[source]
load_chunks(chunks_patterns=None, max_chunks=100000000.0, subsamples_per_chunk=100000000.0)[source]
load_dataset_for_modelling(dataset_name, override_length=None, filter_conditions=None, filter_polymorphs=False, filter_duplicate_molecules=False, filter_protons=False, conv_cutoff: float | None = None, do_shuffle: bool = True, precompute_edges: bool = False, single_identifier=None)[source]
Parameters:
  • precompute_edges (bool)

  • do_shuffle

  • conv_cutoff (float)

  • dataset_name

  • override_length

  • filter_conditions

  • filter_polymorphs

  • filter_duplicate_molecules

  • filter_protons

load_training_dataset(dataset_name)[source]
molecule_cluster_dataset_processing(dataset_name)[source]
molecule_cluster_edge_indexing(conv_cutoff)[source]

Prepopulate edge information, which is expensive to compute repeatedly. This will not work if noise is added to the coordinates.
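The reason precomputed edges break under coordinate noise can be seen in a small sketch. This uses a plain NumPy all-pairs distance check as a stand-in for whatever radius-graph routine the library uses; the function name and toy coordinates are assumptions.

```python
import numpy as np


def radius_edges(positions, cutoff):
    """Return a (2, n_edges) array of directed edges between atoms within cutoff."""
    positions = np.asarray(positions, dtype=float)
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # exclude self-loops on the diagonal
    mask = (dists <= cutoff) & ~np.eye(len(positions), dtype=bool)
    src, dst = np.nonzero(mask)
    return np.stack([src, dst])


positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
edges = radius_edges(positions, cutoff=1.5)  # only atoms 0 and 1 are linked

# after moving atom 2, the stored edges no longer match the geometry
moved = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
edges_after_move = radius_edges(moved, cutoff=1.5)
```

Any change to the coordinates can move pairs across the cutoff, so cached edges are only valid for fixed geometries.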

process_new_dataset(new_dataset_name: str = None, test_dataset_size: int = 10000, max_chunks: int = 100000000.0, chunks_patterns: list = None, samples_per_chunk=100000000.0, build_stats: bool = True, save_dataset=True)[source]
rebuild_crystal_indices()[source]
remove_zpg1_info()[source]
truncate_and_shuffle_dataset(override_length=None, do_shuffle=True)[source]

Defines the train/test split as well as the overall dataset size.
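As a hedged illustration of the two arguments, the sketch below shuffles a list of sample indices and then truncates it. The order of the two steps, the seeding, and the function name are assumptions, and the actual train/test split logic is not reproduced here.

```python
import numpy as np


def truncate_and_shuffle(n_samples, override_length=None, do_shuffle=True, seed=0):
    """Return the sample indices kept after optional shuffling and truncation."""
    rng = np.random.default_rng(seed)
    indices = np.arange(n_samples)
    if do_shuffle:
        indices = rng.permutation(indices)
    if override_length is not None:
        # assumption: truncation happens after shuffling
        indices = indices[: min(override_length, n_samples)]
    return indices


kept = truncate_and_shuffle(10, override_length=4)
kept_in_order = truncate_and_shuffle(10, override_length=4, do_shuffle=False)
```

Shuffling before truncating means a shortened dataset is a random subsample rather than simply the first entries on disk.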