mxtaltools.dataset_utils.dataset_manager
- class mxtaltools.dataset_utils.dataset_manager.DataManager(datasets_path, device='cpu', mode='standard', chunks_path=None, seed=0, config=None, dataset_type=None, do_crystal_indexing=True)
Bases: object
- compute_filter(condition)
Apply the given filter to atom- and molecule-level data. Since crystals may have Z' > 1, the formatting needs some adjustment.
- filter_duplicate_molecules()
Find duplicate examples and pick one representative per molecule.
- filter_polymorphs()
Find duplicate examples and pick one representative per molecule.
- generate_mol2crystal_mapping()
Some crystals contain multiple molecules, and molecules are analyzed in batches under a separate indexing scheme. This method builds dicts that connect the crystal-identifier-wise indexing to the molecule-wise indexing.
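The two-way indexing described above can be sketched as follows. This is an illustration only, not the library's implementation: it assumes each crystal is summarized by an identifier and a molecule count, and `build_index_maps`, `crystal_to_mols`, and `mol_to_crystal` are hypothetical names.

```python
def build_index_maps(mols_per_crystal):
    """Map crystal identifiers to flat molecule indices and back.

    mols_per_crystal: dict of identifier -> number of molecules
    (an assumed input format, chosen for illustration).
    """
    crystal_to_mols = {}
    mol_to_crystal = {}
    mol_index = 0
    for identifier, n_mols in mols_per_crystal.items():
        # Molecules are numbered consecutively across all crystals.
        inds = list(range(mol_index, mol_index + n_mols))
        crystal_to_mols[identifier] = inds
        for i in inds:
            mol_to_crystal[i] = identifier
        mol_index += n_mols
    return crystal_to_mols, mol_to_crystal


# Placeholder identifiers: one crystal with one molecule, one with two.
crystal_to_mols, mol_to_crystal = build_index_maps({"AAAAAA": 1, "BBBBBB": 2})
```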
- get_dataset_filter_inds(filter_conditions)
Identify the indices of entries that do not pass the given filter conditions. Each condition has the format [column_name, condition_type, [min, max] or [set]], where condition_type is one of 'range', 'in', or 'not_in'.
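To make the condition format concrete, here is a minimal standalone sketch that evaluates such conditions on a list of dicts. It is not the method's actual code: the function `failing_indices`, the column names, and the choice of inclusive range bounds are all assumptions made for illustration.

```python
def failing_indices(rows, filter_conditions):
    """Return sorted indices of rows that fail at least one condition."""
    bad = set()
    for column, condition_type, values in filter_conditions:
        for i, row in enumerate(rows):
            x = row[column]
            if condition_type == "range":
                low, high = values
                passed = low <= x <= high  # inclusive bounds are an assumption
            elif condition_type == "in":
                passed = x in values
            elif condition_type == "not_in":
                passed = x not in values
            else:
                raise ValueError(f"unknown condition_type: {condition_type}")
            if not passed:
                bad.add(i)
    return sorted(bad)


# Hypothetical columns, for illustration only.
rows = [
    {"num_atoms": 12, "space_group": "P21/c"},
    {"num_atoms": 80, "space_group": "P-1"},
    {"num_atoms": 30, "space_group": "C2/c"},
]
conditions = [
    ["num_atoms", "range", [5, 50]],
    ["space_group", "in", ["P21/c", "P-1"]],
]
bad_inds = failing_indices(rows, conditions)
```

Here row 1 fails the range condition and row 2 fails the set-membership condition, so both indices are reported.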
- get_identifier_duplicates()
Find duplicates by CSD identifier. CSD entries with numbers on the end are subsequent additions to the same crystal, often polymorphs or repeat measurements.
There is also an option for grouping identifiers by blind test sample submission.
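One way to group entries that share a base identifier is to strip the trailing digits, as in the sketch below. This is an assumption about how such grouping could be done, not the method's confirmed logic, and `group_by_base_identifier` is a hypothetical helper.

```python
import re
from collections import defaultdict


def group_by_base_identifier(identifiers):
    """Group dataset indices whose identifiers differ only by trailing digits."""
    groups = defaultdict(list)
    for index, identifier in enumerate(identifiers):
        base = re.sub(r"\d+$", "", identifier)  # drop the numeric suffix
        groups[base].append(index)
    # Keep only bases that occur more than once, i.e. actual duplicates.
    return {base: inds for base, inds in groups.items() if len(inds) > 1}


duplicates = group_by_base_identifier(
    ["ACSALA", "ACSALA01", "BENZEN", "ACSALA13"]
)
```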
- identify_unique_molecules_in_crystals()
Identify all exactly unique molecules (up to the molecular fingerprint) and list their dataset indices in a dict.
At train time, this can be used to repeat sampling of identical molecules.
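The fingerprint-keyed dict can be pictured with the short sketch below. It assumes fingerprints are available as hashable tuples; `index_unique_molecules` is a hypothetical name, and how the library computes fingerprints is not shown here.

```python
from collections import defaultdict


def index_unique_molecules(fingerprints):
    """Map each distinct fingerprint to the dataset indices that share it."""
    unique = defaultdict(list)
    for index, fingerprint in enumerate(fingerprints):
        unique[tuple(fingerprint)].append(index)
    return dict(unique)


# Toy fingerprints: entries 0 and 2 describe the same molecule.
fingerprints = [(0, 1, 1), (1, 0, 0), (0, 1, 1)]
unique_molecules = index_unique_molecules(fingerprints)
```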
- load_chunks(chunks_patterns=None, max_chunks=100000000.0, subsamples_per_chunk=100000000.0)
- load_dataset_for_modelling(dataset_name, override_length=None, filter_conditions=None, filter_polymorphs=False, filter_duplicate_molecules=False, filter_protons=False, conv_cutoff: float | None = None, do_shuffle: bool = True, precompute_edges: bool = False, single_identifier=None)
- molecule_cluster_edge_indexing(conv_cutoff)
Prepopulate edge information, which is expensive to compute repeatedly. The precomputed edges will not be valid if noise is added to the coordinates.
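The sketch below shows why cached edges stop being valid once coordinates are noised. It assumes edges are defined by a simple distance cutoff; `radius_edges` is a hypothetical brute-force helper, not the library's edge-construction routine.

```python
import math


def radius_edges(coords, cutoff):
    """Return directed (i, j) pairs of distinct atoms within the cutoff."""
    edges = []
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i != j and math.dist(a, b) <= cutoff:
                edges.append((i, j))
    return edges


coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
edges = radius_edges(coords, cutoff=2.0)  # computed once and cached

# After perturbing the coordinates, atom 1 is no longer within the cutoff
# of atom 0, so the cached edge list no longer matches the geometry.
noised = [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
stale = edges != radius_edges(noised, cutoff=2.0)
```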