FastEHR.dataloader.dataset.dataset_polars

Classes

PolarsDataset

Initialize the dataset and establish a connection to the database.

Module Contents

class FastEHR.dataloader.dataset.dataset_polars.PolarsDataset(path_to_db)

Initialize the dataset and establish a connection to the database.

save_path = None

path_to_db

collector

fit(path: str, practice_inclusion_conditions: list[str] | None = None, include_static: bool = True, include_diagnoses: bool = True, include_measurements: bool = True, overwrite_practice_ids: str | None = None, overwrite_meta_information: str | None = None, num_threads: int = 1, **kwargs)

Creates a deep-learning-friendly dataset by extracting structured data from an SQLite database.

This function loads information from SQLite tables into Polars frames in chunks and processes them into a format suitable for deep learning models. The processing includes:

Loading SQLite data into Polars frames.
Combining and aligning frames into a lazy Polars representation.
Iteratively computing normalization statistics, counts, or other meta-information.
Saving Polars frames to Parquet format.
Creating a dictionary (hash map) of file lookups for fast access (faster than native PyArrow solutions).
Splitting data into training, validation, and test sets.
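
The chunked loading and iterative statistics steps listed above can be sketched with the standard library alone. This is a minimal illustration, not FastEHR's implementation: it uses `sqlite3` cursors instead of Polars frames, and the `measurement` table, its columns, and the `running_statistics` helper are hypothetical names chosen for the example.

```python
import math
import sqlite3

# Build a small in-memory database so the sketch is runnable.
# The table and column names are hypothetical, not FastEHR's schema.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE measurement (patient_id INTEGER, value REAL)")
connection.executemany(
    "INSERT INTO measurement VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 11)],  # values 1.0 .. 10.0
)


def running_statistics(conn, chunk_size=4):
    """Accumulate count, mean and population standard deviation chunk by chunk."""
    count, total, total_sq = 0, 0.0, 0.0
    cursor = conn.execute("SELECT value FROM measurement")
    while True:
        rows = cursor.fetchmany(chunk_size)  # only one chunk is held in memory
        if not rows:
            break
        for (value,) in rows:
            count += 1
            total += value
            total_sq += value * value
    if count == 0:
        return 0, math.nan, math.nan
    mean = total / count
    variance = max(total_sq / count - mean * mean, 0.0)
    return count, mean, math.sqrt(variance)


count, mean, std = running_statistics(connection)
```

Only running sums are kept between chunks, so memory use depends on `chunk_size` rather than on the size of the table.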

Parameters

path (str): Full path to the folder where the Parquet dataset, meta-information, and file lookup pickles will be stored.
practice_inclusion_conditions (list[str], optional): A list of SQL conditions to filter practices when querying the collector. Example: ["COUNTRY = 'E'"] to include only practices from England.
include_static (bool, optional, default=True): Whether to include static information in the meta-information.
include_diagnoses (bool, optional, default=True): Whether to include diagnoses in the meta-information and the Parquet dataset.
include_measurements (bool, optional, default=True): Whether to include measurements in the meta-information and the Parquet dataset.
overwrite_practice_ids (str, optional, default=None): Path to a file containing practice ID allocations for the train/validation/test splits. This is useful for aligning datasets, for example when creating a fine-tuning dataset from a Foundation Model dataset, so that the same practices stay in the same split and data leakage is avoided.
overwrite_meta_information (str, optional, default=None): If provided, this should be a path to an existing meta-information pickle file. This allows skipping redundant pre-processing when using precomputed quantile bounds for some measurements.
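
To illustrate why reusing practice ID allocations prevents leakage, the sketch below splits at the practice level and then looks up a practice's split from the stored allocation instead of re-drawing it. The dictionary layout, the split fractions, and the helper names `allocate_practices` and `split_for` are assumptions made for this example; the file format FastEHR expects for `overwrite_practice_ids` is not described here.

```python
import random


def allocate_practices(practice_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign each practice ID to exactly one of train/val/test."""
    ids = sorted(practice_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def split_for(practice_id, allocations):
    """Look up the split a practice was assigned to, or None if unseen."""
    for split, ids in allocations.items():
        if practice_id in ids:
            return split
    return None


# Hypothetical practice identifiers.
practice_ids = [f"p{i:03d}" for i in range(20)]
allocations = allocate_practices(practice_ids)

# A later dataset (for example, a fine-tuning cohort) reuses the same mapping,
# so a practice seen during pre-training never moves into another split.
later_split = split_for(practice_ids[0], allocations)
```

Because every patient inherits the split of their practice, no practice contributes records to more than one split.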

Other Parameters

drop_empty_dynamic (bool, default=True): Whether to remove patients with no recorded dynamic events.
drop_missing_data (bool, default=True): Whether to remove records with missing data.
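
A call to `fit` might look like the following sketch. The argument names come from the signature documented here; the paths are placeholders, and passing `drop_empty_dynamic` and `drop_missing_data` through `**kwargs` is an assumption based on their listing under "Other Parameters". The import sits inside a hypothetical `build_dataset` helper so the snippet can be read and loaded without FastEHR installed.

```python
# Keyword arguments taken from the documented signature; values are placeholders.
fit_kwargs = {
    "path": "/path/to/output_dataset/",
    "practice_inclusion_conditions": ["COUNTRY = 'E'"],
    "include_static": True,
    "include_diagnoses": True,
    "include_measurements": True,
    "num_threads": 1,
    # Assumed to be forwarded through **kwargs (see "Other Parameters").
    "drop_empty_dynamic": True,
    "drop_missing_data": True,
}


def build_dataset(path_to_db: str) -> None:
    """Hypothetical helper: requires FastEHR and a populated SQLite database."""
    from FastEHR.dataloader.dataset.dataset_polars import PolarsDataset

    dataset = PolarsDataset(path_to_db)
    dataset.fit(**fit_kwargs)
```

Calling `build_dataset("/path/to/database.db")` would then run the full pipeline against that database.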

Notes

The SQLite collector cannot be pickled, so the class attributes are saved separately.
Future work should aim to pickle this class instead of storing separate attributes.
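
The pickling limitation can be reproduced with the standard library. The sketch below is illustrative only and assumes the collector holds a plain `sqlite3` connection; the attribute dictionary is a hypothetical stand-in for the class attributes that are saved separately.

```python
import pickle
import sqlite3

# Hypothetical attribute values; ":memory:" keeps the sketch self-contained.
attributes = {"path_to_db": ":memory:", "save_path": None}
connection = sqlite3.connect(attributes["path_to_db"])

# An open SQLite connection cannot be pickled directly.
try:
    pickle.dumps(connection)
    connection_is_picklable = True
except TypeError:
    connection_is_picklable = False

# Plain attributes pickle without trouble, so they can be stored on their own
blob = pickle.dumps(attributes)

# and used later to open a fresh connection.
restored = pickle.loads(blob)
reopened = sqlite3.connect(restored["path_to_db"])
```

One common route to pickling the whole object is to drop the connection in `__getstate__` and reopen it from `path_to_db` in `__setstate__`; whether that suits this class is left open here.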