FastEHR.dataloader.dataset.dataset_polars

Classes

PolarsDataset

Initialize the dataset and establish a connection to the database.

Module Contents

class FastEHR.dataloader.dataset.dataset_polars.PolarsDataset(path_to_db)

Initialize the dataset and establish a connection to the database.

save_path = None
path_to_db
collector
fit(path: str, practice_inclusion_conditions: list[str] | None = None, include_static: bool = True, include_diagnoses: bool = True, include_measurements: bool = True, overwrite_practice_ids: str | None = None, overwrite_meta_information: str | None = None, num_threads: int = 1, **kwargs)

Creates a deep-learning-friendly dataset by extracting structured data from an SQLite database.

This function loads information from SQLite tables into Polars frames in chunks and processes them into a format suitable for deep learning models. The processing includes:

  • Loading SQLite data into Polars frames.

  • Combining and aligning frames into a lazy Polars representation.

  • Iteratively computing normalization statistics, counts, or other meta-information.

  • Saving Polars frames to Parquet format.

  • Creating a hashmap dictionary for fast lookups (faster than native PyArrow solutions).

  • Splitting data into training, validation, and test sets.
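The chunked loading and iterative statistics steps above can be illustrated with a minimal sketch. It uses only the standard library rather than Polars, and the table name `measurement`, its columns, and the helper names `iter_chunks` and `running_stats` are assumptions made for illustration; they are not part of FastEHR.

```python
import sqlite3


def iter_chunks(conn, query, chunk_size):
    """Yield lists of at most `chunk_size` rows from a query."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


def running_stats(conn, chunk_size=2):
    """Accumulate count, sum and sum of squares per event, one chunk at a time."""
    totals = {}
    # Hypothetical table and column names, used only for this sketch.
    query = "SELECT event, value FROM measurement WHERE value IS NOT NULL"
    for rows in iter_chunks(conn, query, chunk_size):
        for event, value in rows:
            n, s, ss = totals.get(event, (0, 0.0, 0.0))
            totals[event] = (n + 1, s + value, ss + value * value)

    stats = {}
    for event, (n, s, ss) in totals.items():
        mean = s / n
        variance = max(ss / n - mean * mean, 0.0)  # population variance
        stats[event] = {"count": n, "mean": mean, "std": variance ** 0.5}
    return stats


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (patient_id INTEGER, event TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?)",
    [(1, "bmi", 20.0), (2, "bmi", 30.0), (3, "sbp", 120.0),
     (4, "bmi", 25.0), (5, "sbp", 140.0)],
)
stats = running_stats(conn)
```

Because only running totals are kept between chunks, memory use depends on the number of distinct events rather than the number of rows.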

Parameters

path : str

Full path to the folder where the Parquet dataset, meta-information, and file lookup pickles will be stored.

practice_inclusion_conditions : list[str], optional

A list of SQL conditions to filter practices when querying the collector. Example: ["COUNTRY = 'E'"] to include only practices from England.
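As a rough illustration of how such conditions might be applied, the sketch below joins them into a WHERE clause. The table name `practices`, the helper `build_practice_query`, and the choice to combine conditions with AND are all assumptions; the collector's actual query logic is not shown in this reference.

```python
import sqlite3


def build_practice_query(conditions=None):
    """Build a practice query, AND-ing together any raw SQL conditions."""
    # Hypothetical table and column names, used only for this sketch.
    query = "SELECT practice_id FROM practices"
    if conditions:
        query += " WHERE " + " AND ".join(f"({c})" for c in conditions)
    return query


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE practices (practice_id TEXT, COUNTRY TEXT)")
conn.executemany(
    "INSERT INTO practices VALUES (?, ?)",
    [("p1", "E"), ("p2", "W"), ("p3", "E")],
)

query = build_practice_query(["COUNTRY = 'E'"])
selected = [row[0] for row in conn.execute(query + " ORDER BY practice_id")]
```

Conditions of this kind are raw SQL fragments, so they should come only from trusted code and never from unvalidated user input.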

include_static : bool, optional, default=True

Whether to include static information in the meta-information.

include_diagnoses : bool, optional, default=True

Whether to include diagnoses in the meta-information and the Parquet dataset.

include_measurements : bool, optional, default=True

Whether to include measurements in the meta-information and the Parquet dataset.

overwrite_practice_ids : str, optional, default=None

Path to a file containing practice ID allocations for the train/validation/test splits. This is useful for aligning datasets, for example when creating a fine-tuning dataset from a Foundation Model dataset, so that the same practices stay in the same split and data leakage is prevented.
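The idea of reusing practice-level allocations can be sketched as follows. The file format (a pickled dict with `train`, `val` and `test` keys), the split fractions, and the helper names `allocate_practices` and `load_or_allocate` are assumptions for illustration only; the format FastEHR actually expects is not specified here.

```python
import pickle
import random
import tempfile
from pathlib import Path


def allocate_practices(practice_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly allocate whole practices to train/val/test splits."""
    ids = sorted(practice_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def load_or_allocate(practice_ids, allocation_path=None):
    """Reuse a stored allocation if a path is given, else create a new one."""
    if allocation_path is not None:
        with open(allocation_path, "rb") as f:
            return pickle.load(f)
    return allocate_practices(practice_ids)


practice_ids = [f"p{i}" for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; stands in for the pre-training allocation file.
    path = Path(tmp) / "practice_id_splits.pickle"
    original = allocate_practices(practice_ids)
    with open(path, "wb") as f:
        pickle.dump(original, f)
    # A second dataset built later reuses the same allocation.
    reused = load_or_allocate(practice_ids, allocation_path=path)
```

Splitting by practice rather than by patient keeps every patient from a given practice in a single split, and reloading the stored allocation ensures a practice held out during pre-training is not used for training during fine-tuning.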

overwrite_meta_information : str, optional

If provided, this should be a path to an existing meta-information pickle file. This allows skipping redundant pre-processing when using precomputed quantile bounds for some measurements.

Other Parameters

drop_empty_dynamic : bool, default=True

Whether to remove patients with no recorded dynamic events.

drop_missing_data : bool, default=True

Whether to remove records with missing data.

Notes

  • The SQLite collector cannot be pickled, so the class attributes are saved separately.

  • Future work should aim to pickle this class instead of storing separate attributes.
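The pickling limitation can be reproduced with the standard library alone. The sketch below shows that a `sqlite3.Connection` raises `TypeError` when pickled, and one way to persist only the picklable attributes. The attribute values and the helper `picklable_subset` are hypothetical and do not reflect how FastEHR stores its state.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")

# A live SQLite connection cannot be serialized by pickle.
try:
    pickle.dumps(conn)
    connection_picklable = True
except TypeError:
    connection_picklable = False

# Hypothetical attribute values standing in for a dataset object's state.
state = {
    "path_to_db": ":memory:",
    "save_path": "/tmp/dataset/",
    "collector": conn,
}


def picklable_subset(attrs):
    """Keep only the attributes that pickle can serialize."""
    kept = {}
    for name, value in attrs.items():
        try:
            pickle.dumps(value)
        except (TypeError, pickle.PicklingError):
            continue  # skip unpicklable values such as DB connections
        kept[name] = value
    return kept


saved = pickle.loads(pickle.dumps(picklable_subset(state)))
```

On reload, a connection can be re-established from the stored `path_to_db`. Pickling the whole class, as the note suggests for future work, would typically mean dropping the connection in `__getstate__` and reopening it in `__setstate__`.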