FastEHR.dataloader.dataset.dataset_polars
Classes
PolarsDataset | Initialize the dataset and establish a connection to the database.
Module Contents
- class FastEHR.dataloader.dataset.dataset_polars.PolarsDataset(path_to_db)
Initialize the dataset and establish a connection to the database.
- save_path = None
- path_to_db
- collector
- fit(path: str, practice_inclusion_conditions: list[str] | None = None, include_static: bool = True, include_diagnoses: bool = True, include_measurements: bool = True, overwrite_practice_ids: str | None = None, overwrite_meta_information: str | None = None, num_threads: int = 1, **kwargs)
Creates a deep-learning-friendly dataset by extracting structured data from an SQLite database.
This function loads information from SQLite tables into Polars frames in chunks and processes them into a format suitable for deep learning models. The processing includes:
- Loading SQLite data into Polars frames.
- Combining and aligning frames into a lazy Polars representation.
- Iteratively computing normalization statistics, counts, or other meta-information.
- Saving Polars frames to Parquet format.
- Creating a hashmap dictionary for fast lookups (faster than native PyArrow solutions).
- Splitting data into training, validation, and test sets.
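The chunked loading and iterative statistics steps can be illustrated with a minimal sketch. It uses only the standard library rather than Polars, and the table name `measurement` and its columns are hypothetical, chosen for illustration; this is not FastEHR's implementation. It shows how a mean and standard deviation can be accumulated without ever holding the full table in memory.

```python
import math
import sqlite3

# Hypothetical table and column names, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (PATIENT_ID INTEGER, VALUE REAL)")
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 11)],  # values 1.0 through 10.0
)


def chunked_mean_std(connection, chunk_size=4):
    """Accumulate count, sum, and sum of squares one chunk at a time."""
    cursor = connection.execute("SELECT VALUE FROM measurement")
    count, total, total_sq = 0, 0.0, 0.0
    while True:
        rows = cursor.fetchmany(chunk_size)  # at most chunk_size rows in memory
        if not rows:
            break
        for (value,) in rows:
            count += 1
            total += value
            total_sq += value * value
    mean = total / count
    # Population variance derived from the running sums.
    variance = total_sq / count - mean * mean
    return count, mean, math.sqrt(max(variance, 0.0))


count, mean, std = chunked_mean_std(conn)
```

The sum-of-squares form keeps the sketch short; for data with a large mean relative to its spread, a numerically stabler update such as Welford's algorithm would be a reasonable substitute.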
Parameters
- path : str
Full path to the folder where the Parquet dataset, meta-information, and file lookup pickles will be stored.
- practice_inclusion_conditions : list[str], optional
A list of SQL conditions to filter practices when querying the collector. Example: ["COUNTRY = 'E'"] to include only practices from England.
- include_static : bool, optional, default=True
Whether to include static information in the meta-information.
- include_diagnoses : bool, optional, default=True
Whether to include diagnoses in the meta-information and the Parquet dataset.
- include_measurements : bool, optional, default=True
Whether to include measurements in the meta-information and the Parquet dataset.
- overwrite_practice_ids : str, optional, default=None
Path to a file containing practice ID allocations for the train/test/validation splits. This is useful for aligning datasets, for example when creating a fine-tuning dataset from a Foundation Model dataset, so that the same practices stay in the same split and data leakage is prevented.
- overwrite_meta_information : str, optional
If provided, this should be a path to an existing meta-information pickle file. This allows skipping redundant pre-processing when using precomputed quantile bounds for some measurements.
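The reasoning behind `overwrite_practice_ids` can be sketched as follows: whole practices are allocated to splits once, and a later dataset reuses that allocation instead of drawing a new one. The helper names, the split fractions, and the in-memory dictionary format are assumptions made for this illustration; the source does not specify the actual file format or split logic.

```python
import random


def split_practices(practice_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly allocate whole practices to train/val/test (hypothetical helper)."""
    ids = sorted(practice_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


def restrict_to_existing(allocation, available_ids):
    """Reuse a saved allocation, keeping only practices present in the new dataset."""
    available = set(available_ids)
    return {
        name: [p for p in ids if p in available]
        for name, ids in allocation.items()
    }


# Allocation made once for the (hypothetical) foundation-model dataset.
foundation_practices = [f"p{i:03d}" for i in range(20)]
allocation = split_practices(foundation_practices)

# A fine-tuning dataset covering a subset of practices reuses that allocation,
# so no practice seen in pre-training lands in the fine-tuning test set.
fine_tune_practices = foundation_practices[::2]
fine_tune_allocation = restrict_to_existing(allocation, fine_tune_practices)
```

Splitting at the practice level, rather than per patient or per record, is what keeps the two datasets aligned.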
Other Parameters
- drop_empty_dynamic : bool, default=True
Whether to remove patients with no recorded dynamic events.
- drop_missing_data : bool, default=True
Whether to remove records with missing data.
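A call using the parameters documented above might look like the following sketch. The wrapper function `build_dataset` and its two arguments are hypothetical, any paths passed in are the caller's own, and the two `drop_*` options are assumed to be forwarded through `**kwargs` as listed under Other Parameters.

```python
def build_dataset(path_to_db, output_dir):
    """Build a Parquet dataset from an SQLite database (hypothetical wrapper)."""
    # Imported lazily so the sketch can be read without FastEHR installed.
    from FastEHR.dataloader.dataset.dataset_polars import PolarsDataset

    dataset = PolarsDataset(path_to_db)
    dataset.fit(
        path=output_dir,
        # Only practices from England, as in the documented example.
        practice_inclusion_conditions=["COUNTRY = 'E'"],
        include_static=True,
        include_diagnoses=True,
        include_measurements=True,
        # Assumed to be accepted through **kwargs.
        drop_empty_dynamic=True,
        drop_missing_data=True,
    )
    return dataset
```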
Notes
The SQLite collector cannot be pickled, so the class attributes are saved separately.
Future work should aim to pickle this class instead of storing separate attributes.
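One common way to make an object that holds a database connection picklable is to drop the connection in `__getstate__` and reopen it in `__setstate__`. The sketch below uses a hypothetical stand-in class, `PicklableDataset`, with a raw `sqlite3` connection in place of the collector; it is an illustration of the general technique, not the FastEHR class or its planned design.

```python
import pickle
import sqlite3


class PicklableDataset:
    """Hypothetical stand-in, not the FastEHR class."""

    def __init__(self, path_to_db):
        self.path_to_db = path_to_db
        self.save_path = None
        self.connection = sqlite3.connect(path_to_db)

    def __getstate__(self):
        # Copy the attributes and leave out the unpicklable connection.
        state = self.__dict__.copy()
        del state["connection"]
        return state

    def __setstate__(self, state):
        # Restore the attributes, then reopen the connection from the stored path.
        self.__dict__.update(state)
        self.connection = sqlite3.connect(self.path_to_db)


# A raw SQLite connection cannot be pickled directly.
try:
    pickle.dumps(sqlite3.connect(":memory:"))
    connection_picklable = True
except TypeError:
    connection_picklable = False

original = PicklableDataset(":memory:")
original.save_path = "/tmp/dataset/"  # placeholder value
restored = pickle.loads(pickle.dumps(original))
```

With this pattern the whole object round-trips through `pickle`, and only the connection is rebuilt on load.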