FastEHR.dataloader.dataset.dataset_polars
=========================================

.. py:module:: FastEHR.dataloader.dataset.dataset_polars


Classes
-------

.. autoapisummary::

   FastEHR.dataloader.dataset.dataset_polars.PolarsDataset


Module Contents
---------------

.. py:class:: PolarsDataset(path_to_db)

   Initialize the dataset and establish a connection to the database.


   .. py:attribute:: save_path
      :value: None


   .. py:attribute:: path_to_db


   .. py:attribute:: collector


   .. py:method:: fit(path: str, practice_inclusion_conditions: Optional[list[str]] = None, include_static: bool = True, include_diagnoses: bool = True, include_measurements: bool = True, overwrite_practice_ids: Optional[str] = None, overwrite_meta_information: Optional[str] = None, num_threads: int = 1, **kwargs)

      Create a deep-learning-friendly dataset by extracting structured data from an SQLite database.

      This method loads information from SQLite tables into Polars frames in chunks and
      processes them into a format suitable for deep learning models. The processing includes:

      - Loading SQLite data into Polars frames.
      - Combining and aligning frames into a lazy Polars representation.
      - Iteratively computing normalization statistics, counts, or other meta-information.
      - Saving Polars frames to Parquet format.
      - Creating a hashmap dictionary for fast lookups (faster than native PyArrow solutions).
      - Splitting data into training, validation, and test sets.

      Parameters
      ----------
      path : str
          Full path to the folder where the Parquet dataset, meta-information, and file
          lookup pickles will be stored.
      practice_inclusion_conditions : list[str], optional
          A list of SQL conditions used to filter practices when querying the collector.
          Example: ``["COUNTRY = 'E'"]`` to include only practices from England.
      include_static : bool, optional, default=True
          Whether to include static information in the meta-information.
      include_diagnoses : bool, optional, default=True
          Whether to include diagnoses in the meta-information and the Parquet dataset.
      include_measurements : bool, optional, default=True
          Whether to include measurements in the meta-information and the Parquet dataset.
      overwrite_practice_ids : str, optional, default=None
          Path to the file containing practice ID allocations for the train/test/validation
          splits. This is useful for aligning datasets, for example when creating a
          fine-tuning dataset from a Foundation Model dataset, to prevent data leakage.
      overwrite_meta_information : str, optional
          If provided, this should be a path to an existing meta-information pickle file.
          This allows skipping redundant pre-processing when using precomputed quantile
          bounds for some measurements.

      Other Parameters
      ----------------
      drop_empty_dynamic : bool, default=True
          Whether to remove patients with no recorded dynamic events.
      drop_missing_data : bool, default=True
          Whether to remove records with missing data.

      Notes
      -----
      - The SQLite collector **cannot be pickled**, so the class attributes are saved separately.
      - Future work should aim to pickle this class instead of storing separate attributes.
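
The note about the collector not being picklable can be illustrated with a minimal sketch. It is not the FastEHR implementation: ``ToyDataset``, ``picklable_state``, ``save_state``, and ``load_state`` are hypothetical names, and a plain ``sqlite3`` connection stands in for the collector. It only shows why an object holding a live SQLite connection fails to pickle, and how storing the remaining attributes separately works around that.

```python
import pickle
import sqlite3
import tempfile
from pathlib import Path


class ToyDataset:
    """Hypothetical stand-in for a dataset that owns a live SQLite connection."""

    def __init__(self, path_to_db):
        self.path_to_db = path_to_db
        self.save_path = None
        # A live connection plays the role of the collector (assumption).
        self.collector = sqlite3.connect(path_to_db)

    def picklable_state(self):
        # Every attribute except the connection, which pickle rejects.
        return {k: v for k, v in vars(self).items() if k != "collector"}


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


def save_state(dataset, file_path):
    """Write the picklable attributes of dataset to file_path."""
    with open(file_path, "wb") as f:
        pickle.dump(dataset.picklable_state(), f)


def load_state(file_path):
    """Rebuild a ToyDataset, reopening the connection from the stored path."""
    with open(file_path, "rb") as f:
        state = pickle.load(f)
    dataset = ToyDataset(state["path_to_db"])
    dataset.save_path = state["save_path"]
    return dataset


# Small demonstration using an in-memory database.
original = ToyDataset(":memory:")
original.save_path = "some/output/folder"
print("whole object picklable:", is_picklable(original))

with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "state.pickle"
    save_state(original, state_file)
    restored = load_state(state_file)

print("restored save_path:", restored.save_path)
original.collector.close()
restored.collector.close()
```

In this sketch the connection is reopened from ``path_to_db`` on load, so only plain attributes ever reach the pickle file. Making the class itself picklable, as the note suggests for future work, would typically mean applying the same exclusion inside ``__getstate__`` and ``__setstate__``.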