FastEHR.adapters.BEHRT
======================

.. py:module:: FastEHR.adapters.BEHRT


Classes
-------

.. autoapisummary::

   FastEHR.adapters.BEHRT.ConvertToBEHRT
   FastEHR.adapters.BEHRT.BehrtDFBuilder


Module Contents
---------------

.. py:class:: ConvertToBEHRT(tokenizer, supervised=False)

   Convert tokenized FastEHR patient sequences into BEHRT-compatible format.

   This adapter:

   - Extends an existing FastEHR tokenizer to include BEHRT's required special tokens.
   - Converts sequences of events grouped by visit into the token/age format expected by BEHRT,
     adding ``[CLS]`` at the start and ``[SEP]`` between visits.
   - Retains values (despite not being used in BEHRT).
   - Removes baseline information (e.g. ethnicity, gender) as this is not used by BEHRT.

   Attributes

   - **special_tokens** (dict[str, int]): Mapping of BEHRT special tokens to fixed IDs:
     PAD=0, UNK=1, SEP=2, CLS=3, MASK=4.
   - **fastehr_tokenizer** (object): Original FastEHR tokenizer instance passed at init.
   - **supervised** (bool): Whether conversion targets a supervised task (affects final SEP).
   - **tokenizer** (dict[str, int]): Token to index mapping incl. BEHRT specials and original codes.

   Example::

       >>> converter = ConvertToBEHRT(fastehr_tokenizer)
       >>> processed_list_of_patient_dicts = converter(list_of_patient_dicts)


   .. py:attribute:: special_tokens


   .. py:method:: create_behrt_tokenizer(tokenizer)


   .. py:attribute:: fastehr_tokenizer


   .. py:attribute:: supervised
      :value: False


   .. py:attribute:: tokenizer


   .. py:method:: convert_sample(data_sample: dict)


.. py:class:: BehrtDFBuilder(token_map: dict, pad_token: Union[int, str] = 'PAD', class_token: Union[int, str] = 'CLS', sep_token: Union[int, str] = 'SEP', id_prefix: str = 'P', zfill: int = 3, min_seq_len: int = 5)

   Build a BEHRT-ready DataFrame from batches of token and age tensors.
   Each batch must be shaped [batch_size, seq_len].

   Parameters
   ----------
   token_map : dict
       Mapping from token string to token id (BEHRT-modified vocab).
   pad_token, class_token, sep_token : str or int
       Special tokens (as names or ids)
   id_prefix : str
       Prefix for generated patient IDs.
   zfill : int
       Zero-padding length for patient IDs.
   min_seq_len : int
       Minimum number of non-CLS/SEP tokens required to keep a sample.
       Defaults to 5 as per BEHRT paper.


   .. py:attribute:: class_token_id
      :value: 'CLS'


   .. py:attribute:: pad_token_id
      :value: 'PAD'


   .. py:attribute:: sep_token_id
      :value: 'SEP'


   .. py:attribute:: id_prefix
      :value: 'P'


   .. py:attribute:: zfill
      :value: 3


   .. py:attribute:: min_seq_len
      :value: 5


   .. py:attribute:: rows
      :value: []


   .. py:attribute:: next_id
      :value: 1


   .. py:method:: add_batch(tokens_batch, ages_batch, target_event=None, target_time=None, target_value=None)

      Add a batch of sequences to the builder.

      :param tokens_batch: Batch of token sequences; each element is a string token (or an integer ID).
      :type tokens_batch: torch.Tensor, shape ``[B, T]``
      :param ages_batch: Ages aligned with ``tokens_batch``.
      :type ages_batch: torch.Tensor, shape ``[B, T]``
      :param target_event: Outcome event token/ID for each sequence, or ``None``.
      :type target_event: torch.Tensor or None, shape ``[B]``
      :param target_time: Time-to-event measured from the last token in ``tokens_batch``, or ``None``.
      :type target_time: torch.Tensor or None, shape ``[B]``
      :param target_value: Value associated with the outcome event, or ``None``.
      :type target_value: torch.Tensor or None, shape ``[B]``


   .. py:method:: flush() -> pandas.DataFrame

      Return a DataFrame of all accumulated rows and clear the buffer.
      This helps manage memory when processing large datasets.
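
The accumulate-then-flush pattern described for ``BehrtDFBuilder`` can be sketched in plain Python. This is a minimal illustration, not the library's implementation: ``RowBuffer`` and the ``patient_id``/``tokens``/``ages`` row keys are hypothetical, plain lists stand in for tensors, and ``flush`` returns a list of dicts rather than a ``pandas.DataFrame`` so the sketch has no dependencies. It also assumes that PAD is excluded from the ``min_seq_len`` count alongside CLS and SEP, and that dropped samples do not consume a patient ID; the source confirms neither.

```python
from typing import Dict, List, Sequence


class RowBuffer:
    """Hypothetical stand-in for the accumulate-then-flush pattern."""

    def __init__(self, special_ids: Sequence[int], id_prefix: str = "P",
                 zfill: int = 3, min_seq_len: int = 5) -> None:
        self.special_ids = set(special_ids)  # IDs not counted as real events
        self.id_prefix = id_prefix
        self.zfill = zfill
        self.min_seq_len = min_seq_len
        self.rows: List[Dict[str, object]] = []
        self.next_id = 1

    def add_batch(self, tokens_batch, ages_batch) -> None:
        """Append one row per sequence that has enough non-special tokens."""
        for tokens, ages in zip(tokens_batch, ages_batch):
            n_real = sum(1 for t in tokens if t not in self.special_ids)
            if n_real < self.min_seq_len:
                continue  # too short: drop the sample (assumed behaviour)
            patient_id = f"{self.id_prefix}{str(self.next_id).zfill(self.zfill)}"
            self.rows.append(
                {"patient_id": patient_id, "tokens": list(tokens), "ages": list(ages)}
            )
            self.next_id += 1

    def flush(self) -> List[Dict[str, object]]:
        """Return the accumulated rows and reset the buffer."""
        out, self.rows = self.rows, []
        return out


# Special IDs follow the documented mapping: PAD=0, SEP=2, CLS=3.
buf = RowBuffer(special_ids=(0, 2, 3), min_seq_len=2)
buf.add_batch(
    tokens_batch=[[3, 10, 11, 2, 0], [3, 10, 2, 0, 0]],
    ages_batch=[[0, 40, 40, 40, 0], [0, 50, 50, 0, 0]],
)
flushed = buf.flush()   # rows collected so far
after = buf.flush()     # buffer is empty after the first flush
```

Flushing after every few batches keeps the in-memory row list small; each flushed chunk can be written to disk before the next batch is added.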