FastEHR.adapters.BEHRT¶
Classes¶
Convert tokenized FastEHR patient sequences into BEHRT-compatible format. |
|
Build a BEHRT-ready DataFrame from batches of token and age tensors. |
Module Contents¶
- class FastEHR.adapters.BEHRT.ConvertToBEHRT(tokenizer, supervised=False)¶
Convert tokenized FastEHR patient sequences into BEHRT-compatible format.
This adapter:
Extends an existing FastEHR tokenizer to include BEHRT’s required special tokens.
Converts sequences of events grouped by visit into the token/age format expected by BEHRT, adding [CLS] at the start and [SEP] between visits.
Retains values (despite not being used in BEHRT).
Removes baseline information (e.g. ethnicity, gender) as this is not used by BEHRT.
Attributes
special_tokens (dict[str, int]): Mapping of BEHRT special tokens to fixed IDs: PAD=0, UNK=1, SEP=2, CLS=3, MASK=4.
fastehr_tokenizer (object): Original FastEHR tokenizer instance passed at init.
supervised (bool): Whether conversion targets a supervised task (affects final SEP).
tokenizer (dict[str, int]): Token to index mapping incl. BEHRT specials and original codes.
Example:
>>> converter = ConvertToBEHRT(fastehr_tokenizer) >>> processed_list_of_patient_dicts = converter(list_of_patient_dicts)
- special_tokens¶
- create_behrt_tokenizer(tokenizer)¶
- fastehr_tokenizer¶
- supervised = False¶
- tokenizer¶
- convert_sample(data_sample: dict)¶
- class FastEHR.adapters.BEHRT.BehrtDFBuilder(token_map: dict, pad_token: int | str = 'PAD', class_token: int | str = 'CLS', sep_token: int | str = 'SEP', id_prefix: str = 'P', zfill: int = 3, min_seq_len: int = 5)¶
Build a BEHRT-ready DataFrame from batches of token and age tensors.
Each batch must be shaped [batch_size, seq_len].
Parameters¶
- token_mapdict
Mapping from token string to token id (BEHRT-modified vocab).
- pad_token, class_token, sep_tokenstr or int
Special tokens (as names or ids)
- id_prefixstr
Prefix for generated patient IDs.
- zfillint
Zero-padding length for patient IDs.
- min_seq_lenint
Minimum number of non-CLS/SEP tokens required to keep a sample. Defaults to 5 as per BEHRT paper.
- class_token_id = 'CLS'¶
- pad_token_id = 'PAD'¶
- sep_token_id = 'SEP'¶
- id_prefix = 'P'¶
- zfill = 3¶
- min_seq_len = 5¶
- rows = []¶
- next_id = 1¶
- add_batch(tokens_batch, ages_batch, target_event=None, target_time=None, target_value=None)¶
Add a batch of sequences to the builder.
- Parameters:
tokens_batch (torch.Tensor, shape
[B, T]) – Batch of token sequences; each element is a string token (or an integer ID).ages_batch (torch.Tensor, shape
[B, T]) – Ages aligned withtokens_batch.target_event (torch.Tensor or None, shape
[B]) – Outcome event token/ID for each sequence, orNone.target_time (torch.Tensor or None, shape
[B]) – Time-to-event measured from the last token intokens_batch, orNone.target_value (torch.Tensor or None, shape
[B]) – Value associated with the outcome event, orNone.
- flush() pandas.DataFrame¶
Return a DataFrame of all accumulated rows and clear the buffer. This helps manage memory when processing large datasets.