FastEHR.adapters.BEHRT
======================

.. py:module:: FastEHR.adapters.BEHRT


Classes
-------

.. autoapisummary::

   FastEHR.adapters.BEHRT.ConvertToBEHRT
   FastEHR.adapters.BEHRT.BehrtDFBuilder


Module Contents
---------------

.. py:class:: ConvertToBEHRT(tokenizer, supervised=False)


   Convert tokenized FastEHR patient sequences into BEHRT-compatible format.

   This adapter:

   - Extends an existing FastEHR tokenizer to include BEHRT's required special tokens.
   - Converts sequences of events grouped by visit into the token/age format
     expected by BEHRT, adding ``[CLS]`` at the start and ``[SEP]`` between visits.
   - Retains event values, although BEHRT does not use them.
   - Removes baseline information (e.g. ethnicity, gender) as this is not used by BEHRT.
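
   The visit-to-sequence step described above can be sketched roughly as follows.
   This is an illustration only, not the adapter's actual code: the function name
   ``flatten_visits``, the ``(age, codes)`` input layout, the ``final_sep`` flag, and
   the choice to give each special token the age of its neighbouring visit are all
   assumptions.

   .. code-block:: python

      def flatten_visits(visits, final_sep=True):
          """Flatten visits into parallel token and age lists.

          ``visits`` is a list of ``(age, [codes])`` pairs, one per visit
          (a hypothetical layout chosen for this sketch).
          """
          if not visits:
              return [], []
          tokens = ["CLS"]
          ages = [visits[0][0]]  # assumption: CLS takes the first visit's age
          for i, (age, codes) in enumerate(visits):
              tokens.extend(codes)
              ages.extend([age] * len(codes))
              is_last = i == len(visits) - 1
              # SEP closes every visit; the trailing one is optional here
              if not is_last or final_sep:
                  tokens.append("SEP")
                  ages.append(age)  # assumption: SEP takes its visit's age
          return tokens, ages


      tokens, ages = flatten_visits([(40, ["D1", "M1"]), (41, ["D2"])])

   The two lists stay aligned position by position, which is the property the
   token/age format depends on.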

   Attributes

   - **special_tokens** (dict[str, int]): Mapping of BEHRT special tokens to fixed IDs:
     PAD=0, UNK=1, SEP=2, CLS=3, MASK=4.
   - **fastehr_tokenizer** (object): Original FastEHR tokenizer instance passed at init.
   - **supervised** (bool): Whether conversion targets a supervised task (affects final SEP).
   - **tokenizer** (dict[str, int]): Token-to-index mapping, including the BEHRT special tokens and the original codes.


   Example::

       >>> converter = ConvertToBEHRT(fastehr_tokenizer)
       >>> processed_list_of_patient_dicts = converter(list_of_patient_dicts)
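
   For intuition about the ``tokenizer`` attribute, here is a minimal sketch of a
   vocabulary that reserves the fixed special-token IDs listed under Attributes.
   How the adapter re-indexes the original FastEHR codes is not documented here, so
   appending them after the special tokens is an assumption, and ``extend_vocab`` is
   a hypothetical helper rather than part of the API.

   .. code-block:: python

      # Fixed IDs as documented for ``special_tokens``
      SPECIAL_TOKENS = {"PAD": 0, "UNK": 1, "SEP": 2, "CLS": 3, "MASK": 4}


      def extend_vocab(codes):
          """Build a token -> id mapping with the reserved IDs first."""
          vocab = dict(SPECIAL_TOKENS)
          for code in codes:
              if code not in vocab:
                  # assumption: original codes are numbered after the specials
                  vocab[code] = len(vocab)
          return vocab


      vocab = extend_vocab(["DIABETES", "STATIN"])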


   .. py:attribute:: special_tokens


   .. py:method:: create_behrt_tokenizer(tokenizer)


   .. py:attribute:: fastehr_tokenizer


   .. py:attribute:: supervised
      :value: False


   .. py:attribute:: tokenizer


   .. py:method:: convert_sample(data_sample: dict)


.. py:class:: BehrtDFBuilder(token_map: dict, pad_token: Union[int, str] = 'PAD', class_token: Union[int, str] = 'CLS', sep_token: Union[int, str] = 'SEP', id_prefix: str = 'P', zfill: int = 3, min_seq_len: int = 5)

   Build a BEHRT-ready DataFrame from batches of token and age tensors.

   Each batch must have shape ``[batch_size, seq_len]``.

   Parameters
   ----------
   token_map : dict
       Mapping from token string to token id (BEHRT-modified vocab).
   pad_token, class_token, sep_token : str or int
       Special tokens, given either as names or as IDs.
   id_prefix : str
       Prefix for generated patient IDs.
   zfill : int
       Zero-padding length for patient IDs.
   min_seq_len : int
       Minimum number of non-CLS/SEP tokens required to keep a sample.
       Defaults to 5, following the BEHRT paper.
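
   To make ``id_prefix``, ``zfill`` and ``min_seq_len`` concrete, the sketch below
   shows one plausible reading of them. Both helpers are hypothetical and are not
   taken from the builder's source. In particular, excluding padding from the length
   count is an assumption; the parameter description only mentions CLS and SEP.

   .. code-block:: python

      def make_patient_id(n, id_prefix="P", zfill=3):
          """Format a running counter as a zero-padded patient ID."""
          return f"{id_prefix}{str(n).zfill(zfill)}"


      def keep_sample(tokens, min_seq_len=5, special=("CLS", "SEP", "PAD")):
          """Keep a sample only if it has enough non-special tokens."""
          # assumption: PAD is ignored alongside CLS and SEP
          n_events = sum(1 for t in tokens if t not in special)
          return n_events >= min_seq_len


      first_id = make_patient_id(1)
      long_enough = keep_sample(["CLS", "A", "B", "SEP", "C", "D", "E", "SEP"])
      too_short = keep_sample(["CLS", "A", "SEP", "PAD", "PAD", "PAD", "PAD"])

   Note that ``str.zfill`` only pads; a counter wider than ``zfill`` digits is
   left as-is rather than truncated.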


   .. py:attribute:: class_token_id
      :value: 'CLS'


   .. py:attribute:: pad_token_id
      :value: 'PAD'


   .. py:attribute:: sep_token_id
      :value: 'SEP'


   .. py:attribute:: id_prefix
      :value: 'P'


   .. py:attribute:: zfill
      :value: 3


   .. py:attribute:: min_seq_len
      :value: 5


   .. py:attribute:: rows
      :value: []


   .. py:attribute:: next_id
      :value: 1


   .. py:method:: add_batch(tokens_batch, ages_batch, target_event=None, target_time=None, target_value=None)

      Add a batch of sequences to the builder.

      :param tokens_batch: Batch of token sequences; each element is a string token
          (or an integer ID).
      :type tokens_batch: torch.Tensor, shape ``[B, T]``
      :param ages_batch: Ages aligned with ``tokens_batch``.
      :type ages_batch: torch.Tensor, shape ``[B, T]``
      :param target_event: Outcome event token/ID for each sequence, or ``None``.
      :type target_event: torch.Tensor or None, shape ``[B]``
      :param target_time: Time-to-event measured from the last token in ``tokens_batch``,
          or ``None``.
      :type target_time: torch.Tensor or None, shape ``[B]``
      :param target_value: Value associated with the outcome event, or ``None``.
      :type target_value: torch.Tensor or None, shape ``[B]``


   .. py:method:: flush() -> pandas.DataFrame

      Return a DataFrame of all accumulated rows and clear the buffer.
      This helps manage memory when processing large datasets.
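
      The accumulate-then-flush pattern can be sketched without any dependencies as
      below. ``RowBuffer`` is a hypothetical stand-in, not the builder itself: it
      returns a plain list of row dicts, whereas the real method returns a
      ``pandas.DataFrame``.

      .. code-block:: python

         class RowBuffer:
             """Hypothetical stand-in for the builder's row buffer."""

             def __init__(self):
                 self.rows = []

             def add(self, row):
                 self.rows.append(row)

             def flush(self):
                 # Hand the accumulated rows to the caller and start a fresh
                 # list, so the buffer keeps no reference to flushed rows.
                 out, self.rows = self.rows, []
                 return out


         buf = RowBuffer()
         buf.add({"patient_id": "P001", "n_tokens": 6})
         first = buf.flush()   # contains the one row added above
         second = buf.flush()  # empty: the buffer was cleared by the first call

      Flushing after every few batches and writing each result to disk keeps the
      number of rows held in memory bounded.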