How to implement/modify opera_beam_search()
It is important to set the key_position input parameter, as it determines which range of input tokens is considered within the local window of self-attention!
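A minimal sketch of how the call site might look. Apart from key_position itself, the keyword names (opera_decoding, scale_factor, threshold, num_attn_candidates, penalty_weights) and the example offsets are assumptions about the modified generate() interface, not confirmed against the codebase.

```python
import torch

def run_opera_generation(model, input_ids, image_tensor, num_img_tokens=576, img_start=5):
    """Hypothetical call-site sketch for OPERA decoding; offsets are illustrative."""
    prompt_len = input_ids.shape[1]
    # key_position tells opera_beam_search() which token spans the over-trust
    # penalty and the rollback should look at (image span vs. generated response).
    key_position = {
        "image_start": img_start,
        "image_end": img_start + num_img_tokens,
        "response_start": prompt_len + num_img_tokens,
    }
    with torch.inference_mode():
        return model.generate(
            input_ids,
            images=image_tensor,
            num_beams=5,
            max_new_tokens=512,
            do_sample=False,
            opera_decoding=True,        # assumed flag enabling opera_beam_search()
            key_position=key_position,
            scale_factor=50,            # sigma: scale-up applied to attention values
            threshold=15,               # rollback trigger (overlap count)
            num_attn_candidates=5,      # N_can: candidate set size per beam
            penalty_weights=1.0,        # alpha: weight of the over-trust penalty
        )
```

The image span depends on the prompt template and visual encoder (e.g., 576 tokens for a 24x24 patch grid), so in practice the offsets should be computed from the actual tokenized prompt rather than hard-coded.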
Abstract
Hallucination often relates to knowledge-aggregation patterns manifested in the self-attention matrix (the LLM focuses on only a few summary tokens rather than all previous tokens)
OPERA introduces a penalty term on the model logits during beam-search decoding, along with a rollback strategy that retrospects the presence of summary tokens among the previously generated tokens
Introduction
Recurring pattern: hallucination tends to appear right after a columnar attention pattern emerges
Some tokens serve as summary tokens (often called anchor tokens), which let the LLM aggregate previous information onto a few anchor tokens at shallow layers and predict the next token based on these anchors at deep layers
In MLLMs, vision tokens are input first, but vision information diminishes as information is relayed through the summary tokens
With OPERA's over-trust penalty, beam-search candidates exhibiting the over-trust pattern become unlikely to be selected.
Rollback strategy: retrospection is triggered when the locations of the maximum in-window penalty scores overlap (i.e., recent decoding steps keep pointing to the same summary-token position) at least a threshold number of times.
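A minimal sketch of the trigger condition, assuming the argmax column location of the column-wise scores is recorded at each recent decoding step; the function name and threshold default are illustrative.

```python
from collections import Counter
from typing import List, Optional, Tuple

def should_rollback(max_score_locations: List[int], threshold: int = 15) -> Tuple[bool, Optional[int]]:
    """Trigger retrospection when recent decoding steps keep selecting the same
    column (summary-token position) as the maximum of the in-window penalty scores.

    max_score_locations: argmax column index of the column-wise scores,
                         recorded at each recent decoding step.
    """
    if not max_score_locations:
        return False, None
    location, overlap = Counter(max_score_locations).most_common(1)[0]
    return overlap >= threshold, location
```

When the trigger fires, decoding rolls back to the position just after the repeatedly selected summary token and re-selects the next token there, excluding the token that was originally chosen (the rollback itself is not sketched here).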
Method
MLLM Input formulation
MLLM Model Forward
MLLM Decoding
Over-Trust Logit Penalty
Gather the previous self-attention weights within a local window (the most recent generated tokens) to characterize the knowledge-aggregation pattern
Preprocess by filling the upper triangle of the local attention matrix with zeros and scaling up the attention values
Then perform the column-wise multiplication over the lower triangle of the attention matrix to obtain a vector of column-wise scores
Lastly, select the top-N_can tokens from the logits of each beam to form the candidate set Y; each candidate's logit is then reduced by the weighted maximum column-wise score as the over-trust penalty, as in the sketch below
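A minimal sketch of the penalty computation, assuming the local self-attention window has already been reduced (e.g., averaged over heads and a chosen layer) into a (k, k) matrix per candidate; the sigma and alpha defaults and the per-candidate attention extraction are assumptions about implementation details.

```python
import torch

def column_wise_scores(local_attn: torch.Tensor, sigma: float = 50.0) -> torch.Tensor:
    """Column-wise knowledge-aggregation scores for a (k, k) local attention window.

    Zero the upper triangle, scale up the small post-softmax values, then take the
    product down each column of the lower triangle.
    """
    k = local_attn.size(0)
    w = torch.tril(local_attn) * sigma
    scores = torch.empty(k)
    for j in range(k):
        # product over the entries of column j on and below the diagonal,
        # i.e. how strongly later tokens keep attending back to token j
        scores[j] = torch.prod(w[j:, j])
    return scores

def apply_over_trust_penalty(cand_logits: torch.Tensor,
                             cand_local_attn: torch.Tensor,
                             alpha: float = 1.0) -> torch.Tensor:
    """Penalize each candidate's logit by the maximum column-wise score of the
    window that would result from appending that candidate.

    cand_logits:     (N_can,) logits of the top-N_can candidates of one beam.
    cand_local_attn: (N_can, k, k) local attention window per candidate, whose
                     last row is the candidate's own attention over the window.
    """
    penalties = torch.stack([column_wise_scores(a).max() for a in cand_local_attn])
    return cand_logits - alpha * penalties
```

Scaling by sigma before the column-wise product keeps the product of many sub-1 attention values from underflowing; a natural way to obtain each candidate's attention row is a one-step forward pass with that candidate appended to the beam.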
Retrospection-Allocation Strategy