LLM based RecSys Paper Review

Paper	Data	Method	Summary	Limitation
CLLM4Rec		1. soft(user/item) + hard(vocab) prompting strategy

novel mutual regularization strategy
fine-tuning strategy - pretrained backbone에 multinomial likelihood를 add하여 Autoregressive하게 multiple items을 recommend 하게 함
Stochastic reordering strategy to ignore the order of item tokens (interaction history) without negative influence on vocab tokens(textual features) | Instead of using user/item embeddings —> NLP prompt for target user, interacted items, etc. Query —> info retrieved as textual output from LLM (zero shot) First, extend the vocab of pretrained LLMs with user/item ID tokens to model semantics (embeddings learned in 2 stages)

Framework: Alpaca tuning: employ self-instruct data made available by Alpaca. Utilize conditional language modeling objective = maximizing log probability of instruction output given instruction input. Lightweight Tuning: Alternative to intensive and heavy fine tuning. Can achieve comparable performance by tuning only small sub set of parameters LoRA = freezing the pretrained model params and introducing trainable rank decomposition matrices into each layer of the Transformer architecture

Backbone selection: LLaMA | Instruction Tuning: Step1: Define a task and articulate instruction with NLP Step2: Formulate and construct the input and output of task Step3: Integrate Task Instruction and Task Input to form Instruction Input and take Task output as Instruction output Step4: Tune LLMs based on pairs of Instruction Input and Instruction Output Rec-tuning Task Formulation: Categorize task input into two type: users’ liked items and disliked items Task Output as Yes or No with target new item that user has never seen —> rec tuning 자체가 fine tuning 기법이라는 것일까? 아님 rec dataset으로 embedding fine tuning을 하고 추가적으로 instruction tuning을 하는 것일까? = alpaca tuning을 하고 rec tuning을 하는 것이긴 함 | | | ID vs Modality | MIND: new clicks dataset by Microsoft HM: clothing purchase dataset by H&M Bili: comment dataset from online video rec platform | pretrained modality encoders: BERT, VIT IDRec beats MoRec in the past and had great advantage in warm start scenario Experiments for items recommendation with text and vision modality. | RQ

IDRec vs MoRec —> two-tower based DSSM and session based SASRec 으로 실험
Can NLP, Vision help accuracty improvement in MoRec? —> NLP, Vision이 성능 향상을 보여주는 실험
how to utilize item modality representation —> Two stage paradigm (Extract modality features and add to the recommendation model)
key challenges that MoRec needs to improve

In MoRec, modality encoder generate the representation of raw modality feature and use it to replace the ID embedding vector in IDRec | | | CoLLM | | Question: A user has given high ratings to the following items: <HisItemTitleList>. Additionally, we have information about the user’s preferences encoded in the feature <UserID>. Using all available information, make a prediction about whether the user would enjoy the item titled <TargetItemTitle> with the feature <TargetItemID>? Answer with “Yes” or “No”. #Answer: | When constructing prompts, we add user/item ID fields in addition to text descriptions to represent collaborative information

When encoding prompts, alongside the LLMs’ tokenization and embedding for encoding textual information, we employ a conventional collaborative model to generate user/item representations that capture collaborative information, and map them into the token embedding space of the LLM.

Prompt Construction: we describe items using their titles and describe users by the item titles from their historical interactions.

In this template, “<HisItemTitleList>” represents a list of item titles that a user has interacted with, ordered by their interaction timestamps, serving as textual descriptions of the user’s preferences. “<TargetItemTitle>” refers to the title of the target item to be predicted. The “<UserID>” and “<TargetItemID>” fields are utilized to incorporate user and item IDs, respectively, for injecting collaborative information.

**We treat user/item IDs as a type of feature for users/items within the prompt.

Hybrid Encoding:** Prompt에 있는 userid 와 targetItemID 를 다시 conventional collaborative recommender에 넣을 수 있게 Collaborative Information Encoding을 적용함.

Tokenization result for prompt [t1,t2,…,tk, userID, t_k+1, …, targetItemID, …, t_K] 이거를 [e_t1, e_t2, …e_u, …, e_i, …e_tK] 로 encode token embeddings를 d dimension으로 표현 using Embedding lookup e_u, e_i는 CIE module 사용 MF 와 같은 u, i matrix와 mapping layer g( . ) 를 사용하여 extract collaborative information for LLM usage. | | | LLaRA | | LLaRA represents items in LLM’s input prompts using a novel hybrid approach that integrates ID-based item embeddings from traditional recommenders with textual item features.

Instead of directly exposing the hybrid prompt to LLMs, we apply a curriculum learning approach to gradually ramp up training complexity.

We first warm up the LLM with text-only prompting, which aligns more naturally with the LLM’s language modeling capabilities.
Then, we progressively transition to hybrid prompting, training the adapter to incorporate behavioral knowledge from the traditional sequential recommender into the LLM. | Item Representation: item i 에 대해 conventional recommender model 의 item embedding 사용 Title,Descriptions: LLM tokenizer 씀

item embedding 을 바로 쓰기에는 LLM token들과 modality 차이 가 있으니까 MLP layer projec 한 behaviour token으로 변환해서 prompt에 사용. 그래서 text token for item + behaviour token으로 사용ex. Titanic [emb_14], Roman Holiday [emb_20] | | | ReLLa | | 기존 pretrained LLM token 제한에 걸리지 않음에도 불구하고 fails to extract useful information from a textual context of long user behaviour sequence.

Full shot for movie lens 보다 few-shot ReLLa movien lens가 outperform

HARD PROMPT ONLY: descriptive text에서 information is retrieved. Target item is determined and ask LLM if the user likes or not by YES or NO.

ZERO SHOT: No CANDIDATE POOL | MAIN IDEA: Textual context of long user behaviour sequence에서 유의미한 user behaviour를 뽑기 위해 **semantic user behaviour retrieval (SUBR)**를 사용. (instead of just truncating top-k recent behaviours)

For few shot, along with SUBR, retrieval enhanced instruction tuning to further promote LLM to handle with long behaviour sequence.

This paper focuses on the click-through rate prediction Pointwise Scoring: LLM에서 나온 estimated score s_i를 가지고 softmax를 통과시켜서 y_i를 구함 —> Yes:1, No:0

SUBR: ”Here is a movie. Its title is Toy Story(1995). The movie’s genre is Animation” 와 같은 item text 를 LLM에 주입! —> average pooling over all hidden states from the last layer of LLM —> PCA를 이용해서 final semantic representation 으로 축소 (d=512) —> each pair of item에 대한 semantic relevance를 cosine similarity로 구함

이 semantic relevance 데이터를 가지고 target item과 높은 유사도를 가지는 item을 history에서 retrieved 함 —> testing dataset with higher data quality.

Retrieval-enhanced Instruction Tuning: 기존 N개의 training dataset이 있다고 하면 SUBR을 통해 뽑은 N개를 더해 2N 개를 사용. —> 이걸 이용해 instruction tuning (2N를 통해 overfitting 방지, robustness, generalization 추가) —>causal language modeling objective for instruction tuning | | | User-LLM | Feb 2024 - Google | Leverage user embeddings to contextualize LLMs User embeddings distilled from self-supervised pre training, capture latent preferences And integrate these with LLMs using **cross attention and soft-prompting

This outperforms text-prompt-based contextualization on long sequence task + being efficient

Downstream tasks for evaluation

Next item prediction
Favorite genre or category prediction
Multimodal review generation** | Two mechanisms: generating user embeddings, distilled from history data and contextualize LLMs with embeddings and **user history in text prompt

In phase one**, we pretrain a Transformer-based encoder on user interaction data, utilizing self-supervised learning to capture behavioral patterns across multiple interaction modalities. We use a multifeature autoregressive Transformer to generate embeddings that capture long-range dependencies and contextual relationships within sequential data while handling multimodal user data effectively.

In phase two, we integrate user embeddings with an LLM during finetuning using crossattention, where the LLM’s intermediate text representations attend to the output embeddings from the pretrained user encoder, enabling dynamic context injection - **similar to how Flamingo works

USER-LLM integrates Perceiver (Jaegle et al., 2021) units into its projection layers. Perceiver is a Transformer-based architecture that utilizes a trainable latent query to extract relevant information from input representations through cross-attention.** | | | CCFCRec | WWW 23 | content CF module + co-occurence CF module 의 contrastive learning을 통한 blurry cold start item embeddings 문제 해결

메인 수식: For this purpose, we regard the CBCE q𝑣 and the COCE z𝑣 as the content view and behavior view of an item 𝑣 and conduct a contrastive learning between the two item views. In particular, during the training phase, the parameters of CBCE encoder 𝑔c will be adjusted according to COCE so as to maximize the mutual information between the two item views. To achieve this goal, for a training item 𝑣, we build its positive sample set as N_v_+ = {𝑣+ : U𝑣 ∩ U𝑣+ ≠ ∅}, i.e., the items with which some user interacted together with 𝑣, and its negative sample set N_v_- = V \N+ 𝑣 . To maximize the mutual information between q𝑣 and z𝑣+ , and minimize that between q𝑣 and z𝑣− , contrastive loss of InfoNCE is used. | co-occurence collaborative signals in warm data 를 이용해서 alleviate the blurryness of collaborative embeddings for cold start item

Co-occurence collaborative signals: positive item embeddings should be closer to user embedding than those of negative items of the same user.

하지만 cold start item 의 embedding을 바로 encode 하기에는 item content 정보만 존재하기 때문에 불가능하다! —> 이 문제를 해결하기 위해 training phase에서 co-occurence signals를 memorize 하게 하고 blurry embeddings을 바로잡게 함. = indirect injection of co-occurrence signals

Maximize the mutual information between content-based collaborative embeddings and co-occurence collaborative embeddings

| |