Rec-GPT4V (2024)

Abstract / Limitation

Backbone: GPT-4V, LLaVA-7b, LLaVA-13b

Limitations of this paper:

  1. Still text-only input to the LLM, so only limited visual features are extracted
  2. VST is still a simple data fusion of (title1, summary1), (title2, summary2), ...

Prior Limitations:

  1. LVLMs lack user-preference knowledge; MMRecSys suffers from a lack of user-item interaction data; and media such as images contain viral marketing highlights. LVLMs do not understand the image sequence of a user's history well -> but they understand sequential info well when it is given as NLP text!

  2. LVLMs suffer setbacks in addressing multiple-image dynamics, which include discrete, noisy, and redundant image sequences. GPT-4V is geared toward video, but MMRSs end up receiving discrete, noisy images(?) -> simply attaching each image to its item title actually performed poorly -> a better fusion strategy than naive concatenation is needed

Key point: "it matters how effectively image information is woven together with user/item information before it is fed in" (not simple concatenation) -> instead of feeding images directly, extract a summary from each image and feed the summaries sequentially!

In-context learning and CoT are LLM reasoning algorithms designed for NLP tasks -> "Thus, effective LVLM-based MMRS requires the design of specific prompting strategies that can utilize their visual comprehension strength without caving to the complexities associated with processing multiple images simultaneously"

Summary

Use a pretrained LVLM as a ranker for recommendation, given a candidate pool.

Give the LVLM sequential understanding through text, plus static-image understanding.

Proposes the Visual Summary Thought (VST) reasoning principle:

  1. Utilize user historical interactions as contextual data (sequence of item titles and images as inputs)
  2. Prompt the LVLM with only one static image at a time to obtain the corresponding summary, instead of handling multiple images at once.
  3. Then construct the user sequence by substituting each image with its textual comprehension.

VST outperforms other reasoning strategies such as direct concatenation, In-context Learning (ICL), and Chain of Thought (CoT). Prompt setups (prompt construction is sketched below):

  1. VST: show items 1 through m (titles plus image summaries) and ask the model to pick from a pool of n candidates.
  2. ICL: show items 1 through m-1, state that item m must be recommended, and then ask for a recommendation (no candidate pool).
  3. CoT: the ICL prompt with only "Please think step by step" added.
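A minimal sketch of how VST-style prompts could be assembled, assuming a generic lvlm_chat(prompt, image=None) wrapper around GPT-4V/LLaVA; the helper names, prompt wording, and the (title, image) history format are illustrative placeholders, not the paper's exact prompts.

```python
# VST prompt construction sketch (hypothetical helper names).

def summarize_image(lvlm_chat, image):
    """Step 2: prompt the LVLM with ONE static image to get a textual summary."""
    return lvlm_chat("Describe the key visual features of this product image.",
                     image=image)

def build_vst_prompt(lvlm_chat, history, candidates):
    """Steps 1 and 3: replace each image with its summary, then ask for a ranking."""
    lines = []
    for i, (title, image) in enumerate(history, start=1):
        summary = summarize_image(lvlm_chat, image)   # one image per LVLM call
        lines.append(f"Item {i}: title='{title}', visual summary='{summary}'")
    lines.append("Candidates: " + ", ".join(candidates))
    lines.append("Based on the user's history above, rank the candidates by how "
                 "likely the user is to interact with them next.")
    return "\n".join(lines)
```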

UniMP (ICLR 2024)

Abstract / Limitation

Previous work is either multi-modal for a single recommendation task, or multi-modal for multiple tasks (VIP5, which lacks flexibility for diverse inputs). UniMP develops this into multi-modal + multi-task via CLIP + an instruction-tuned LLM.

Limitations of this paper:

  1. Naive data fusion (just flattening attribute pairs such as (image, emb), (title, macbook), (brand, apple))
  2. Lack of novelty (mostly combines existing pieces: cross-attention, data fusion, and a unified generation framework)

Previous problems:

  1. Prior methods fail to comprehend information spanning various tasks and modalities, and suffer from the complexity of task- and modality-specific customization -> here, a vision model extracts visual elements and the LLM handles reasoning and generation; visual info is conditioned on the textual representation throughout the layers of the LM.

In multi-modal personalization: "However, all the above methods only focus on single personalized recommendation tasks and fail to model the mixture of visual, textual, and ID data, arranged in arbitrary sequences."

In general recommendation: "However, they fail to fully harness the potential of raw data and do not effectively capture the interactions among them. They lack the required flexibility to effectively handle the diverse input and output requirements inherent in multi-task learning."

Summary

Generative language model for multi-modal personalization.

CLIP + an instruction-tuned 3B LLM (together.ai)

Incorporate heterogeneous user-history information: rich heterogeneous info such as image, category, brand, description, and price. Each item is a dictionary of attribute pairs like {(image, x), (brand, y), ...} -> flatten these attribute pairs into an item context i_c -> each user then has a user sentence u_n = {i_1, i_2, ...} (see the sketch below).
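A minimal sketch of the attribute-pair flattening described above; the attribute names, the <img> placeholder token, and the separators are assumptions for illustration, not UniMP's exact serialization.

```python
# Flatten heterogeneous item attributes into a user "sentence" u_n (sketch).

def flatten_item(attrs):
    """Turn a dict of attribute pairs into a flat item context i_c.
    Image values are kept as placeholder tokens whose embeddings come from CLIP/ViT."""
    parts = []
    for key, value in attrs.items():
        token = "<img>" if key == "image" else str(value)
        parts.append(f"{key}: {token}")
    return " | ".join(parts)

def build_user_sentence(history):
    """Concatenate flattened items i_1, i_2, ... into the user sequence u_n."""
    return " ; ".join(flatten_item(item) for item in history)

u_n = build_user_sentence([
    {"image": "img_001.jpg", "title": "macbook", "brand": "apple", "price": "$999"},
    {"image": "img_002.jpg", "title": "low top sneaker", "brand": "nike"},
])
```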

Fine-grained multi-modal user modeling: cross-attention between a fixed number of ViT visual embeddings and u_n (a sketch follows).
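A minimal sketch of that cross-attention step, where the text-token hidden states of u_n attend to a fixed number of ViT visual embeddings; the dimensions, the use of nn.MultiheadAttention, and the residual connection are illustrative assumptions, not UniMP's exact architecture.

```python
import torch
import torch.nn as nn

d_model, n_visual, seq_len = 512, 32, 128
text_states = torch.randn(1, seq_len, d_model)    # hidden states for the user sentence u_n
visual_embs = torch.randn(1, n_visual, d_model)   # fixed number of ViT visual embeddings

cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
# Text tokens act as queries over the visual embeddings, so each token can pull in image features.
fused, _ = cross_attn(query=text_states, key=visual_embs, value=visual_embs)
fused = fused + text_states                       # residual connection back to the text stream
```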

Integration of personalization tasks: unify multiple tasks together with the user-history information, then perform joint multi-task learning. Multi-task learning is implemented via (loss sketch after this list):

  1. Context reconstruction (loss = task loss + context loss)
  2. Token-level re-weighting (reshape the loss function to re-weight the token-prediction loss)
  3. Retrieval-augmented generation beyond text (can also generate images, not just text) -> e.g. user-guided image generation: first retrieve relevant items based on the history and the given query; second, generate images conditioned on the retrieved images. For example, even for the same query "low top sneaker", different images were generated depending on the colors in the retrieved images.
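A minimal sketch of the combined objective (task loss + context-reconstruction loss) with token-level re-weighting; the weighting scheme and the lambda coefficient are assumptions, not UniMP's exact formulation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits, targets, context_logits, context_targets,
                   token_weights, lam=0.5):
    """logits/context_logits: (B, T, V); targets/context_targets, token_weights: (B, T)."""
    # Per-token task loss, re-weighted so informative tokens count more.
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    task_loss = (per_token * token_weights).sum() / token_weights.sum()
    # Context-reconstruction loss over held-out user-context tokens.
    context_loss = F.cross_entropy(context_logits.transpose(1, 2), context_targets)
    return task_loss + lam * context_loss
```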

Mysterious Projections (Feb 2024)

Abstract / Limitation

Previous limitation: mLLMs have limited capabilities on images from domains such as dermatology and agriculture -> yet even when the whole LLM is fine-tuned, the domain-specific richness of the image's post-projection representation does not improve.

This means domain-specific visual attributes are predominantly modeled by the LLM parameters, and the projection does not play the role of mapping visual attributes into the LLM.

Limitation of this paper: it shows that the projection is not what matters, but so what? How can that finding be applied?

Summary

Experiments with existing domain-specific fine-tuning and studies the role of the projection and the LLM parameters in acquiring domain-specific image-modeling capabilities.

This is done by comparing classification performance when updating only the projection vs. updating the entire LLM including the projection -> however, neither existing approach made the domain-specific features any richer -> the identification of domain-specific image attributes occurs in the LLM parameters, whether frozen or not -> in other words, the projection is not what matters; the LLM can understand visual data sufficiently with minimal assistance from the cross-modal projection!

mLLM components:

  1. cross-modal projection layer that connects image representations with the LLM
  2. LLM that processes projected image representation and text tokens

Previous fine-tuning strategies (both configurations are sketched after this list):

  1. Update the projection while keeping the LLM parameters frozen
  2. Fine-tune the projection and the LLM parameters concurrently
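A minimal sketch of the two fine-tuning configurations in PyTorch, assuming an mLLM object with `projection` and `llm` submodules (hypothetical attribute names).

```python
import torch

def configure_finetuning(model, strategy):
    """Return the trainable parameters for the chosen fine-tuning strategy."""
    if strategy == "projection_only":
        # Strategy 1: update the cross-modal projection, freeze the LLM.
        for p in model.llm.parameters():
            p.requires_grad = False
        for p in model.projection.parameters():
            p.requires_grad = True
    elif strategy == "full":
        # Strategy 2: fine-tune projection and LLM parameters concurrently.
        for p in model.parameters():
            p.requires_grad = True
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch: optimizer = torch.optim.AdamW(configure_finetuning(mllm, "projection_only"), lr=1e-5)
```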

If the projection layer were what matters, the post-projection representations should become richer in domain-specific features. If not (if they do not become richer), then the domain-specific features are being identified or modeled by the LLM parameters.

-> Fine-tuning improves performance, but the expressiveness of domain-specific features in the post-projection representations actually decreases! -> Updating the projection layer merely facilitates the frozen LLM parameters; it does not map image attributes well into the LLM space.
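A hedged sketch of one generic way to quantify how "rich" post-projection representations are in domain-specific features: a simple linear probe trained on frozen embeddings. This is a standard probing setup used here for illustration, not necessarily the paper's exact measurement protocol; the data below is a random placeholder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_richness(post_projection_embs, domain_labels):
    """Higher held-out accuracy suggests more domain-specific info in the embeddings."""
    X_train, X_test, y_train, y_test = train_test_split(
        post_projection_embs, domain_labels, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)

# Placeholder embeddings/labels standing in for real post-projection features.
embs = np.random.randn(200, 512)
labels = np.random.randint(0, 5, size=200)
print(probe_richness(embs, labels))
```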