Multi-Modal RecSys Paper Review

Title	Year	Abstract/Limitation	Summary
MMGCN	2019	Dataset:

Tiktok: micro-videos of 3-15 seconds; the data contain users, micro-videos, and clicks
Kwai: contains privacy-preserving user information, content features of videos, and interaction data (acoustic/audio information is missing)
MovieLens: titles and descriptions are collected from MovieLens and the corresponding trailers are crawled. Visual features are extracted from key frames with ResNet50; the audio track is separated with FFmpeg and encoded with VGGish; text features are learned with Sentence2Vec

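Previous limitations: semantic gaps between different modalities; users have different tastes across the modalities of a micro-video (e.g., a user likes the visuals but dislikes the audio) | The authors argue that good micro-video recommendation requires considering not only the connectivity between users and items but also the various modalities of the item content. They therefore propose MMGCN, a model that uses the user-item graph structure to learn a representation for each modality. Its aggregation layer and combination layer let the model reflect the characteristics of each modality, and the authors show that it outperforms the compared methods on the Tiktok, Kwai, and MovieLens datasets.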

Constructs a user-item bipartite graph for each modality

Aggregation layer: average pooling over the modality-specific features of a node's neighbors

Combination layer: combines content features + raw (ID) embedding + aggregated representation (e.g., user features + user/item aggregated representation + user embedding) |
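The three steps in the Summary cell (per-modality bipartite graph, aggregation, combination) can be sketched on toy data as below. This is a minimal illustration for one modality, not the authors' implementation: the function names (`build_bipartite_graph`, `mean_aggregate`, `combine`), the toy IDs, and the feature values are all hypothetical, and the unweighted element-wise sum in `combine` is an assumption standing in for whatever learned transformation the model actually applies.

```python
from collections import defaultdict


def build_bipartite_graph(interactions):
    """Map each user to the items they interacted with, and each item to its users."""
    user_to_items = defaultdict(list)
    item_to_users = defaultdict(list)
    for user, item in interactions:
        user_to_items[user].append(item)
        item_to_users[item].append(user)
    return user_to_items, item_to_users


def mean_aggregate(neighbor_ids, features):
    """Aggregation step (sketch): average-pool the neighbors' feature vectors."""
    vectors = [features[n] for n in neighbor_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def combine(content, id_embedding, aggregated):
    """Combination step (sketch): merge the three signals.

    Assumption: a plain element-wise sum; learned weights and
    nonlinearities are deliberately left out of this toy version.
    """
    return [c + e + a for c, e, a in zip(content, id_embedding, aggregated)]


# Toy click data (hypothetical): (user, micro-video) pairs.
interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i2")]
user_to_items, item_to_users = build_bipartite_graph(interactions)

# Hypothetical 2-d visual features for each item (one modality only).
visual_item_features = {"i1": [1.0, 0.0], "i2": [3.0, 2.0]}

# Hypothetical visual-modality user features and shared user ID embedding.
user_visual_features = {"u1": [0.5, 0.5]}
user_id_embedding = {"u1": [0.25, 0.5]}

# Aggregate u1's neighbors (the items u1 clicked) in the visual modality.
aggregated_u1 = mean_aggregate(user_to_items["u1"], visual_item_features)

# Combine content features, ID embedding, and aggregated representation.
combined_u1 = combine(
    user_visual_features["u1"], user_id_embedding["u1"], aggregated_u1
)

print(aggregated_u1)  # [2.0, 1.0]
print(combined_u1)    # [2.75, 2.0]
```

In a multi-modal setting the same two steps would be repeated on the acoustic and textual graphs, giving one combined representation per modality for each user and item.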