Main Summary
Instead of a single item encoder trained with a matching loss against an item text-description encoder, I implemented a cross-attention mechanism and contrastive learning to more effectively fuse the two modalities: collaborative item embeddings and metadata (text) embeddings.
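The fusion step described above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation: the dimensions, weight matrices, temperature, and the single-head attention are all assumptions made for readability. Item embeddings attend over the metadata embeddings (cross-attention), and an InfoNCE-style contrastive loss pulls each fused item representation toward its own metadata embedding while pushing it away from the other items in the batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d = 8   # embedding dimension (assumed for illustration)
n = 4   # items in a batch

item_emb = rng.normal(size=(n, d))   # collaborative item embeddings
meta_emb = rng.normal(size=(n, d))   # metadata/text embeddings

# Cross-attention: item embeddings act as queries over the metadata
# embeddings (keys/values), producing metadata-aware item representations.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = item_emb @ W_q, meta_emb @ W_k, meta_emb @ W_v
attn = softmax(Q @ K.T / np.sqrt(d))
fused = attn @ V

# Contrastive (InfoNCE) loss: the i-th fused item and the i-th metadata
# embedding form the positive pair; other items in the batch are negatives.
def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

tau = 0.07  # temperature (assumed)
logits = l2norm(fused) @ l2norm(meta_emb).T / tau
loss = -np.mean(np.log(softmax(logits)[np.arange(n), np.arange(n)]))
```

In practice the two losses (attention-based fusion trained end to end, plus the contrastive term) would be combined with a weighting hyperparameter and optimized jointly.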
Sample Inference
Prompt: [User Representation] is a user representation. This user has bought "Merrell Trail Glove Barefoot Running Shoe - Men's"[HistoryEmb],"Salomon Men's XA PRO 3D Ultra 2 Trail Running Shoe"[HistoryEmb],"Hanes Men's Tagless Boxer Briefs with Comfort Flex Waistband"[HistoryEmb],"Hurley Men's Solid Phantom Boardshort"[HistoryEmb] in the previous. Recommend one next item of clothing for this user to buy next from the following item title set, "Naturalizer Women's Bola Espadrille"[CandidateEmb],"CC Junior's Rayon Camis 2 or 4 Pack"[CandidateEmb],"Champion Men's Tech Performance Boxer Brief"[CandidateEmb], …, "Marc by Marc Jacobs Women's MMJ 122/S Resin Sunglasses"[CandidateEmb],"Calvin Klein Women's Perfectly Fit Sexy Signature Demi Bra"[CandidateEmb],"OTBT Women's El Reno Bootie"[CandidateEmb]. The recommendation is
Answer: "Champion Men's Tech Performance Boxer Brief"
LLM output: "Champion Men's Tech Performance Boxer Brief"
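The [HistoryEmb] and [CandidateEmb] markers in the prompt above are placeholders that get replaced by projected item embeddings before the sequence reaches the LLM. The sketch below shows one plausible way to splice them in; the marker-splitting logic, the toy text "embedder", and the concrete vectors are all hypothetical stand-ins for the real tokenizer, embedding table, and projection layer.

```python
import re
import numpy as np

d = 8  # LLM token-embedding dimension (assumed)

prompt = ('This user has bought "Shoe A"[HistoryEmb],"Shoe B"[HistoryEmb] '
          'previously. Recommend from "Item C"[CandidateEmb]. '
          'The recommendation is')

# Hypothetical projected item embeddings, one per marker occurrence; in the
# real pipeline these come from the fused item encoder passed through a
# projection into the LLM's token-embedding space.
history_embs = [np.zeros(d), np.ones(d)]
candidate_embs = [np.full(d, 2.0)]

def splice(prompt, history, candidates, embed_text):
    """Replace each [HistoryEmb]/[CandidateEmb] marker with the next
    corresponding item embedding; embed the surrounding text normally."""
    hist, cand = iter(history), iter(candidates)
    seq = []
    for piece in re.split(r'(\[HistoryEmb\]|\[CandidateEmb\])', prompt):
        if piece == '[HistoryEmb]':
            seq.append(next(hist))
        elif piece == '[CandidateEmb]':
            seq.append(next(cand))
        elif piece:
            seq.extend(embed_text(piece))
    return np.stack(seq)

# Toy text embedder: one constant vector per whitespace token, standing in
# for the LLM tokenizer + embedding lookup.
embed_text = lambda s: [np.full(d, 0.5) for _ in s.split()]

inputs_embeds = splice(prompt, history_embs, candidate_embs, embed_text)
```

The resulting `inputs_embeds` matrix (text-token embeddings interleaved with item embeddings, in prompt order) would then be fed to the LLM in place of ordinary token IDs.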
Presentation
What I have learned