Inspired by continuous bag of words language models, we learn high dimensional embeddings for each video in a fixed vocabulary and feed these embeddings into a feedforward neural network.
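To make the "continuous bag of words" idea concrete, here is a minimal pure-Python sketch. It assumes a randomly initialised embedding table indexed by integer video ID and a single ReLU layer standing in for the feedforward network. A variable-length "bag" of video IDs is collapsed into one fixed-size vector by averaging, just as CBOW averages context-word vectors. All function names are hypothetical, and in a real model the table and weights would be learned jointly rather than fixed.

```python
import random
from typing import List, Sequence

Vector = List[float]


def make_embedding_table(vocab_size: int, dim: int, seed: int = 0) -> List[Vector]:
    """One randomly initialised row per video ID in a fixed vocabulary."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def average_embedding(table: List[Vector], video_ids: Sequence[int]) -> Vector:
    """CBOW-style pooling: average the embeddings of a variable-length bag of IDs."""
    dim = len(table[0])
    if not video_ids:
        return [0.0] * dim  # empty history -> zero vector (our assumption)
    sums = [0.0] * dim
    for vid in video_ids:
        for j, value in enumerate(table[vid]):
            sums[j] += value
    return [s / len(video_ids) for s in sums]


def dense_relu(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """One fully connected layer: out_k = max(0, b_k + sum_j x_j * W[j][k])."""
    out = []
    for k in range(len(bias)):
        z = bias[k] + sum(x[j] * weights[j][k] for j in range(len(x)))
        out.append(max(0.0, z))
    return out


if __name__ == "__main__":
    table = make_embedding_table(vocab_size=1000, dim=4)
    pooled = average_embedding(table, [3, 17, 42])  # a user's watch history
    w = [[0.5] * 3 for _ in range(4)]               # 4 inputs -> 3 hidden units
    print(dense_relu(pooled, w, [0.0, 0.0, 0.0]))
```

Averaging keeps the network input the same size no matter how many videos the user has watched, which is what allows a plain feedforward network to consume the bag.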
During ranking, we have access to many more features describing the video and the user's relationship to the video because only a few hundred videos are being scored rather than the millions scored in candidate generation.
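More concretely, from left to right the features are:
impression video ID embedding: the embedding of the video currently being scored
watched video IDs average embedding: average pooling over the embeddings of the last N videos the user watched
language embedding: an embedding of the user's language and an embedding of the current video's language
time since last watch: the time since the user last watched a video from the same channel
#previous impressions: the number of times this video has already been shown to this user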
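As a rough illustration of how these five inputs end up side by side, the sketch below concatenates them into one flat vector for a single (user, impression video) pair. The dictionaries `video_emb` and `lang_emb`, the default `last_n=50`, and the `log1p` squashing of the two scalar features are all our assumptions for illustration, not details taken from the text above.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector], dim: int) -> Vector:
    """Average pooling; an empty history yields a zero vector (assumption)."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def build_ranking_features(
    video_emb: Dict[str, Vector],   # hypothetical video ID -> embedding lookup
    lang_emb: Dict[str, Vector],    # hypothetical language -> embedding lookup
    impression_video: str,
    watched_videos: Sequence[str],  # watch history, oldest first
    user_language: str,
    video_language: str,
    seconds_since_last_watch: float,
    previous_impressions: int,
    last_n: int = 50,               # "last N videos"; the value is an assumption
) -> Vector:
    dim = len(video_emb[impression_video])
    recent = list(watched_videos)[-last_n:]
    features: Vector = []
    features += video_emb[impression_video]                        # impression video ID embedding
    features += mean_pool([video_emb[v] for v in recent], dim)     # watched video IDs average embedding
    features += lang_emb[user_language]                            # user language embedding
    features += lang_emb[video_language]                           # video language embedding
    features.append(math.log1p(seconds_since_last_watch))          # time since last watch
    features.append(math.log1p(previous_impressions))              # #previous impressions
    return features
```

The embedding parts let the network compare the candidate video with the user's history and language, while the two scalars carry the behavioral signals discussed next.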
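Of the five features above, I want to focus on the 4th and the 5th, because both do a good job of bringing observations of user behavior into the model.
The idea behind the 4th feature is: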
We observe that the most important signals are those that describe a user's previous interaction with the item itself and other similar items.
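This has a bit of an attention flavor: the time since last watch feature reflects how long it has been since the user last watched a video of the same kind. Think about it from the user's side. If we have just watched a video from the "DOTA经典回顾" ("DOTA Classic Replays") channel, we are quite likely to keep watching that channel's videos, and this feature captures exactly that behavior.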
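Both behavioral features are simple aggregates over a user's own logs. The sketch below computes them under assumed log formats: watch events as `(timestamp_seconds, channel_id)` pairs and impressions as `(timestamp_seconds, video_id)` pairs. The function names, the log layouts, and the choice to return `None` for a channel the user has never watched are all hypothetical. How a real system encodes that "never watched" case is not specified here.

```python
from typing import Iterable, Optional, Tuple

WatchEvent = Tuple[float, str]       # (timestamp_seconds, channel_id), assumed layout
ImpressionEvent = Tuple[float, str]  # (timestamp_seconds, video_id), assumed layout


def time_since_last_watch(
    watch_log: Iterable[WatchEvent], channel_id: str, now: float
) -> Optional[float]:
    """Seconds since the user's most recent watch from `channel_id` at or
    before `now`; None if the user has never watched that channel."""
    last: Optional[float] = None
    for ts, ch in watch_log:
        if ch == channel_id and ts <= now and (last is None or ts > last):
            last = ts
    return None if last is None else now - last


def previous_impressions(
    impression_log: Iterable[ImpressionEvent], video_id: str, now: float
) -> int:
    """How many times `video_id` was shown to this user strictly before `now`."""
    return sum(1 for ts, vid in impression_log if vid == video_id and ts < now)


if __name__ == "__main__":
    watches = [(100.0, "dota_classics"), (250.0, "cooking"), (400.0, "dota_classics")]
    print(time_since_last_watch(watches, "dota_classics", now=1000.0))  # 600.0
```

A small value for a candidate's channel signals that the user is in the middle of a session on that channel, which is the behavior described above. A growing impression count for a video the user keeps skipping is the kind of signal the 5th feature can pick up.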