Black Swans


8 points

I'm a compsci undergraduate hoping to continue learning about machine learning! I'm also a web dev 💻


That sounds like a great idea! Any idea whereabouts it would be hosted?

Hey Fernan, thanks for your reply! I agree, I sometimes wonder whether I don't understand concepts because I'm still a novice, because the paper is lacking detail, or a mixture of both! 😂

I've been thinking about this some more, and I think I'm starting to make sense of it. Considering the aim of the method is to learn the representations of the paragraphs, I presume there must be one or more initial layers that, given some naïve representation of paragraphs & words (perhaps randomised or one-hot encoded?), produce a denser representation. During training or inference, I'm thinking that the error back-propagates to these initial layers, which results in learning the vector representations of the paragraphs & words.
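To convince myself of the one-hot idea, here's a tiny NumPy sketch. It's only an illustration under my own assumptions (the sizes, the matrix name `W` and the stand-in gradient `g` are all made up, none of it is from the paper): multiplying a one-hot vector by a weight matrix is the same as picking out one row, and the gradient that flows back only touches that row.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3                   # hypothetical sizes, for illustration only
W = rng.normal(size=(vocab_size, dim))   # hypothetical embedding matrix (one row per word)

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# One-hot times matrix is the same as looking up a single row.
dense_via_matmul = one_hot @ W
dense_via_lookup = W[word_id]
same = np.allclose(dense_via_matmul, dense_via_lookup)

# If g is the gradient of the loss w.r.t. the dense vector, the gradient
# w.r.t. W is outer(one_hot, g): zero everywhere except the looked-up row.
g = np.ones(dim)                         # stand-in upstream gradient
grad_W = np.outer(one_hot, g)
rows_updated = np.nonzero(grad_W.any(axis=1))[0].tolist()

print(same, rows_updated)
```

So "learning the representation" would just mean repeatedly nudging the rows that were looked up; the same reasoning should apply to a paragraph matrix with one row per paragraph.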

If that understanding is correct, then the inference stage makes more sense. Although the model has never seen the new paragraph before, it will have seen the words contained in it before (and have learned dense vector representations of them). The paragraph vector is initialised and the rest of the model's parameters are frozen/fixed. For the unseen paragraph "the cat sat on the mat", we can calculate the loss as we already have $w_{1}$ ("the"), $w_{2}$ ("cat") and $w_{3}$ ("sat"), as well as the target word "on". The error is back-propagated only to the initial paragraph representation layer(s), as the other layers are frozen/fixed. I presume this then happens repeatedly for each subsequent window of the paragraph, e.g. (["cat", "sat", "on" => "the"], ["sat", "on", "the" => "mat"]).
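Here's a rough NumPy sketch of how I picture that inference step. Lots of assumptions on my part: the function name `infer_paragraph_vector`, the weights `W`, `U`, `b` (random here, standing in for a trained model), averaging the paragraph vector with the context word vectors, a plain softmax output, and the learning rate are all my own choices for illustration, not necessarily what the paper does.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def infer_paragraph_vector(word_ids, W, U, b, window=3, steps=50, lr=0.1, seed=0):
    """Fit a vector d for an unseen paragraph; W, U and b are never modified."""
    rng = np.random.default_rng(seed)
    d = rng.normal(scale=0.1, size=W.shape[1])   # freshly initialised paragraph vector
    losses = []
    for _ in range(steps):
        total = 0.0
        # Slide over the paragraph: `window` context words predict the next word.
        for i in range(len(word_ids) - window):
            context = word_ids[i:i + window]
            target = word_ids[i + window]
            # Assumption: average the paragraph vector with the context word vectors.
            h = (d + W[context].sum(axis=0)) / (window + 1)
            p = softmax(U @ h + b)
            total += -np.log(p[target])          # cross-entropy for this window
            dlogits = p.copy()
            dlogits[target] -= 1.0               # gradient of the loss w.r.t. the logits
            # Back-propagate to d only; the frozen weights get no update.
            d -= lr * (U.T @ dlogits) / (window + 1)
        losses.append(total)
    return d, losses

vocab = ["the", "cat", "sat", "on", "mat"]
ids = [vocab.index(w) for w in "the cat sat on the mat".split()]

rng = np.random.default_rng(1)
W = rng.normal(size=(len(vocab), 8))   # stand-in for trained word vectors
U = rng.normal(size=(len(vocab), 8))   # stand-in for trained output weights
b = np.zeros(len(vocab))
W_before, U_before = W.copy(), U.copy()

d, losses = infer_paragraph_vector(ids, W, U, b)
print(losses[0], losses[-1])
```

The only thing that changes across iterations is `d`, so the loss falling here comes entirely from adjusting the new paragraph's vector.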

Does this sound correct? I sketched my thoughts out, if it helps at all.

I was wondering if anyone could help clear up my understanding of how the paragraph vector, D, is trained and then how paragraph vectors for unseen paragraphs are found.

My understanding is that D is optimised during training through backpropagation by calculating some loss between the true next word vector and some predicted next word vector. My confusion comes from how they would find paragraph vectors for unseen paragraphs. The paper mentions an inference stage, but I don't quite understand how the new paragraph vector would be inferred. Is it inferred by calculating the loss between the true next word vectors in the unseen paragraph and some predicted next word vectors?

Sounds great - I'd love to join