We want different query terms that express the same concept to retrieve the same results (e.g. synonyms should match the same documents). At a high level, we convert a query into a query representation with a representation function, then retrieve results from that representation.

We need some kind of representation that captures the meaning of a document.
- this doesn’t need to be reversible; we only need documents with similar content to have similar representations
Most simply, could use “bag of words” approach:
- assume all words in a document are independent of one another, i.e. the presence of one word tells us nothing about the meaning of the others (a bad assumption, but a useful simplification)
Words can be in many languages, but we generally use English. We process the words of a document as follows:
- tokenization - split the document into tokens, removing punctuation, etc.
- case folding - lowercase everything and put Unicode in canonical form (note we do lose some information, e.g. Bush vs. bush)
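The two preprocessing steps above can be sketched as follows (a minimal sketch using Python's standard library; real tokenizers handle many more cases):

```python
import re
import unicodedata

def tokenize(text):
    # crude tokenizer: keep runs of word characters, dropping punctuation
    return re.findall(r"\w+", text)

def case_fold(tokens):
    # normalize Unicode to canonical form (NFC) and lowercase everything;
    # note this loses the Bush vs. bush distinction
    return [unicodedata.normalize("NFC", t).lower() for t in tokens]

print(case_fold(tokenize("Bush visited Washington, D.C.!")))
# ['bush', 'visited', 'washington', 'd', 'c']
```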
One possible approach is creating an embedding (representation) of a document.
- take high dimensional object (text document) and embed in lower-dimensional plane, e.g. embed sentence in $\mathbb R^{768}$
- distance between embeddings is typically measured with cosine distance
- goal: the distance between two embeddings should mirror the semantic similarity of the texts they represent (similar meaning should give a small distance)
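Cosine similarity between two embedding vectors can be computed as below (a sketch with toy 3-dimensional vectors standing in for the $\mathbb R^{768}$ embeddings above):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); 1 means same direction, 0 means orthogonal
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]))  # 1.0 (identical)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```

Cosine *distance* is then simply $1 - \text{cosine similarity}$.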
Alternatively, we use a bag of words (i.e. word counts).
- each row is a “posting”
- we have an index mapping - each cell indicates whether word $i$ appears in a given document
- storing word positions is also possible - this allows searches for phrases
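One way to store positions is a positional posting list per word, which makes phrase search a check for adjacent positions (a minimal sketch of the idea; the function names are illustrative):

```python
from collections import defaultdict

def positional_postings(tokens):
    # map each word to the sorted list of positions where it occurs
    postings = defaultdict(list)
    for i, word in enumerate(tokens):
        postings[word].append(i)
    return dict(postings)

def phrase_match(postings, w1, w2):
    # the phrase "w1 w2" occurs iff w2 appears at some position p+1
    # where w1 appears at position p
    return any(p + 1 in postings.get(w2, []) for p in postings.get(w1, []))

doc = "to be or not to be".split()
postings = positional_postings(doc)
print(postings)                           # {'to': [0, 4], 'be': [1, 5], 'or': [2], 'not': [3]}
print(phrase_match(postings, "not", "to"))  # True
print(phrase_match(postings, "be", "not"))  # False
```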

Inverted index - maps terms to the documents that contain them
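A minimal sketch of an inverted index (the example documents are made up for illustration):

```python
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# build the index: term -> set of doc ids containing that term
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# a conjunctive query is a set intersection over posting sets
print(sorted(index["home"] & index["july"]))  # [2, 3]
```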