We want different query terms that express the same concept to retrieve the same results (e.g. synonyms should match the same documents). At a high level, we convert a query into a query representation with a representation function, then retrieve results from that representation.

We need some kind of representation that captures the meaning of a document.
- this doesn’t need to be reversible; we only need documents with similar content to have similar representations
Most simply, could use “bag of words” approach:
- assume all words in a document are independent of one another, i.e. the presence of one word tells us nothing about the meaning of the others (a bad assumption, but a useful simplification)
Words can be in many languages, but we generally use English. We process the words of a document as follows:
- tokenization - split the document into tokens, removing punctuation, etc.
- case folding - lowercase everything and put Unicode in canonical form (note we do lose some information, e.g. Bush vs. bush)
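The two preprocessing steps above can be sketched as follows (a minimal sketch using Python's standard library; real tokenizers handle many more cases):

```python
import re
import unicodedata

def tokenize(text):
    # crude tokenizer: keep runs of word characters, dropping punctuation
    return re.findall(r"\w+", text)

def case_fold(tokens):
    # normalize Unicode to canonical form (NFC) and lowercase everything;
    # note this loses the Bush vs. bush distinction
    return [unicodedata.normalize("NFC", t).lower() for t in tokens]

print(case_fold(tokenize("Bush visited Washington, D.C.!")))
# ['bush', 'visited', 'washington', 'd', 'c']
```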
One possible approach is creating an embedding (representation) of a document.
- take high dimensional object (text document) and embed in lower-dimensional plane, e.g. embed sentence in $\mathbb R^{768}$
- distance between embeddings is typically measured with cosine distance
- goal: the distance between two embeddings should mirror the semantic similarity of the texts they represent (similar meaning should give a small distance)
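Cosine similarity between two embedding vectors can be computed as below (a sketch with toy 3-dimensional vectors standing in for the $\mathbb R^{768}$ embeddings above):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); 1 means same direction, 0 means orthogonal
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]))  # 1.0 (identical)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```

Cosine *distance* is then simply $1 - \text{cosine similarity}$.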
Alternatively, we use a bag of words (i.e. word counts).
- each row is a “posting”
- we have an index mapping - each cell indicates whether word $i$ appears in a given document
- storing word positions is also possible - this allows searches for phrases
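One way to store positions is a positional posting list per word, which makes phrase search a check for adjacent positions (a minimal sketch of the idea; the function names are illustrative):

```python
from collections import defaultdict

def positional_postings(tokens):
    # map each word to the sorted list of positions where it occurs
    postings = defaultdict(list)
    for i, word in enumerate(tokens):
        postings[word].append(i)
    return dict(postings)

def phrase_match(postings, w1, w2):
    # the phrase "w1 w2" occurs iff w2 appears at some position p+1
    # where w1 appears at position p
    return any(p + 1 in postings.get(w2, []) for p in postings.get(w1, []))

doc = "to be or not to be".split()
postings = positional_postings(doc)
print(postings)                           # {'to': [0, 4], 'be': [1, 5], 'or': [2], 'not': [3]}
print(phrase_match(postings, "not", "to"))  # True
print(phrase_match(postings, "be", "not"))  # False
```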

Inverted index - maps terms to the documents that contain them
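A minimal sketch of an inverted index (the example documents are made up for illustration):

```python
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# build the index: term -> set of doc ids containing that term
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# a conjunctive query is a set intersection over posting sets
print(sorted(index["home"] & index["july"]))  # [2, 3]
```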