@Datta Nimmaturi @Gavrish Prabhu. 9 Mar 2024

Previous Weekly

Microsoft Expands AI Portfolio with Mistral AI Deal:

Microsoft invested €15 million ($16.3 million) in Mistral AI and announced it would bring Mistral’s newest model, Mistral Large, to Azure. The investment in Mistral, primarily a developer of open-source AI models, doesn’t mean Microsoft has lost faith in its first AI bet, OpenAI. Instead, Microsoft is laying the groundwork to turn Azure into a model garden and give itself a foothold in Europe.

Elon Musk sues OpenAI:

Elon Musk is suing OpenAI and its CEO Sam Altman, saying the company behind ChatGPT has diverged from its original, nonprofit mission by taking roughly $13 billion from Microsoft in a partnership and keeping the code for its newest generative AI products secret.

OpenAI was founded as a check on what the founders believed was a serious threat that artificial general intelligence, or AGI, posed to humanity. The company created a board of overseers to review any product it built, and its products’ code was made public.

When it comes to legal disputes, Elon Musk’s definition of victory may not always be winning in court.

BitNet b1.58: The Era of 1-bit LLMs:

This builds upon BitNet from Microsoft. While BitNet uses binary weights {-1, 1} (1 bit per weight), BitNet b1.58 uses ternary weights {-1, 0, 1} (about 1.58 bits per weight, hence the name). The biggest advantage is that restricting weights to this set removes the need for expensive matrix multiplications: any product $W*x$, where $W$ is the weight matrix and $x$ is the input, can be rewritten as sums and differences of the elements of $x$. For example,

$$ W*x = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & 1 & -1 \end{bmatrix} * \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} x_1 \\ -x_1+x_2 \\ x_2-x_3 \end{bmatrix} $$
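A tiny sketch of the same idea in plain NumPy, purely for illustration (this is not the paper’s kernel, and `ternary_matvec` is a made-up helper name): a ternary mat-vec reduces to additions and subtractions only.

```python
import numpy as np

def ternary_matvec(W, x):
    """Multiply a ternary weight matrix W (entries in {-1, 0, 1}) by a vector x
    using only additions and subtractions -- no multiplications."""
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                out[i] += x[j]
            elif W[i, j] == -1:
                out[i] -= x[j]
            # W[i, j] == 0 contributes nothing
    return out

# The 3x3 example from above
W = np.array([[1, 0, 0],
              [-1, 1, 0],
              [0, 1, -1]])
x = np.array([2.0, 5.0, 3.0])
print(ternary_matvec(W, x))   # [2. 3. 2.]
print(W @ x)                  # same result via an ordinary matmul
```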

And this is what the quantisation function looks like. The $\epsilon$ here is a small constant to prevent overflow and underflow during clipping. Essentially, the weight matrix is scaled by its mean absolute value and each entry is rounded to the nearest value in {-1, 0, 1}.

[Image: the absmean quantisation function from the paper]
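For reference, a rough restatement of that absmean quantisation function, written from memory of the paper (so treat the exact form as approximate):

$$ \widetilde{W} = \operatorname{RoundClip}\!\left(\frac{W}{\gamma + \epsilon},\ -1,\ 1\right), \qquad \operatorname{RoundClip}(x, a, b) = \max\!\bigl(a, \min(b, \operatorname{round}(x))\bigr), \qquad \gamma = \frac{1}{nm}\sum_{ij}\lvert W_{ij}\rvert $$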

This provides a significant speed-up at inference: up to ~10x higher throughput (it can fit a larger batch size) and ~4x lower latency compared to full/half-precision (FP32, FP16, BF16) baselines. All of this is achieved while staying on par with or better than the usual models (the authors trained LLaMA-style models of different sizes and also compared against StableLM 3B).

While a binary or ternary weight may still occupy a full byte in a naive representation, the biggest advantage is in compute cost, since multiplications are replaced by additions and subtractions. And if dedicated hardware is built for this, it can be even more efficient.
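To make the storage side of that concrete, here is a naive illustration (not the paper’s format; `pack_ternary`/`unpack_ternary` are made-up helpers) of packing four ternary weights into one byte, which is where the gap between “1 byte per weight” and ~2 bits per weight would be closed with proper kernel or hardware support:

```python
import numpy as np

def pack_ternary(weights):
    """Pack ternary weights {-1, 0, 1} into 2 bits each (4 weights per byte)."""
    codes = (np.asarray(weights, dtype=np.int8) + 1).astype(np.uint8)  # map {-1,0,1} -> {0,1,2}
    pad = (-len(codes)) % 4                                            # pad to a multiple of 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)])
    c = codes.reshape(-1, 4)
    return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed, n):
    """Inverse of pack_ternary: recover the first n ternary weights."""
    p = np.asarray(packed, dtype=np.uint8)
    codes = np.stack([p & 3, (p >> 2) & 3, (p >> 4) & 3, (p >> 6) & 3], axis=1).reshape(-1)
    return codes[:n].astype(np.int8) - 1

w = np.array([-1, 0, 1, 1, 0, -1], dtype=np.int8)
packed = pack_ternary(w)
print(len(w), "weights ->", packed.nbytes, "bytes")   # 6 weights -> 2 bytes
print(unpack_ternary(packed, len(w)))                 # [-1  0  1  1  0 -1]
```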
