Executive Summary

I have discovered how to make neural networks perform complex reasoning in constant time, regardless of input complexity. The Direct Semantic Reasoning Unit (DSRU) processes entire thoughts, concepts, and tasks in a single forward pass—no tokens, no attention, no scaling bottlenecks.

This is achieved by direct semantic vector to semantic vector transformation, trained through supervised learning on various reasoning and instruct tasks, among others. The results were noteworthy:

A 93x throughput advantage over Zephyr 7B
A 19x latency advantage over Zephyr 7B
Identical accuracy running in batched mode to Zephyr 7B (77.7%) - unbatched accuracy for Zephyr was 80%
Core model inference time of ~1ms, end to end weighted GPU inference time per example across the full funnel (encoding, DSRU reasoning, decoding) of ~1.3ms

Reasoning From First Principles:

‭I was working on a project to provide AI matchmaking tools that could provide rapid judgments about compatibility across hundreds of stored insights about a given user, but without relying on simple semantic similarity - I wanted these ratings to be grounded in a reasoned judgement about what actually matters to compatibility. LLMs could do the job, but were unfortunately cost-prohibitive and time prohibitive. In an attempt to achieve faster and cheaper task-specific inference for a project I was work, I began to evaluate existing ML‬‭architectures and tools and asked myself, "Can the elements of this be reconfigured in a way‬‭that enables a promptable, smart classifier?"‬

‭I ruled out many potential options (couldn't use softmax or attention because of scaling costs),‬‭but one that stuck with me was the idea of a direct vector to vector transformation. This was my‬‭ reasoning:‬

Neural Nets are universal function approximators
Semantic embeddings are just vectors
Vector transformations can generally be achieved with functions
In principle, a neural network of sufficient depth and expressivity should be able to represent any such transformation or set thereof, provided there is enough information to enable the desired transformation

The core remaining question, from my perspective, was whether or not there was enough information in the semantic embedding to make reasoning or at least task completion possible. This went against the conventional wisdom of the field (not something I knew at the time), but the experiment was cheap and accessible - within a few days of experimentation, I found an approach that produced significant convergence on the NIV2 dataset, with validation accuracy steadily climbing. It plateaued earlier than one would need from a practical tool, but it showed clear gain in signal, even in a fairly naive attempt at implementation and a small model. From there, it was simply a matter of experimentation and engineering.

I've since come to believe that this is possible because modern semantic embeddings live in a sort of "inherently attended space". Semantic vector embeddings are essentially just the series of weighted sums that is output from the attention layer as an input to the hidden layers, with a set of transformations applied. This includes the terminal output of a semantic vector. If this is the case, applying attention to them isn't strictly necessary - they are the product of attention, albeit intentionally altered and compressed.

However, this is true of any new layer in a NN – it is a transformed (and potentially compressed or expanded) representation of what came before it – and for that reason, semantic embeddings are no different from a compressed version of the layer that came before it - projecting down to 1024 dimensions in the final layer is no different from projecting down to 1024 dimensions on any intermediate layer.

The unique part of semantic embeddings, however, is what they target as their outputs - in particular in the case of bge-large, a representation of meaning which is normalized to a unit hypersphere, making it trivial to compare to other meanings. However, it appears that despite being induced to conform to this format, it still has sufficient artifacts of attention to behave as an attended value.

This also implies that any neural network can follow a similar pattern, until it can't - outputting even one layer early as a vector and examining its output essentially gives you a "semantic embedding" in the sense that’s a vector which has some sort of logical or linguistic meaning. What makes the modern semantic embeddings from a trained embedding engine useful is that unique ability to make simple mathematical comparisons to know their degree of relatedness in meaning.

This allows a primitive form of observability, even without advanced or computationally expensive techniques like Vec2Text – ‘guess and check’ works. While not sophisticated, guess and check is both fast and useful, when applied cleverly. It’s entirely possible that we can create more sophisticated guess and check systems that allow for quick conclusion about what is in an embedding by performing a NNS against a set of summaries of various topics, precise enough to hone in on a meaning, but vague enough to allow for flexibility in interpretation. This cannot tell us what the embedding says…but it can tell us what it is about.

The DSRU is, essentially, the first natively attended model architecture. It doesn’t provide its own attention; it borrows it from upstream elements in the system it is integrated with. Given the quadratic complexity of attention being applied on every forward pass, and the typical limitation of the transformer architecture to token-by-token operation, this essentially provides two dimensions of efficiency over a standard transformer; the elimination of the quadratic cost of attention in each forward pass, and expansion of the scope of the forward pass to include include the entire semantic embedding rather than operating over a single token.

Key Innovation: A neural architecture that performs semantic transformations in O(1) time, enabling reasoning over an entire ‘thought’ in each forward pass, rather than tokens or other linguistic subunits.

Proof of Concept: I demonstrate this capability through a promptable, intelligent classification system with inference-time configurable vocabulary that:

Achieves 93x higher throughput than a Zephyr 7B LLM achieving comparable accuracy running on the same hardware