Unique Aggregator Improvements

After receiving some comments on my PR for improving the Unique aggregator, I ran some experiments and now need to settle on a design.

Background

When using encoders, we currently have two types of aggregations:

We have to calculate the unique values of a column (to calculate encodings)
We have to calculate the topk of a column (we provide a max_categories option for OneHot and MultiHot)

Note that for each column, we need to calculate either all unique values or the topk.
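To make the two cases concrete, here is a minimal pure-Python sketch of what each one produces for a single column. The function names unique_value_indices and top_k_value_indices are hypothetical and only for illustration; the sorted index order and the tie-breaking rule are assumptions, not the encoders' actual behavior.

```python
from collections import Counter


def unique_value_indices(values):
    """Map every distinct value to an integer index (sorted order assumed)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def top_k_value_indices(values, max_categories):
    """Keep only the max_categories most frequent values, then index them."""
    counts = Counter(values)
    # Sort by descending count; break ties by value so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [value for value, _ in ranked[:max_categories]]
    return {v: i for i, v in enumerate(sorted(kept))}


colors = ["red", "blue", "red", "green", "blue", "red"]
all_indices = unique_value_indices(colors)
top2_indices = top_k_value_indices(colors, max_categories=2)
```

Only the topk case needs per-value counts; the unique case needs nothing beyond set membership.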

Currently, both are done via a single function, compute_unique_value_indices, in encoder.py.

The problems with this function:

It keeps track of the count for each unique value encountered, even when calculating unique (not topk), where this information is not necessary.
It iterates over the dataset with iter_batches, so it's fairly inefficient.
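The two problems can be illustrated with a simplified stand-in. This is not the actual compute_unique_value_indices code: the batches list only imitates what iter_batches would yield, and both helper functions are hypothetical.

```python
from collections import Counter

# Stand-in for the batches that iter_batches would yield (assumed shape:
# one dict per batch, mapping column name to a list of values).
batches = [
    {"color": ["red", "blue", "red"]},
    {"color": ["green", "blue", "red"]},
]


def unique_via_counts(batches, column):
    """Counting approach: stores one counter entry per distinct value."""
    counts = Counter()
    for batch in batches:  # sequential loop over every batch
        counts.update(batch[column])
    return counts


def unique_via_set(batches, column):
    """What the unique case actually needs: set membership only."""
    seen = set()
    for batch in batches:
        seen.update(batch[column])
    return seen


counts = unique_via_counts(batches, "color")
seen = unique_via_set(batches, "color")
```

Both helpers find the same distinct values; the counts are extra work unless topk is requested. Both also walk the batches one at a time in a single loop, which is the inefficiency noted above.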

So the idea for improvement was to replace this custom function with the Unique and ApproximateTopK aggregations.

The tricky part about the encoder implementation is that columns may be composed of lists. This is especially the case for MultiHotEncoder.
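A small pure-Python sketch of why list-valued cells need special handling. Lists are not hashable, so they cannot be put in a set directly. The two interpretations below (per-element and whole-list) are illustrative assumptions about how a list column could be treated, and both function names are hypothetical.

```python
def unique_elements(rows):
    """Treat each list cell as a bag of categories and collect its elements."""
    seen = set()
    for cell in rows:
        seen.update(cell)
    return seen


def unique_whole_lists(rows):
    """Treat each list cell as one category; tuples make the cells hashable."""
    return {tuple(cell) for cell in rows}


genres = [["comedy", "drama"], ["drama"], ["comedy", "drama"]]
per_element = unique_elements(genres)
per_list = unique_whole_lists(genres)
```

The per-element reading matches what a multi-hot encoding needs (one slot per category), while the whole-list reading treats each distinct list as a single value. Which one applies has to be decided per column.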

In Ray Data, we assume: