After receiving some comments on my PR for improving the Unique aggregator, I ran some experiments and now need to settle on a design.

Background

  1. There is a broader effort to move data preprocessors to Aggregate V2.
  2. I moved the RobustScaler preprocessor to ApproximateQuantile (pr).
  3. I was looking to do the same for our Encoders (OneHot, MultiHot, etc.).

Current Encoder Implementation

When using encoders, we currently have two types of aggregations:

  1. We have to calculate the unique values of a column (to build the encodings).
  2. We have to calculate the top-k values of a column (we provide a max_categories option for OneHot and MultiHot).

Note that for each column, we need to calculate either all unique values or the topk.
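To make the two aggregation types concrete, here is a minimal pure-Python sketch of the per-column choice. It is not the actual encoder.py code and does not use the Ray Data aggregators; the helper names (unique_values, top_k_values, category_index) are hypothetical, and sorting the unique values to get stable indices is an assumption made for illustration.

```python
from collections import Counter


def unique_values(column):
    """Return the distinct values of a column, sorted for stable indices."""
    return sorted(set(column))


def top_k_values(column, k):
    """Return the k most frequent values, most frequent first."""
    return [value for value, _ in Counter(column).most_common(k)]


def category_index(column, max_categories=None):
    """Map each kept category to an integer index.

    Without max_categories every unique value is kept; with it, only
    the max_categories most frequent values are kept.
    """
    if max_categories is None:
        values = unique_values(column)
    else:
        values = top_k_values(column, max_categories)
    return {value: i for i, value in enumerate(values)}
```

For a given column only one of the two branches runs, which matches the note above: each column needs either all unique values or the top-k, never both.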

Currently, both are done by a single function, compute_unique_value_indices, in encoder.py.

The problems with this function:

So the idea for improvement was to replace this custom function with the Unique and ApproximateTopK aggregations.

Tricky Part About Encoders

The tricky part about encoder implementation is that columns may be composed of lists. This is especially the case for MultiHotEncoder.
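A small pure-Python sketch of why list-valued columns complicate things: before counting categories, each cell has to be flattened into its elements. The helper names (iter_categories, count_categories) are hypothetical, and counting every occurrence inside a list (rather than once per row) is an assumption made for illustration, not a statement of how the encoders behave.

```python
from collections import Counter


def iter_categories(column):
    """Yield individual categories, flattening list-valued cells."""
    for cell in column:
        if isinstance(cell, (list, tuple)):
            # A list cell contributes each of its elements.
            yield from cell
        else:
            # A scalar cell contributes itself.
            yield cell


def count_categories(column):
    """Count how often each category appears across all cells."""
    return Counter(iter_categories(column))
```

With this flattening step, both the unique values and the top-k values can be derived from the same counts, regardless of whether a cell holds a scalar or a list.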

In Ray Data, we assume: