After receiving some comments on my PR for improving the Unique aggregator, I ran some experiments and now need to settle on a design.
RobustScaler preprocessor to ApproximateQuantile (pr).
OneHot, MultiHot, etc)
When using encoders, we currently have two types of aggregations:
max_categories options for OneHot and MultiHot)
Note that for each column, we need to calculate either all unique values or the top-k values.
Currently, both are computed by a single function, compute_unique_value_indices, in encoder.py.
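As a rough illustration of the two per-column computations, here is a minimal pure-Python sketch. It is not the code in encoder.py: the helper names `unique_value_indices` and `top_k_value_indices` are hypothetical, and mapping values to indices in sorted order is an assumption about what an encoder would want.

```python
from collections import Counter


def unique_value_indices(values):
    """Map every distinct value to an integer index, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def top_k_value_indices(values, k):
    """Keep only the k most frequent values, then index them in sorted order."""
    counts = Counter(values)
    # Sort by descending count; break ties by value so the result is deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    kept = sorted(v for v, _ in top)
    return {v: i for i, v in enumerate(kept)}


column = ["red", "blue", "red", "green", "red", "blue"]
all_indices = unique_value_indices(column)      # every category gets an index
top2_indices = top_k_value_indices(column, k=2)  # only the 2 most frequent do
```

The top-k variant is the one that corresponds to capping the number of categories; an exact count like this needs the full column, which is why an approximate aggregation is attractive at scale.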
The problems with this function:
iter_batches, so it's fairly inefficient.
So, the idea for improvement was to replace this custom function with the Unique and ApproximateTopK aggregations.
The tricky part of the encoder implementation is that columns may contain lists. This is especially the case for MultiHotEncoder.
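To make the list case concrete, the sketch below contrasts two ways of counting values in a list-valued column: treating each list as a single value, or counting the elements inside the lists. This is plain Python with a hypothetical helper, `count_values`, not Ray Data code, and it does not claim which behavior each encoder should use.

```python
from collections import Counter


def count_values(column, explode_lists):
    """Count values in a column whose cells may be scalars or lists.

    explode_lists=True  -> count each element inside a list cell.
    explode_lists=False -> count each list cell as one (hashable) value.
    """
    counts = Counter()
    for cell in column:
        if isinstance(cell, list):
            if explode_lists:
                counts.update(cell)
            else:
                # Lists are unhashable, so use a tuple as the key.
                counts[tuple(cell)] += 1
        else:
            counts[cell] += 1
    return counts


column = [["a", "b"], ["a"], ["a", "b"]]
by_element = count_values(column, explode_lists=True)
by_list = count_values(column, explode_lists=False)
```

The two modes give different category sets for the same data (elements `a` and `b` versus the distinct lists), so whichever aggregation replaces the custom function has to make this choice explicit.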
In Ray Data, we assume: