A demo of text data filtering, with a visualization of the filtering, that was part of the preparation of the main BigScience training dataset.
The filtering pipeline consists of two steps:
The demo and more detailed information on the filtering pipeline can be found here. The code is available here and most methods are defined in this file.
Note: This demo can be a little slow, and it is limited to 5000 documents to keep the speed acceptable. To display up to three times more documents with a faster visualization, we invite you to run this code on your own computer.