A demo of the text data filtering, and of the filtering visualization, used to prepare the main BigScience training dataset.

The filtering pipeline consists of two steps:

  1. Each document is modified to remove excessively long or incorrect words.
  2. Filtering is then applied to the modified documents to decide whether each one is kept.
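As a rough illustration of the two steps above, here is a minimal sketch. It is not the project's actual code: the function names, the thresholds `MAX_WORD_LENGTH` and `MIN_WORDS`, and the keep/drop criterion are all assumptions made for this example. It handles only overly long words, since the criteria for "incorrect" words are not described here.

```python
# Hypothetical thresholds, chosen only for illustration.
MAX_WORD_LENGTH = 25
MIN_WORDS = 5


def clean_document(text, max_word_length=MAX_WORD_LENGTH):
    """Step 1: modify the document by dropping excessively long words."""
    words = text.split()
    kept = [word for word in words if len(word) <= max_word_length]
    return " ".join(kept)


def keep_document(text, min_words=MIN_WORDS):
    """Step 2: decide whether the modified document is kept.

    The criterion used here (a minimum word count) is an assumption.
    """
    return len(text.split()) >= min_words


def filter_documents(documents):
    """Apply step 1 to every document, then keep those that pass step 2."""
    cleaned = (clean_document(doc) for doc in documents)
    return [doc for doc in cleaned if keep_document(doc)]


if __name__ == "__main__":
    docs = [
        "short text",
        "this is a perfectly fine document " + "x" * 40 + " with enough words",
    ]
    for doc in filter_documents(docs):
        print(doc)
```

The point of the split is that the keep/drop decision runs on the cleaned text, so a document is judged after its overly long words have been removed rather than before.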

The demo and more detailed information on the filtering pipeline can be found here. The code is available here and most methods are defined in this file.

Note: This demo can be a little slow, and to keep the speed acceptable it only lets you process up to 5000 documents. If you want to display up to three times more documents and have a faster visualization, we invite you to run this code on your computer.
