Filtering방법론: 정규표현식을 이용하여, 필터링
이전 EDA랑 동일
original data
https://huggingface.co/datasets/allenai/tulu-3-sft-mixture
filtered data
https://huggingface.co/datasets/aeolian83/allenai-tulu-3-sft-mixture_filtered
데이터 형태
token_count
count 377,652 mean 351 std 790 min 2 25% 121 50% 272 75% 446 max 110,503 sum 132,459,755

original data
https://huggingface.co/datasets/allenai/llama-3.1-tulu-3-405b-preference-mixture