We currently use ChatGPT to generate training data with two different prompts.
The first is very simple; the second lists all possible inflected forms of the word. For example:
“Please provide 30 sentences in German using word grundsätzlich”
“Please provide 30 sentences in German using words: grundsätzlich, grundsätzliche, grundsätzlicher, grundsätzlichem, grundsätzlichen in different context and in different place inside of the sentence”
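The two prompts above follow a fixed pattern, so they could be generated from a word list rather than typed by hand. The following is a minimal sketch under that assumption: the function names `simple_prompt` and `inflected_prompt` are hypothetical, the wording is copied from the examples above, and the step that sends the prompt to ChatGPT is deliberately left out.

```python
from typing import Sequence


def simple_prompt(word: str, language: str = "German", n: int = 30) -> str:
    """Build the simple prompt: one base form of the word only."""
    return f"Please provide {n} sentences in {language} using word {word}"


def inflected_prompt(forms: Sequence[str], language: str = "German", n: int = 30) -> str:
    """Build the detailed prompt: every inflected form, varied context and position."""
    joined = ", ".join(forms)
    return (
        f"Please provide {n} sentences in {language} using words: {joined} "
        "in different context and in different place inside of the sentence"
    )


if __name__ == "__main__":
    # The inflected forms are supplied by hand here; nothing is derived automatically.
    forms = [
        "grundsätzlich",
        "grundsätzliche",
        "grundsätzlicher",
        "grundsätzlichem",
        "grundsätzlichen",
    ]
    print(simple_prompt(forms[0]))
    print(inflected_prompt(forms))
```

Keeping the prompt text in one place makes it easier to regenerate data for a new word with the same wording as earlier batches.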
All data (labeled and unlabeled) can be found here:
English: https://drive.google.com/drive/folders/1KkM-lvtcGZn2p_RQSaCixNTi-aEWPbec?usp=sharing
German: https://drive.google.com/drive/folders/1T8qwsKcfyjY-8uOJ4U7uEHX3R7BvMOhL?usp=sharing
An expert (Tracey) needs to label the data for the classifier.
Data labeled so far for context-aware false positives in the ML model:
English sentences, list of words: dynamic, impact, fossil, flexible, best, retard, brilliant, retarded, alone
German sentences, list of words: unabhängig, entschieden
https://www.notion.so/witty-works/Data-flagging-for-Elena-2nd-group-green-highlight-cc492c702c574521902ba394176585ef?pvs=4