Philipp Singer, #1 Kaggler - talk published on the YouTube MLT channel
https://www.youtube.com/watch?v=OenmJTdF0-M
focus: how Philipp started his journey and what he learned along the way
-
Facebook recruiting competition
- first competition 9 years ago
- only imported scikit-learn, did one or two submissions
- stopped kaggling for 7 years! (Kaggle seemed very overwhelming)
-
Quora insincere question challenge
- elegant solution - simple solutions (great for learning) can be really, really good (see the illustrative baseline after this section)
- entered the competition to learn about the state of the art, deep learning - ended up winning
- Learnings:
- Kaggle is fun!
- I can learn a ton of new stuff and apply it to real datasets and problems
- Community is great - you learn a lot from others
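To make the "simple can be strong" point concrete, here is a minimal, hypothetical baseline for a text-classification problem like Quora insincere questions: TF-IDF features plus logistic regression. This is not Philipp's winning solution (that used deep learning); it just sketches the kind of small, clean starting point the talk praises. The toy texts and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy data standing in for question text and an "insincere" flag (1 = insincere).
texts = [
    "why is the sky blue",
    "are you people really this stupid",
    "how do airplanes stay in the air",
    "what kind of idiot believes this",
]
labels = [0, 1, 0, 1]

# Small, transparent pipeline: word/bigram TF-IDF features + logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)

# Quick cross-validated score just to show the evaluation loop end to end.
print(cross_val_score(model, texts, labels, cv=2, scoring="f1").mean())
```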
-
Next steps - multiple competitions, looking at new and different types of problems (problems you have little experience with are great learning grounds and a great test of your machine learning skills!)
- many problems are interconnected - tabular, vision, nlp, experience and skills carry across
- teaming up is great
- you can learn from others
- it is more fun
- you make connections
- finding good team mates is hard, especially before you create a name for yourself
- requires commitment
- requires a lot of communication and honesty
- Philipp prefers early teaming (vs late teaming) - you can learn more this way (late teaming is common but is mostly used for blending solutions, to increase position by a tiny bit)
-
NFL Big Data Bowl
- model was put into production on live national TV! 🥳
- "simple solutions can be beautiful"
- "you need to spend time and many iterations of experiements are important"
- sometimes you have to take a step back to move forward! (explore many dead ends)
- "Building complex models is easy, building simple models is hard"
-
Two NLP competitions Philipp joined at the same time
- pretrained models - custom tokenizer was key
- model distillation worked really well (see the sketch after this section)
- Learnings:
- going solo in a competition:
- you don't owe anyone anything; it's okay if you choose not to commit halfway through
- not having anyone to talk to is unpleasant; it also helps a lot when someone else can verify your thought process
- tokenization, preprocessing for nlp competitions can be very important
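Since the notes only mention that distillation "worked really well", here is a minimal, generic teacher-student distillation sketch in PyTorch, not Philipp's actual pipeline. The models, `TEMPERATURE`, and `ALPHA` are illustrative assumptions; the loss follows the standard softened-logits formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEMPERATURE = 2.0   # softens the teacher's output distribution
ALPHA = 0.5         # weight between distillation loss and hard-label loss

# Toy stand-ins: a larger frozen "teacher" and a smaller "student".
teacher = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 4)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distillation_step(x, y):
    with torch.no_grad():                      # teacher is frozen
        teacher_logits = teacher(x)
    student_logits = student(x)
    # KL divergence between softened distributions, scaled by T^2 as is customary.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / TEMPERATURE, dim=-1),
        F.softmax(teacher_logits / TEMPERATURE, dim=-1),
        reduction="batchmean",
    ) * TEMPERATURE ** 2
    hard_loss = F.cross_entropy(student_logits, y)
    loss = ALPHA * soft_loss + (1 - ALPHA) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: 32 examples, 128 features, 4 classes.
print(distillation_step(torch.randn(32, 128), torch.randint(0, 4, (32,))))
```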
-
Google Landmark Recognition 2020, code on GitHub (clean code, associated paper)
- Nice, clean architecture
-
Focus on the process - how to organize your work, how to work better together as a team (good software dev collaboration practices! 🙂)
- code-sharing on GitHub
- branches, pull requests
- config files (see the sketch below)
- logging is also very important! (TensorBoard, neptune.ai, Sacred, W&B, etc. - they used Neptune)
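A minimal sketch of the config-file pattern mentioned above: each experiment is described by a small YAML file that the training script loads, so runs are reproducible and diffable in git. The field names and values here are hypothetical, and PyYAML is assumed to be installed.

```python
import yaml  # PyYAML
from dataclasses import dataclass

# In practice this would live in its own file, e.g. configs/baseline.yaml.
EXAMPLE_YAML = """
experiment: baseline_effnet_b0
seed: 42
model:
  backbone: efficientnet_b0
  dropout: 0.2
train:
  lr: 0.001
  batch_size: 64
  epochs: 10
"""

@dataclass
class TrainConfig:
    lr: float
    batch_size: int
    epochs: int

cfg = yaml.safe_load(EXAMPLE_YAML)
train_cfg = TrainConfig(**cfg["train"])
print(cfg["experiment"], train_cfg)
```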
-
Rainforest connection species audio detection
- first experience with audio
- only partial labels for the train set (hard & weak label models, pseudo-tagging, large blends, per-species adjustments)
- masked labeling proved key (see the sketch after this section)
- Learnings:
- "sometimes public LB is the best validation fold" (test was so different from train, overfitting to the LB this time around made sense)
Key learnings
Highlights for me:
- validation is key (both on Kaggle and in real life) 😎 (see the sketch after this list)
- "engineering is very imoprtant and an important part of a good data scientist"
Q&A
How to choose competitions to participate in?