Philipp Singer, #1 Kaggler - talk published on the YouTube MLT channel
https://www.youtube.com/watch?v=OenmJTdF0-M
focus: how Philipp started his journey and what he learned along the way
-
Facebook recruiting competition
- first competition 9 years ago
- only imported scikit-learn, did one or two submissions
- stopped kaggling for 7 years! (Kaggle seemed very overwhelming)
-
Quora insincere question challenge
- elegant solution - simple solutions (great for learning) can be really, really good (see the illustrative baseline after this section)
- entered the competition to learn about the state of the art, deep learning - ended up winning
- Learnings:
- Kaggle is fun!
- I can learn a ton of new stuff and apply it to real datasets and problems
- Community is great - you learn a lot from others
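To make the "simple can be strong" point concrete, here is a minimal, hypothetical baseline for a text-classification problem like Quora insincere questions: TF-IDF features plus logistic regression. This is not Philipp's winning solution (that used deep learning); it just sketches the kind of small, clean starting point the talk praises. The toy texts and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy data standing in for question text and an "insincere" flag (1 = insincere).
texts = [
    "why is the sky blue",
    "are you people really this stupid",
    "how do airplanes stay in the air",
    "what kind of idiot believes this",
]
labels = [0, 1, 0, 1]

# Small, transparent pipeline: word/bigram TF-IDF features + logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)

# Quick cross-validated score just to show the evaluation loop end to end.
print(cross_val_score(model, texts, labels, cv=2, scoring="f1").mean())
```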
-
Next steps - multiple competitions, looking at new and different types of problems (problems you have little experience with are great learning grounds and a great test of your machine learning skills!)
- many problems are interconnected - tabular, vision, nlp, experience and skills carry across
- teaming up is great
- you can learn from others
- it is more fun
- you make connections
- finding good team mates is hard, especially before you create a name for yourself
- requires commitment
- requires a lot of communication and honesty
- Philipp prefers early teaming (vs late teaming) - you can learn more this way (late teaming is common but is mostly used for blending solutions, to increase position by a tiny bit)
-
NFL Big Data Bowl
- model was put into production on live national TV! 🥳
- "simple solutions can be beautiful"
- "you need to spend time and many iterations of experiements are important"
- sometimes you have to take a step back to move forward! (explore many dead ends)
- "Building complex models is easy, building simple models is hard"
-
Two NLP competitions Philipp joined at the same time
- pretrained models - custom tokenizer was key
- model distillation worked really well (see the sketch after this section)
- Learnings:
- going solo in a competition:
- you don't owe anyone anything; it's okay if you choose not to commit halfway through
- not having anyone to talk to is unpleasant; it also helps a lot when someone else can verify your thought process
- tokenization, preprocessing for nlp competitions can be very important
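Since the notes only mention that distillation "worked really well", here is a minimal, generic teacher-student distillation sketch in PyTorch, not Philipp's actual pipeline. The models, `TEMPERATURE`, and `ALPHA` are illustrative assumptions; the loss follows the standard softened-logits formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEMPERATURE = 2.0   # softens the teacher's output distribution
ALPHA = 0.5         # weight between distillation loss and hard-label loss

# Toy stand-ins: a larger frozen "teacher" and a smaller "student".
teacher = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 4)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distillation_step(x, y):
    with torch.no_grad():                      # teacher is frozen
        teacher_logits = teacher(x)
    student_logits = student(x)
    # KL divergence between softened distributions, scaled by T^2 as is customary.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / TEMPERATURE, dim=-1),
        F.softmax(teacher_logits / TEMPERATURE, dim=-1),
        reduction="batchmean",
    ) * TEMPERATURE ** 2
    hard_loss = F.cross_entropy(student_logits, y)
    loss = ALPHA * soft_loss + (1 - ALPHA) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: 32 examples, 128 features, 4 classes.
print(distillation_step(torch.randn(32, 128), torch.randint(0, 4, (32,))))
```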
-
Google Landmark Recognition 2020, code on GitHub (clean code, associated paper)
- Nice, clean architecture
-
Focus on the process - how to organize your work, how to work better together as a team (good software dev collaboration practices! 🙂)
- code-sharing on GitHub
- branches, pull requests
- config files (see the sketch below)
- logging is also very important! (TensorBoard, neptune.ai, Sacred, W&B, etc. - they used Neptune)
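A minimal sketch of the config-file pattern mentioned above: each experiment is described by a small YAML file that the training script loads, so runs are reproducible and diffable in git. The field names and values here are hypothetical, and PyYAML is assumed to be installed.

```python
import yaml  # PyYAML
from dataclasses import dataclass

# In practice this would live in its own file, e.g. configs/baseline.yaml.
EXAMPLE_YAML = """
experiment: baseline_effnet_b0
seed: 42
model:
  backbone: efficientnet_b0
  dropout: 0.2
train:
  lr: 0.001
  batch_size: 64
  epochs: 10
"""

@dataclass
class TrainConfig:
    lr: float
    batch_size: int
    epochs: int

cfg = yaml.safe_load(EXAMPLE_YAML)
train_cfg = TrainConfig(**cfg["train"])
print(cfg["experiment"], train_cfg)
```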
-
Rainforest connection species audio detection
- first experience with audio
- only partial labels for the train set (hard & weak label models, pseudo-tagging, large blends, per-species adjustments)
- masked labeling proved key (see the sketch after this section)
- Learnings:
- "sometimes public LB is the best validation fold" (test was so different from train, overfitting to the LB this time around made sense)
Key learnings
Highlights for me:
- validation is key (both on Kaggle and in real life) 😎 (see the sketch after this list)
- "engineering is very imoprtant and an important part of a good data scientist"
Q&A
How to choose competitions to participate in?