I've spent the past little while understanding a lot more about what self-driving is + how we can really commercialize and create generalizable, end2end self-driving.

Recently, I shifted from building DataGAN to focusing on a goal that I've had for quite a while now. The TL;DR of what I'm doing: I'm focusing on taking in monocular camera inputs and generating control values, i.e. steering, acceleration, and brake.

What I've realized with DataGAN is that I don't really know enough yet. I've been trying to create new approaches to self-driving without fully understanding it.

That's where end2end learning comes in. It's something I've been super fascinated by and really want to understand better. It's also the best way I can go from 0-1 right now and make legit, compounding progress.

Overall architecture + proposed approach

Figure 1. Proposed diagram of creating an end2end model

The overall architecture that I’m currently building out uses a Convolutional Neural Network to predict speed and steering from an input image [similar to what NVIDIA did]. Why? CNNs have shown state-of-the-art performance on tasks such as detection, steering, speed prediction, and a lot more. That makes them the best fit for me to predict speed + steering [i.e. to solve a regression problem].
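To make that concrete, here’s a minimal, numpy-only sketch of a PilotNet-style forward pass. The weights are random and untrained, and the layer sizes (two conv layers instead of NVIDIA’s five) are my own simplification; it only illustrates the image → [steering, speed] regression shape of the problem.

```python
# Sketch only: untrained random weights, simplified layer stack.
import numpy as np

rng = np.random.default_rng(0)

def conv2d_relu(x, w, stride):
    """Valid 2-D convolution + ReLU. x: (C_in, H, W), w: (C_out, C_in, kH, kW)."""
    c_out, c_in, kh, kw = w.shape
    _, h, wd = x.shape
    oh = (h - kh) // stride + 1
    ow = (wd - kw) // stride + 1
    out = np.zeros((c_out, oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return np.maximum(out, 0.0)

w1 = rng.normal(0, 0.1, (24, 3, 5, 5))     # 3 -> 24 channels
w2 = rng.normal(0, 0.1, (36, 24, 5, 5))    # 24 -> 36 channels
# Feature-map sizes after two stride-2 convs on a 66x200 input: 66->31->14, 200->98->47.
w_fc = rng.normal(0, 0.01, (36 * 14 * 47, 2))

def predict(img):
    """img: (3, 66, 200) frame (NVIDIA's input size). Returns [steering, speed]."""
    x = conv2d_relu(img, w1, stride=2)
    x = conv2d_relu(x, w2, stride=2)
    return x.reshape(-1) @ w_fc            # linear regression head -> 2 values

pred = predict(rng.normal(0, 1, (3, 66, 200)))
print(pred.shape)  # (2,)
```

A real implementation would of course use a deep learning framework and be trained with an MSE loss against recorded steering/speed labels; the point here is just that the whole pipeline is one differentiable function from pixels to two scalars.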

I’m also using an autoencoder as a secondary network to see whether a network can understand the current scene and predict how it would look one scene later. This is a metric I’m using to check whether the network understands the current scene and how steering + speed change its prediction for the next frame.
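As a sketch of that idea (shapes and weights here are assumed and untrained; the real network would be convolutional and trained on a reconstruction loss): encode the current frame, append the control values, and decode a guess at the next frame.

```python
# Sketch: condition a next-frame prediction on [steering, speed].
import numpy as np

rng = np.random.default_rng(0)
H, W, LATENT = 64, 64, 128

enc_w = rng.normal(0, 0.01, (H * W, LATENT))        # encoder: image -> latent
dec_w = rng.normal(0, 0.01, (LATENT + 2, H * W))    # decoder: latent + controls -> image

def predict_next_frame(frame, steering, speed):
    z = np.tanh(frame.reshape(-1) @ enc_w)          # compress the current scene
    z_ctrl = np.concatenate([z, [steering, speed]]) # condition on the controls
    return (z_ctrl @ dec_w).reshape(H, W)           # reconstruct "1 scene later"

nxt = predict_next_frame(rng.normal(size=(H, W)), steering=0.1, speed=20.0)
print(nxt.shape)  # (64, 64)
```

If the decoded frame changes sensibly as you vary steering and speed, that’s evidence the latent actually encodes scene dynamics rather than just compressing pixels.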

Problems that I’m trying to figure out:

  1. Interpretability of neural nets. How can I show that I fully understand the neural network, and whether it has an understanding of temporal information?
    1. The diagram above only shows an understanding of single-frame instances, and that’s my focus right now. But I’m currently researching how I can use ConvLSTM networks to leverage spatio-temporal understanding and see whether the model understands how current predictions impact the future.
    2. The other thing is to visualize and better understand the feature activations of a neural network’s weights. NVIDIA also released a paper about this that I’m looking into, which maps predictions back to the pixels that activated weights in the network. I’m currently researching how we can better understand the link between predictions and neural activations.
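To ground the ConvLSTM point above, here’s a single ConvLSTM cell step in numpy. The key difference from a plain LSTM is that every gate is computed with a convolution, so the hidden state keeps its spatial layout from frame to frame. Channel counts and kernel size here are illustrative assumptions, and the weights are random.

```python
# Sketch of one ConvLSTM cell step (untrained, illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
C_IN, C_HID, H, W, K = 3, 8, 16, 16, 3

def same_conv(x, w):
    """'same'-padded 2-D convolution. x: (C_in, H, W), w: (C_out, C_in, K, K)."""
    p = w.shape[-1] // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((w.shape[0], x.shape[1], x.shape[2]))
    for i in range(x.shape[1]):
        for j in range(x.shape[2]):
            out[:, i, j] = np.tensordot(w, xp[:, i:i + w.shape[2], j:j + w.shape[3]],
                                        axes=([1, 2, 3], [0, 1, 2]))
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight tensor per gate, each acting on [x; h] stacked along channels.
w_i, w_f, w_o, w_g = (rng.normal(0, 0.1, (C_HID, C_IN + C_HID, K, K)) for _ in range(4))

def convlstm_step(x, h, c):
    xh = np.concatenate([x, h], axis=0)
    i = sigmoid(same_conv(xh, w_i))   # input gate
    f = sigmoid(same_conv(xh, w_f))   # forget gate
    o = sigmoid(same_conv(xh, w_o))   # output gate
    g = np.tanh(same_conv(xh, w_g))   # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

h = c = np.zeros((C_HID, H, W))
for _ in range(4):                    # roll the cell over 4 consecutive frames
    h, c = convlstm_step(rng.normal(size=(C_IN, H, W)), h, c)
print(h.shape)  # (8, 16, 16)
```

Because the hidden state is a stack of feature maps rather than a flat vector, each spatial location accumulates history from its own neighborhood, which is exactly the spatio-temporal understanding I want to test for steering + speed.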

Here’s a demo of steering on a dataset that I’ve been playing around with:

My ask right now:

  1. My biggest ask, for anyone reading this who has experience in robotics + computer vision: let’s meet! I currently have a bunch of proposed ideas that are yet to be validated. Here’s my calendly link to book a meeting; I’d seriously love to chat and get your thoughts on what I’m building.
    1. Specifically, I’m trying to understand how ConvLSTMs and leveraging spatiotemporal data would allow for better accuracy. Does it make sense for speed and steering?
  2. Currently, I'm struggling A LOT to get the dataset up and running because of minimal compute. For example, the comma dataset is around 80GB of data, and training a single epoch (even with it broken up into 5GB chunks) is still super difficult. I'm wondering if anyone would be willing to help me out, specifically by granting access to a GPU + AWS/Azure credits. I'm currently looking for anywhere between $2,000 and $5,000 in credits to allow me to pursue and build LEGIT, high-quality projects!
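For context, here’s the kind of chunk-streaming loop I mean: only one ~5GB chunk lives in memory at a time. The file naming and the `load_chunk`/`train_step` hooks are placeholders of mine, not the comma dataset’s actual layout.

```python
# Sketch: stream a large dataset one chunk at a time.
import os
import tempfile

def iter_chunks(chunk_dir):
    """Yield chunk file paths in a stable (sorted) order, one at a time."""
    for name in sorted(os.listdir(chunk_dir)):
        if name.endswith(".npz"):
            yield os.path.join(chunk_dir, name)

def train_one_epoch(chunk_dir, train_step, load_chunk):
    """One epoch = one pass over all chunks; each chunk is freed before the next loads."""
    for path in iter_chunks(chunk_dir):
        batch = load_chunk(path)   # e.g. np.load for .npz chunks
        train_step(batch)
        del batch                  # drop this chunk before loading the next one

# Quick demo with empty placeholder files standing in for 5GB chunks.
demo_dir = tempfile.mkdtemp()
for name in ["chunk_01.npz", "chunk_00.npz", "notes.txt"]:
    open(os.path.join(demo_dir, name), "w").close()

order = [os.path.basename(p) for p in iter_chunks(demo_dir)]
print(order)  # ['chunk_00.npz', 'chunk_01.npz']
```

This keeps peak memory bounded by one chunk, but it doesn’t fix throughput; that’s where the GPU access would make the real difference.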