This assignment is meant to emulate a day’s work as a Data Scientist here at Polarr (the dataset in the assignment is public, but the type of analysis would be similar). Expect some tasks to be trivial and others to be a bit more challenging and open-ended. Feel free to use Stack Overflow, Google, and any other resources to research questions or issues you have. You are also welcome to use any programming language. The only thing we ask is that you do this work independently, without help from family or friends. Recommended time limit: 8 hours (you should not spend more time than this, but if you do, that's okay as well).

When you have completed the assignment, please name the file in accordance with this naming pattern: [YOUR FULL NAME]-Assignment for [NAME OF THE ROLE THE ASSIGNMENT IS FOR] and upload it to **this dropbox.**

Question 1: Getting the Data

Our backend engineer scraped about half a million Amazon product entries (books, CDs, DVDs, and VHS tapes) into a txt file. For each product, the following information is available:

Below is a screenshot of the first lines of this massive txt file:

# Full information about Amazon Share the Love products 
Total items: 548552

Id:   0
ASIN: 0771044445
  discontinued product

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group: Book
  salesrank: 396585
  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
  categories: 2
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
  reviews: total: 2  downloaded: 2  avg rating: 5
    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9
    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5

Download the txt file. Clean up this dataset so that it contains only book entries and is sorted by sales rank. You should get a total of 393561 books. Your output should look like the following sample:

Id, ASIN, title, salesrank
34,021231242,Editing with Polarr,1
78,321231232,Intro to Photography,2
...
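
As a starting point, the cleanup described above could be sketched in Python as follows. This is a minimal sketch, not a reference solution: it assumes every record begins with an `Id:` line and that the `ASIN`, `title`, `group`, and `salesrank` fields appear as `key: value` lines, as in the excerpt. The function names `parse_books` and `clean` and the file names in the usage comment are hypothetical.

```python
import csv

# Fields we keep from each record; everything else (similar, categories,
# reviews) is ignored.
WANTED = ("ASIN", "title", "group", "salesrank")


def _as_row(record):
    """Convert a parsed record dict into an output row (hypothetical helper)."""
    return (
        record["Id"],
        record.get("ASIN", ""),
        record.get("title", ""),
        int(record.get("salesrank", "0")),
    )


def parse_books(lines):
    """Stream over the raw lines and yield one row per record with group Book."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("Id:"):
            # A new record starts: emit the previous one if it was a book.
            if record.get("group") == "Book":
                yield _as_row(record)
            record = {"Id": line.split(":", 1)[1].strip()}
        elif record:
            # Split on the first colon only, so titles containing ':' survive.
            key, sep, value = line.partition(":")
            if sep and key in WANTED:
                record[key] = value.strip()
    # Emit the final record, which has no following "Id:" line.
    if record.get("group") == "Book":
        yield _as_row(record)


def clean(src, dst):
    """Write the books from src to dst as CSV, sorted by sales rank."""
    books = sorted(parse_books(src), key=lambda row: row[3])
    writer = csv.writer(dst)
    writer.writerow(["Id", "ASIN", "title", "salesrank"])
    writer.writerows(books)
    return len(books)


# Usage (file names are assumptions):
# with open("amazon-meta.txt", encoding="utf-8", errors="replace") as src, \
#         open("books.csv", "w", newline="", encoding="utf-8") as dst:
#     print(clean(src, dst))
```

Discontinued products have no `group` line, so they are dropped along with the non-book entries. The sort holds all book rows in memory, which is reasonable at this size but is one of the things to revisit for a much larger input.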

Explain your methodology and any code you had to write in order to accomplish this task. What would the estimated compute time be if there were 30 million book entries? How would you change your approach in that case?
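
One way to approach the compute-time question is to time your own run and extrapolate. The sketch below is only a rough model under stated assumptions: the parsing pass scales linearly with the number of entries, the sort scales as n log n, and `sort_fraction` (the share of measured time spent sorting) is a guess you would replace with your own measurement. The function name `extrapolate_seconds` is hypothetical.

```python
import math


def extrapolate_seconds(measured_seconds, measured_n, target_n, sort_fraction=0.1):
    """Scale a measured runtime from measured_n entries to target_n entries.

    Assumes a linear parsing pass plus an O(n log n) sort; sort_fraction is
    an assumed share of the measured time, not a measured value.
    """
    scale = target_n / measured_n
    # Parsing and writing: one pass over the data, so time grows linearly.
    linear_part = measured_seconds * (1 - sort_fraction) * scale
    # Sorting: grows as n log n, so apply the extra log factor.
    log_ratio = math.log2(target_n) / math.log2(measured_n)
    sort_part = measured_seconds * sort_fraction * scale * log_ratio
    return linear_part + sort_part
```

The model ignores memory limits and disk throughput, which are likely to matter more than raw CPU time once the data no longer fits comfortably in memory.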