This assignment is meant to emulate a day’s work as a Data Scientist here at Polarr (the dataset in the assignment is public, but the type of analysis would be similar). Expect some tasks to be trivial and others to be more challenging and open-ended. Feel free to use Stack Overflow, Google, and any other resource to research questions or issues you have. You are also welcome to use any programming language. The only thing we ask is that you do this work independently, without help from family or friends. Recommended time limit: 8 hours (you should not need to spend more time than this, but if you do, that's okay as well).
When you have completed the assignment, please name the file in accordance with this naming pattern: [YOUR FULL NAME]-Assignment for [NAME OF THE ROLE THE ASSIGNMENT IS FOR] and upload it to **this dropbox.**
Question 1: Getting the Data
Our backend engineer scraped about half a million Amazon entries for products like books, CDs, DVDs, and VHS tapes into a txt file. The information available for each product is visible in the first lines of this massive txt file, shown below:
    # Full information about Amazon Share the Love products
    Total items: 548552

    Id: 0
    ASIN: 0771044445
      discontinued product

    Id: 1
    ASIN: 0827229534
      title: Patterns of Preaching: A Sermon Sampler
      group: Book
      salesrank: 396585
      similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
      categories: 2
        |Books|Subjects|Religion & Spirituality|Christianity|Clergy|Preaching
        |Books|Subjects|Religion & Spirituality|Christianity|Clergy|Sermons
      reviews: total: 2 downloaded: 2 avg rating: 5
        2000-7-28 cutomer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful: 9
        2003-12-14 cutomer: A2VE83MZF98ITY rating: 5 votes: 6 helpful: 5
Download the txt file. Clean up this dataset so that it contains only book entries and is ordered by sales rank. You should get a total of 393561 books. Your output should look like the following sample:
    Id, ASIN, title, salesrank
    34,021231242,Editing with Polarr,1
    78,321231232,Intro to Photography,2
    ...
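For orientation only, here is a minimal Python sketch of one way the cleanup could be approached. It is not a reference solution. It assumes every record starts with an `Id:` line, that the fields of interest appear as `key: value` lines as in the excerpt above, and that book entries are exactly those whose `group` is `Book`. The function names `iter_records`, `books_by_salesrank`, and `write_csv` are hypothetical.

```python
import csv

# Fields needed for the output; every other line (reviews, categories, ...) is skipped.
WANTED = {"Id", "ASIN", "title", "group", "salesrank"}

def iter_records(lines):
    """Yield one dict per product, reading the file line by line."""
    record = {}
    for raw in lines:
        # Split on the first colon only, so titles containing ':' stay intact.
        key, sep, value = raw.strip().partition(":")
        if not sep or key not in WANTED:
            continue
        if key == "Id":
            # Assumption: an "Id:" line always opens a new record.
            if record:
                yield record
            record = {}
        record[key] = value.strip()
    if record:
        yield record

def books_by_salesrank(lines):
    """Return (Id, ASIN, title, salesrank) tuples for books, sorted by sales rank."""
    books = [
        r for r in iter_records(lines)
        # Discontinued products have no group and are dropped here;
        # books lacking a title or salesrank are dropped as well.
        if r.get("group") == "Book" and "salesrank" in r and "title" in r
    ]
    books.sort(key=lambda r: int(r["salesrank"]))
    return [(int(r["Id"]), r["ASIN"], r["title"], int(r["salesrank"])) for r in books]

def write_csv(rows, out):
    """Write the rows with a header to any text file-like object."""
    writer = csv.writer(out)
    writer.writerow(["Id", "ASIN", "title", "salesrank"])
    writer.writerows(rows)
```

To use it, open the downloaded txt file, pass the file handle to `books_by_salesrank`, and hand the result to `write_csv` along with an output file opened with `newline=""`. ASINs are kept as strings so that leading zeros and trailing letters survive.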
Explain your methodology and any code you wrote to accomplish this task. What would the estimated compute time be if there were 30 million book entries? How would you change your approach in that case?
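One hedged way to reason about the compute-time question is to time a run on the actual file and extrapolate. The sketch below assumes the work splits into a single linear pass (parsing) plus a comparison sort that grows as n log n. The function `extrapolate_runtime` and its `sort_fraction` parameter are hypothetical, and the model ignores memory limits and disk throughput, which may dominate at 30 million entries.

```python
import math

def extrapolate_runtime(measured_seconds, measured_n, target_n, sort_fraction=0.0):
    """Scale a measured runtime from measured_n entries to target_n entries.

    sort_fraction is the share of the measured time spent sorting (0.0 to 1.0);
    that share is scaled as n*log(n), the remainder linearly.
    """
    if not 0.0 <= sort_fraction <= 1.0:
        raise ValueError("sort_fraction must be between 0 and 1")
    linear_part = (1.0 - sort_fraction) * measured_seconds * (target_n / measured_n)
    sort_growth = (target_n * math.log(target_n)) / (measured_n * math.log(measured_n))
    sort_part = sort_fraction * measured_seconds * sort_growth
    return linear_part + sort_part
```

For example, `extrapolate_runtime(t, 393561, 30_000_000, sort_fraction=0.2)` scales a measured time `t` under the assumption that 20% of it was spent sorting. If the estimate or the memory footprint turns out to be impractical, splitting the input into chunks that are parsed and sorted separately and then merged is one direction to consider.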