What is data science?
“Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured.” – Wikipedia
The goal of data science is to make sense of data in a way that is meaningful and can drive better decisions.
Imagine you are about to join a new school full of new kids, leaving your friends from your old school behind. Finding new friends among all those kids would be quite a tiring task. How would you know which kids at the new school are most likely to become your friends?
A clever person like you would write down all the characteristics you know about your current friends. For example:
My Friends
You can see that all your friends like cartoons, even if only a little. All of them, except one, have blue as their favourite color, and most of them love Math. Now you can use that information to find out which of the kids in your new school are most likely to be your friends.
If their favourite color is blue, and they like cartoons, they are extremely likely to be your friend. Still, any kid who loves cartoons is very likely to be your friend, even if their favourite color is red. Their favourite subject doesn’t matter much, but someone whose favourite color is blue, has Math as their favourite subject, and is a cartoon maniac is probably going to make a very good friend.
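The rule of thumb above can be written down as a tiny function. This is only a sketch: the function name `friend_likelihood`, its parameters, and the returned labels are made up for illustration, and the rule says nothing about kids who don't like cartoons, so the sketch returns "unknown" for them.

```python
def friend_likelihood(favourite_color, likes_cartoons, favourite_subject):
    """Return a rough label based on the rule of thumb described above.

    All names and labels here are illustrative assumptions.
    """
    if not likes_cartoons:
        # The rule of thumb gives no guidance for this case.
        return "unknown"
    if favourite_color == "blue" and favourite_subject == "Math":
        # Blue + Math + cartoons: probably a very good friend.
        return "very good friend"
    if favourite_color == "blue":
        # Blue + cartoons: extremely likely to be a friend.
        return "extremely likely"
    # Any cartoon lover is still very likely, even if their color is red.
    return "very likely"


print(friend_likelihood("red", True, "History"))
```

A real data scientist would usually let the computer learn such a rule from the data instead of writing it by hand, but the idea is the same: turn what you noticed into a rule you can apply to new kids.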
Now, that’s Data Science right there. It’s getting data, arranging it so that it can be easily understood, and making decisions from it.
You did several things to find new friends for yourself:
- Defining the problem: We identified the problem: finding new friends among a large number of kids in the new school would be tiring.
- Collecting the Data: In this case, we already had the data and only had to write it down. If we didn’t know the favourite colors for all the friends, we would have to go and ask them. All this is part of data collection.
- Processing the data: After getting the data, we need to look at it and see if there is some of it that we don’t need. If so, we remove that data. This is called cleaning the data. For example, we have the names of friends as part of the data. That data is not important when deciding who can be a good friend, since all the names are different. Therefore, we can safely clean out the names because we don’t need them. Sometimes you won’t have all the data you need. For example, if we didn’t have Vibhore's favourite color, we could look at all the other favourite colors and make a reasonable guess that it is blue, because that is the most common one.
- Exploring and Analyzing the data: After cleaning the data, we need to look for patterns and see if there are any trends. We noticed, for example, that all the friends liked cartoons. This was easy because there were only 5 friends. If we had 1 million friends to look at, it would take us forever to find a trend or pattern. That is where computers help us. Computers are given special instructions, called algorithms, to find patterns in this big data.
- Showing the results of the analysis: After finding a pattern, we need to present it in a way that can be easily understood. Common ways of showing results include tables, graphs, and pie charts.
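Here is how those steps might look as a short Python sketch. The friends' data below is invented for illustration (only the name Vibhore comes from the example above); it is set up to match the patterns described earlier, with one favourite color deliberately left missing.

```python
from collections import Counter

# Collecting the data: a made-up table of 5 friends (hypothetical values).
friends = [
    {"name": "Vibhore", "color": None, "likes_cartoons": True, "subject": "Math"},
    {"name": "Friend 2", "color": "blue", "likes_cartoons": True, "subject": "Math"},
    {"name": "Friend 3", "color": "blue", "likes_cartoons": True, "subject": "Science"},
    {"name": "Friend 4", "color": "red", "likes_cartoons": True, "subject": "Math"},
    {"name": "Friend 5", "color": "blue", "likes_cartoons": True, "subject": "Art"},
]

# Processing the data: drop the names, since they don't help us decide.
cleaned = [{k: v for k, v in f.items() if k != "name"} for f in friends]

# Fill in the missing color with the most common known color (a guess).
known_colors = [f["color"] for f in cleaned if f["color"] is not None]
most_common_color = Counter(known_colors).most_common(1)[0][0]
for f in cleaned:
    if f["color"] is None:
        f["color"] = most_common_color


# Exploring the data: what fraction of friends share each characteristic?
def share(rows, column, value):
    return sum(1 for r in rows if r[column] == value) / len(rows)


summary = {
    "likes cartoons": share(cleaned, "likes_cartoons", True),
    "favourite color is blue": share(cleaned, "color", "blue"),
    "favourite subject is Math": share(cleaned, "subject", "Math"),
}

# Showing the results: print a small table of patterns.
for pattern, fraction in summary.items():
    print(f"{pattern:<28}{fraction:.0%}")
```

With only 5 friends you could do all of this by hand, but the same code would work just as well on a list of a million friends.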
https://www.youtube.com/watch?v=xC-c7E5PK0Y
Real examples of how Data Science is being used
Identifying Breast Cancer