data science -2

Answer by Pronojit Saha:

SELF STARTER WAY

For a self-starter novice, here is an outline to start with (reproduced from my blog post, How to acquire the "Essential Skill Set"? - the Self Starter way).

0. Basic Pre-requisites:

- Mathematics, Algorithms & Databases: Mathispower4u-Calculus, Coursera-Linear Algebra, Coursera-Analysis of Algorithms, Coursera- Introduction to Databases
- Statistics: Probability and Statistics for Programmers, Statistical Formulas For Programmers, Coursera- Data Analysis, Coursera- Statistics One
- Programming: Google Developers R Programming Lectures, Scientific Python Lectures, How to Think Like a Computer Scientist
1. Acquire & Scrub Data:

- DFS & Databases: Hadoop Tutorial – Yahoo, BigDataUniversity: Big Data Course, Hortonworks Sandbox, Learning to Process Big Data with MapReduce and Hadoop – Hands-On Exercises
- Data Munging: Predictive Analytics: Data Preparation, Data Wrangling in Pandas, Data Wrangler, OpenRefine
2. Filter & Mine Data:

- Data Analysis in R: Data science in R, Coursera-Computing for Data Analysis in R

- Data Analysis in Python (numpy, scipy, pandas, scikit): Getting Started With Python For Data Science, SciPy 2013- NumPy Tutorials, Statistical Data Analysis in Python, Pandas (1st Video Below), SciPy 2013- Introduction to SciKit Learn Tutorial I & II (2nd & 3rd Video Below)

http://www.youtube.com/watch?v=DXPwSiRTxYY&feature=youtu.be
https://www.youtube.com/watch?v=r4bRUvvlaBw
https://www.youtube.com/watch?v=uX4ZirOiWkw

- Exploratory Data Analysis: Exploratory Data Analysis in R, Exploratory Data Analysis in Python, UC Berkeley: Descriptive Statistics, Basic Unix Shell Commands for the Data Scientist
- Data Mining, Machine Learning: Data Mining Map, Coursera-Machine Learning, A Programmer's Guide to Data Mining, STATS 202 Data Mining & Analysis, Mining Massive Data Sets-Stanford, Learning From Data – CalTech, Coursera-Web Intelligence & Big Data

http://www.youtube.com/watch?v=quQmnnb09yQ

3. Represent & Refine Data: Tableau-Training & Tutorials, Data visualisation in R with ggplot2 and plyr, Predictive Analytics: Overview and Data visualization, Flowing Data-Tutorials, UC Berkeley-Data Visualization, D3.js Tutorial

4. Domain Knowledge: The Black-Box, as per your interest.

Combining all the above:

Data Literacy Course — IAP

UC Berkeley Introduction to Data Science

Coursera-Introduction to Data Science

Teach Data Science-Syracuse University

Apply the knowledge:

Harvard Data Science Course Homework

Analyzing Big Data with Twitter

Analyzing Twitter Data with Apache Hadoop

FORMAL WAY

For a more formal way of becoming a data scientist, one can look into this post (reproduced below): How to acquire the "Essential Skill Set"? - the Formal way.

The Essential Skill Set comprises the fundamental skills that every data scientist is expected to know. Traditionally, these can be acquired through a computer science or statistics degree from an institution. The Stanford Computer Science and Statistics courses provide a good reference list of courses to take. Some of these courses are relevant, while many others are not. For example, in Computer Science one would do well to learn about large-scale distributed databases and algorithms, but there is no need to learn HCI and UX, or pure-play storage and operating systems, networking, etc. Similarly, some statistics courses focus too much on, let's say, "old school statistics", including thousands of ways of hypothesis testing, rather than on machine learning (clustering, regression, classification, etc.). So both streams contain nice-to-have courses as well as must-have courses for a data scientist (I dare to claim that at present the percentage of must-have courses seems to be greater in a traditional Statistics stream than in a Computer Science stream). As such, one needs to pick courses wisely.

Alternatively, one can look into the new Data Science programs that some universities are offering, which are built around the points mentioned above. They combine the must-have courses from both traditional statistics and computer science programs to impart the 4 Essential Skills, and also include courses to develop the Differentiator Skills in students. The MS in Data Science at NYU and the MS in Analytics at USF are good examples of such an amalgamation of the requisite courses. A complete list of such programs is presented here: Colleges with Data Science Degrees.

The right program obviously depends on the individual's goal. A recent O'Reilly publication titled 'Analyzing the Analyzers' does a very good job of aggregating the various data scientist roles into 4 main categories according to their skills. An individual may therefore select a program according to the category of data scientist they most identify with, as shown below.

- Data Businesspeople are the product and profit-focused data scientists. They're leaders, managers, and entrepreneurs, but with a technical bent. A common educational path is an engineering degree paired with an MBA, or one of the new Data Science programs mentioned above.
- Data Creatives are eclectic jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open source technologies. They are expected to have an engineering degree (mostly in statistics or economics) but not much in the way of business skills.
- Data Developers are focused on writing software to do analytic, statistical, and machine learning tasks, often in production environments. They often have computer science degrees, and often work with so-called "big data".
- Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have an MS or PhD in statistics, economics, physics, etc., and their creative applications of mathematical tools yield valuable insights and products.

The skills associated with the 4 main categories, which justify the above-mentioned program recommendations, are as below:

How do I become a data scientist?