data science -2
Answer by Pronojit Saha:
SELF STARTER WAY
For a self-starter novice, here is an outline to start with (reproduced from my blog).
0. Basic Pre-requisites:
- Mathematics, Algorithms & Databases
- Statistics
- Programming
1. Acquire & Scrub Data:
- DFS & Databases
- Data Munging
2. Filter & Mine data:
- Data Analysis in R
- Data Analysis in Python (numpy, scipy, pandas, scikit)
- Exploratory Data Analysis
- Data Mining, Machine Learning: http://www.youtube.com/watch?v=quQmnnb09yQ
3. Represent & Refine Data
4. Domain Knowledge: The Black-Box, as per your interest.
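To make steps 1 to 3 of the outline a bit more concrete, here is a minimal sketch in plain Python (standard library only). The city/temperature data, the column names and the function names are all made up for illustration; a real project would more likely use the libraries listed above, such as pandas.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical raw data; a real project would read from a file or a database.
RAW = """city,temp_c
Pune,31.5
Pune,
Delhi,38.0
Delhi,36.0
Pune,29.5
"""

def acquire(text):
    """Step 1 (acquire): load raw rows, here from an in-memory CSV string."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows):
    """Step 1 (scrub): drop rows with a missing reading and convert types."""
    return [
        {"city": r["city"], "temp_c": float(r["temp_c"])}
        for r in rows
        if r["temp_c"].strip()
    ]

def mine(rows):
    """Step 2 (filter & mine): summarise as the mean temperature per city."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["city"]].append(r["temp_c"])
    return {city: mean(vals) for city, vals in groups.items()}

def represent(summary):
    """Step 3 (represent): render the summary as a crude text bar chart."""
    return [
        f"{city:<6}{'#' * round(avg / 2)} {avg:.1f}"
        for city, avg in sorted(summary.items())
    ]

clean = scrub(acquire(RAW))
summary = mine(clean)
for line in represent(summary):
    print(line)
```

Step 4 (domain knowledge) is what tells you whether a missing reading should be dropped, as done here, or filled in some other way.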
Combining all the above:
Apply the knowledge:
For a more formal way of becoming a data scientist, one can look into this post (reproduced below):
The Essential Skill Set comprises the fundamental skills that every data scientist is expected to have. Traditionally, these are acquired through a computer science or statistics degree from an institution. The Stanford course lists provide a good reference for the courses to undertake. Some of these courses are relevant, while many others are not. For example, in computer science one would do well to learn about large-scale distributed databases and algorithms, but there is no need to learn HCI and UX, or pure-play storage, operating systems, networking, etc. Similarly, some statistics courses focus too much on, let's say, "old school statistics", including thousands of ways of hypothesis testing, rather than on machine learning (clustering, regression, classification, etc.). So both streams have many nice-to-have courses as well as must-have courses for a data scientist (I dare to claim that, at present, the percentage of must-have courses is greater in a traditional statistics stream than in a computer science stream). As such, one needs to pick courses wisely.
Alternatively, one can look into the new Data Science programs that some universities are offering, which are built around the points mentioned above. They combine the must-have courses from both the traditional statistics and computer science programs to impart the essential skill set, and also include courses that develop complementary skills in students.
The right program obviously depends on the individual's goal. A recent O'Reilly publication titled 'Analyzing the Analyzers' does a very good job of aggregating the various data scientist roles into 4 main categories according to their skills. An individual may therefore select a program according to the category of data scientist they most identify with, as described below.
- Data Businesspeople are the product and profit-focused data scientists. They're leaders, managers, and entrepreneurs, but with a technical bent. A common educational path is an engineering degree paired with an MBA or the new Data Science programs as mentioned above.
- Data Creatives are eclectic jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open source technologies. They are expected to have an engineering degree (mostly in statistics or economics) but not much in the way of business skills.
- Data Developers are focused on writing software to do analytic, statistical, and machine learning tasks, often in production environments. They often have computer science degrees, and often work with so-called "big data".
- Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have an MS or PhD in statistics, economics, physics, etc., and their creative applications of mathematical tools yield valuable insights and products.
The skills associated with the 4 main categories, which justify the above-mentioned program recommendations, are as follows: