Since this topic came up for discussion several times this week in various places, I decided to write a post on “Data Scientist vs. Data Analytics Engineer”, even though it was not on my list of planned blog posts.
My personal understanding of “Data Science” (DS):
A data scientist is someone who understands the data and the business logic, and who provides predictions by sampling the current business data (also known as “data insights / business insights / data discovery / business discovery”). These predictions show the direction in which the business is heading (good or bad), or where it should head, by spotting trends, so that the business can make the right decision on its next steps, such as:
- improving the product/feature based on user interest levels
- driving more users
- driving more clicks / impressions / conversions / revenue / leads
- user experience
- user retention
In general, “Data Science” is driven by “Data Scientists”, who typically hold a PhD in math, physics, statistics, machine learning or even computer science. Without a PhD in one of these areas, it is unlikely that one will be hired. At a recent ACM conference, a data science hiring manager from a leading online bidding company said in the open Q&A that she can’t hire anyone without a PhD (+ experience).
Data Scientist Qualifications:
- Familiarity with database systems (SQL interface, ad-hoc queries), especially MySQL and Hive at least, to begin with
- Java / Python / simple map-reduce job development, if needed
- Exposure to various analytics functions (over, median, rank, etc.) and how to use them on various data sets
- Mathematics, statistics, correlation, data mining and predictive analytics (predicting the future based on probability & correlation)
- “R” and/or “RStudio” (optionally Excel, SAS, IBM SPSS, MATLAB)
- Deep insight into (statistical) data model development (in an agile fashion). In general, a self-learning model works best in today’s dynamics, as it can learn and tune itself from its own output combined with its performance over time
- Work with (very) large data sets, grouping together various data sets and visualizing them
- Familiar with machine learning and/or data mining algorithms (Mahout, Bayesian, Clustering, etc.)
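To make a few of the items in the list above concrete, here is a minimal sketch of rank, median, a running total (in the spirit of SQL's OVER clause) and a correlation coefficient, using only the Python standard library. The sample data and the helper names `dense_rank`, `running_total` and `pearson` are made up for this illustration; in practice one would more likely use SQL analytic functions or R for the same work.

```python
from itertools import accumulate
from math import sqrt
from statistics import median

# Hypothetical daily click counts per page (illustrative data only).
clicks = {"home": 120, "search": 300, "cart": 80, "checkout": 80}


def dense_rank(values):
    """Rank from highest to lowest; ties share a rank (like SQL DENSE_RANK)."""
    ordered = sorted(set(values.values()), reverse=True)
    position = {value: index + 1 for index, value in enumerate(ordered)}
    return {key: position[value] for key, value in values.items()}


def running_total(series):
    """Cumulative sum, similar to SUM(x) OVER (ORDER BY ...)."""
    return list(accumulate(series))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


ranks = dense_rank(clicks)
middle = median(clicks.values())
totals = running_total(list(clicks.values()))
```

Here `ranks` puts "search" first and lets "cart" and "checkout" share the last rank, which is the kind of tie handling one has to decide on when ranking real data.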
As there are different qualifications and areas of expertise within data science, one needs to pick the right candidate for the type of role they are going to play. For example, a natural language processing (NLP) role may need a different set of skills to match that role. At times it also depends on the team size; one person can be a jack of all trades, or the roles can be split among multiple teams.
At present, there is a lot of demand for “Data Scientists” in the market; it is probably one of the leading job roles after “Data Analytics”. Here is the trend for “Data Science” from Indeed:
Data Analytics (DA) is, in general, a logical extension of (or just a buzzword for) Data Warehousing (DW) and Business Intelligence (BI), providing complete insight into business data in its most usable form. The major difference between warehousing and analytics is that analytics can be real-time and dynamic in most cases, whereas a warehouse is ETL-driven in an offline fashion.
Every business that deals with “data” must have “Data Analytics”; without analytics in place, the business is a dead man walking, without a heart, a soul or a mind.
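The difference between the offline, ETL-driven warehouse and real-time analytics can be sketched roughly as follows. This is only an illustration under simple assumptions: the event data, the function `batch_revenue` and the class `RealTimeRevenue` are hypothetical names, and a real system would sit on top of an ETL tool or a streaming platform rather than an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical click events as (page, revenue) pairs; illustrative data only.
events = [("home", 1.0), ("cart", 5.0), ("home", 2.0), ("cart", 0.5)]


def batch_revenue(all_events):
    """Warehouse style: recompute totals from the full event set on a schedule."""
    totals = defaultdict(float)
    for page, revenue in all_events:
        totals[page] += revenue
    return dict(totals)


class RealTimeRevenue:
    """Analytics style: keep totals current as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, page, revenue):
        # Update the running total and return the fresh value immediately.
        self.totals[page] += revenue
        return self.totals[page]


live = RealTimeRevenue()
for page, revenue in events:
    live.on_event(page, revenue)

nightly = batch_revenue(events)
```

Both paths end up with the same totals; the difference is when the answer is available. The batch job is only as fresh as its last run, while the incremental version can be queried after every event.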
Data Analytics (Engineer) Qualifications:
- Familiar with data warehousing and business intelligence concepts
- Strong in-depth exposure to SQL and analytic solutions
- Exposure to Hadoop platform based analytics solutions (HBase, Hive, Map-reduce jobs, Impala, Cascading, etc.)
- Exposure to various commercial enterprise analytical data stores (Vertica, Greenplum, Aster Data, Teradata, Netezza, etc.), especially how to store and retrieve data from these stores in the most efficient manner
- Familiar with various ETL tools (especially for transforming different sources of data into analytics data stores) and, if needed, able to make everything (or some critical business features) real-time
- Schema design for storing and retrieving data efficiently
- Familiar with various tools and components in the data architecture
- Decision making skills (real-time vs ETL, using X component instead of Y for implementing Z etc.)
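As a small illustration of the schema design point in the list above, here is a sketch of a tiny star schema (one fact table, one dimension table) with an index chosen for a typical query. It uses SQLite through Python's `sqlite3` module purely for convenience; the table names, column names and sample rows are assumptions made up for this example, and the analytical stores listed above each have their own storage and indexing features.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: one row per page.
    CREATE TABLE dim_page (
        page_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    );
    -- Fact table: one row per click, pointing at the dimension.
    CREATE TABLE fact_click (
        click_id   INTEGER PRIMARY KEY,
        page_id    INTEGER NOT NULL REFERENCES dim_page(page_id),
        click_date TEXT NOT NULL,
        revenue    REAL NOT NULL
    );
    -- Index matching the common "filter by day, group by page" access path.
    CREATE INDEX idx_click_date_page ON fact_click (click_date, page_id);
    """
)
conn.executemany(
    "INSERT INTO dim_page (page_id, name) VALUES (?, ?)",
    [(1, "home"), (2, "cart")],
)
conn.executemany(
    "INSERT INTO fact_click (page_id, click_date, revenue) VALUES (?, ?, ?)",
    [(1, "2013-01-01", 1.0), (2, "2013-01-01", 5.0), (1, "2013-01-02", 2.0)],
)


def revenue_by_page(connection, day):
    """Total revenue per page for one day, joined through the dimension."""
    rows = connection.execute(
        """
        SELECT p.name, SUM(f.revenue)
        FROM fact_click AS f
        JOIN dim_page AS p ON p.page_id = f.page_id
        WHERE f.click_date = ?
        GROUP BY p.name
        ORDER BY p.name
        """,
        (day,),
    )
    return rows.fetchall()


daily = revenue_by_page(conn, "2013-01-01")
```

The design choice worth noticing is that the index is driven by the query, not by the table: if most reports filter by page first and date second, the column order in the index would be reversed.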
Sometimes a Data Analytics Engineer also takes on data mining on demand, as they have a better understanding of the data than anyone else; in general, the two roles have to work closely together to get better results.
Data Analytics can also be divided or shared between 4 different teams or people (as it is hard to hire one person with the complete skill set, and moreover administration is different from development):
- data architect
- database administrator
- analytics engineer and
At present, “Data Analytics” is probably one of the hot jobs (maybe Hadoop/Big Data Engineer has taken over by now). Here is the trend for “Data Analytics” from Indeed; it may continue to be “hot” for a while, as most businesses need to have data analytics in place.
Even though “Data Science” and “Data Analytics” look similar in terms of technology domain, data science is a data consumer within the business unit and depends solely on the data provided by the data analytics team. Beyond that, most model predictions and algorithms work really well on large data sets, because estimates become more reliable as the data grows; the bigger the data, the better your chance of predicting correctly and driving the business further. This means the two depend directly on each other. If you have an engineer with both sets of qualifications, that person can play both roles.
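One narrow way to see why bigger data sets tend to give better predictions is the standard error of an estimated rate, which shrinks in proportion to one over the square root of the sample size. The sketch below assumes independent samples and a hypothetical 5% conversion rate; the function name `standard_error` is mine, and this is only one statistical view of the point above, not a full account of how models behave on large data.

```python
from math import sqrt


def standard_error(rate, sample_size):
    """Standard error of an estimated proportion, assuming independent samples."""
    return sqrt(rate * (1.0 - rate) / sample_size)


# Hypothetical 5% conversion rate measured on growing samples.
for n in (100, 10_000, 1_000_000):
    print(n, standard_error(0.05, n))
```

Each hundredfold increase in sample size cuts the uncertainty by a factor of ten, which is why the data analytics team's ability to deliver large, clean data sets matters so much to the data science team.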
Academy: How to Become a Data Scientist or Data Analytics Engineer
- Most higher-education institutions in the US now offer “Data Science” and “Data Analytics” courses, including popular institutions like Berkeley, Stanford, Columbia, Harvard, etc.
- Here is a reference link on colleges offering these subjects as courses (may not be complete & accurate, better check the institute directly):
- Here is one more online course (instructor-led or video) on the same topic (which covers pretty much everything needed for today’s world):
Data Science – Books
- Here is a nice blog post from Carl Anderson on all freely available data science books and materials