A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.

Well, in layman's terms, a data scientist is simply someone who works with data. This involves many activities, such as sampling and pre-processing of data, model estimation, and post-processing (e.g. sensitivity analysis, model deployment, back-testing, and model validation).

A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge.

Good data scientists will not just address business problems; they will pick the right problems, the ones with the most value to the organization. Rising alongside the relatively new technology of big data is the new job title of data scientist. While not tied exclusively to big data projects, the data scientist role complements them because of the increased breadth and depth of the data being examined, as compared to traditional roles.

The variety of projects that a data scientist may be engaged in is incredibly broad. Here are a few examples:

    • Tactical Optimization - improvement of marketing campaigns, business processes, etc.
    • Predictive Analytics - anticipating future demand, future events, etc.
    • Nuanced Learning - e.g. developing a deep understanding of consumer behavior
    • Recommendation Engines - e.g. Amazon product recs, Netflix movie recs
    • Automated Decision Engines - e.g. automated fraud detection, and even self-driving cars
More than anything, what data scientists do is make discoveries while swimming in data. It is their preferred method of navigating the world around them. At ease in the digital realm, they are able to bring structure to large quantities of formless data and make analysis possible. They identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set. In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data.
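As a small illustration of one of the project types above (a recommendation engine), here is a minimal sketch of user-based collaborative filtering in pure Python. The ratings data, user names, and item names are entirely made up for illustration; real systems like Amazon's or Netflix's are far more sophisticated.

```python
import math

# Hypothetical toy ratings: user -> {item: rating}. Illustrative data only.
ratings = {
    "alice": {"matrix": 5, "inception": 4, "up": 1},
    "bob":   {"matrix": 4, "inception": 5, "up": 2, "avatar": 5},
    "carol": {"matrix": 1, "inception": 2, "up": 5, "avatar": 1},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, k=1):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Here `recommend("alice")` suggests "avatar", because bob (whose tastes closely match alice's) rated it highly, while the dissimilar carol did not.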


    1. Statistics Expert:
      Someone who has the ability to determine the most appropriate statistical techniques for addressing different classes of problems, apply the relevant techniques, and translate the results into insights in such a way that the business can understand the value. This is predicated on a thorough understanding of statistical techniques (e.g., regression analysis, cluster analysis, and optimization techniques) and of the tools and languages used to run the analysis (e.g., SAS or R). At a minimum, a basic understanding of statistics is vital for a data scientist.
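As a concrete taste of one of the techniques named above, here is a minimal sketch of simple linear regression (ordinary least squares) written in pure Python; in practice a data scientist would reach for R, SAS, or a library such as statsmodels rather than hand-rolling it.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); intercept follows from the means
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Fitting the points (1, 3), (2, 5), (3, 7), (4, 9) recovers the line y = 2x + 1 exactly.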

    2. Programming Expertise:
      Someone who has the ability to determine the appropriate software packages or modules to run, the ability to modify them, and the ability to design and develop new computational techniques to solve business problems (e.g., machine learning, natural language processing, graph/social network analysis, neural nets, and simulation modelling). Typically, the data scientist will have a computer science background and be comfortable designing and programming in a variety of languages, including Java, Python, C++, or C#.
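To make one of these computational techniques concrete, here is a minimal sketch of graph/social network analysis: computing degree centrality over a toy friendship graph using only the Python standard library. The names and edges are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy friendship graph (undirected edges), illustrative only.
edges = [("ann", "ben"), ("ann", "cat"), ("ben", "cat"), ("cat", "dan")]

# Build an adjacency structure from the edge list.
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

# Degree centrality: a node's degree divided by the maximum possible degree.
n = len(adjacency)
centrality = {node: len(neigh) / (n - 1) for node, neigh in adjacency.items()}
```

In this toy network "cat" is connected to all three other people, so her centrality is 1.0, flagging her as the most connected node.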

    3. Multi-variable Calculus and Linear Algebra:
      Understanding these concepts is most important at companies where the product is defined by the data and small improvements in predictive performance or algorithm optimization can lead to huge wins for the company.
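A small sketch of why this matters: gradient descent, the workhorse of algorithm optimization, is nothing more than repeatedly stepping against the partial derivatives of a function. The quadratic below is an illustrative example, not a real company objective.

```python
# Minimize f(x, y) = (x - 3)**2 + (y + 1)**2, whose gradient is
# (2 * (x - 3), 2 * (y + 1)); the minimum sits at (3, -1).

def gradient_descent(lr=0.1, steps=200):
    x, y = 0.0, 0.0                        # arbitrary starting point
    for _ in range(steps):
        gx, gy = 2 * (x - 3), 2 * (y + 1)  # partial derivatives (calculus)
        x -= lr * gx                       # step against the gradient
        y -= lr * gy
    return x, y
```

The same idea, expressed with vectors and matrices (linear algebra), scales to the millions of parameters in modern predictive models, which is why small improvements here can translate into large wins.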

    4. Database Technology Expert:
      Someone who has a thorough understanding of external and internal data sources, and of how they are gathered, stored, and retrieved. This will enable the data scientist (and, by extension, the business as a whole) to extract, transform, and load data stores; retrieve data from external sources (through screen scraping and data transfer protocols); use and manipulate large big data stores (like Hadoop, Hive, Mahout, and an entire range of emerging Big Data technologies); and use the disparate data sources to analyze the data and generate insights.
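As a minimal sketch of the extract-transform-load (ETL) pattern described above, the snippet below cleans some made-up raw records and loads them into an in-memory SQLite database using Python's built-in sqlite3 module; real pipelines would target a warehouse or a Hadoop/Hive cluster, but the shape of the work is the same.

```python
import sqlite3

# Extract: raw records as they might arrive from a source system.
# Hypothetical data, including whitespace noise and a missing value.
raw_rows = [("2024-01-01", " 120 "), ("2024-01-02", "95"), ("2024-01-03", "")]

# Transform: parse, clean, and drop unusable records.
clean_rows = [
    (day, int(value.strip()))
    for day, value in raw_rows
    if value.strip()
]

# Load into a queryable store, then analyze.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
total = conn.execute("SELECT SUM(units) FROM sales").fetchone()[0]
```

After the load, the store answers questions directly: the two usable rows sum to 215 units.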

    5. Data Visualization & Communication:
      This is important because it enables those who are not professional data analysts to interpret data. Visualizing and communicating data is incredibly important, especially at young companies that are making data-driven decisions for the first time, or at companies where data scientists are viewed as people who help others make data-driven decisions.

      Accordingly, the data scientist should be able to take statistical and computational analysis and turn it into graphs, charts, and animations; create visualizations (e.g., motion charts, word maps) that clearly show the insights from the data and the corresponding analytics; and generate static and dynamic visualizations across a variety of visual media (e.g., reports; screens ranging from mobile to laptop/desktop to large HD visualization walls; interactive programs; and, perhaps soon, augmented reality glasses).

      When it comes to communicating, this means describing your findings, or the way your techniques work, to both technical and non-technical audiences. On the visualization side, it can be immensely helpful to be familiar with data visualization tools like ggplot and d3.js. It is important to be familiar not just with the tools necessary to visualize data, but also with the principles behind visually encoding data and communicating information.
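To make "visually encoding data" concrete, here is a minimal sketch of the core idea behind tools like d3.js: mapping data values onto visual attributes, in this case bar heights in a hand-built SVG. The scaling choices are illustrative, not a real charting library's API.

```python
def bar_chart_svg(values, width=40, max_height=100):
    """Encode each value as the height of an SVG bar, scaled to the maximum."""
    peak = max(values)
    bars = []
    for i, v in enumerate(values):
        h = round(max_height * v / peak)        # encode value as bar height
        x, y = i * (width + 5), max_height - h  # bars grow upward from the base
        bars.append(f'<rect x="{x}" y="{y}" width="{width}" height="{h}"/>')
    total_w = len(values) * (width + 5)
    return f'<svg width="{total_w}" height="{max_height}">' + "".join(bars) + "</svg>"
```

Calling `bar_chart_svg([1, 2, 4])` yields three bars whose heights are proportional to the values, which is exactly the kind of data-to-pixels mapping that d3.js automates at scale.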