People typically categorize data science and machine learning work (as if there’s a difference -_-) in terms of supervised and unsupervised learning. But to me that’s a little too reductionist. Data science covers so much ground depending on the industry you’re working in and what your role is. For example, consider the following responsibilities…

  • All ETL and feature engineering that has to be done to transform data.
  • Optimizing loss functions to ensure predictive accuracy in supervised/unsupervised learning, deep learning, etc.
  • Optimizing an objective function or solving systems of equations (e.g. operations research work).
  • Exploring parameter or predictive distributions as you would in Bayesian statistics, probabilistic modeling, reliability engineering, etc.
  • Stepping into engineering roles involving connecting infrastructure, creating data pipelines, developing libraries, and enforcing standards/best-practices.
  • Visualizing data and results from the above responsibilities.

From my experience, it’s better to categorize data science work in terms of the actions in the list above: transforming, optimizing, exploring, engineering, and visualizing. So instead of defining models as unsupervised/supervised, you have optimizing loss/objective functions and exploring distributions associated with the data generation process, and you can categorize these problems as optimization problems. This allows you to separate models from ETL and metric creation work, which would be categorized as data transformation problems.

It’s a nicer way to classify the responsibilities of data science and machine learning work, all of which often get lumped into terminology like “models”. So rather than overloading the term “model”, this interpretation directly relates to what a data scientist is doing. Terms like unsupervised and supervised seem irrelevant to me since it’s usually pretty obvious whether the data scientist is using an outcome variable or not.
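To make that concrete, here is a minimal sketch in which the same action (optimizing a loss) covers both a "supervised" and an "unsupervised" problem. Everything in it is an assumption for illustration: the toy data is made up, and `minimize`, `supervised_loss`, and `unsupervised_loss` are hypothetical names, not part of any library.

```python
from typing import Callable

def minimize(loss: Callable[[float], float], start: float = 0.0,
             lr: float = 0.05, steps: int = 500, eps: float = 1e-6) -> float:
    """Plain gradient descent on a one-parameter loss (hypothetical helper).

    The gradient is estimated with a central difference, so the function
    only needs to be able to evaluate the loss.
    """
    theta = start
    for _ in range(steps):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

# Made-up toy data: ys is exactly 2 * xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def supervised_loss(w: float) -> float:
    # Uses an outcome variable: mean squared error of the line y = w * x.
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def unsupervised_loss(c: float) -> float:
    # No outcome variable: mean squared distance of the points to a center c
    # (the one-cluster case of a k-means style objective).
    return sum((x - c) ** 2 for x in xs) / len(xs)

# The action is identical in both cases: optimize a loss function.
slope = minimize(supervised_loss)
center = minimize(unsupervised_loss)
print(f"slope ~ {slope:.3f}, center ~ {center:.3f}")
```

The only thing that changes between the two calls is the loss being passed in; whether an outcome variable shows up inside that loss is a detail of the problem, not a different kind of work.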

To aggressively abstract this, I guess what I’m saying is: classify models based on what you’re doing or the algorithm you’re using rather than on the data itself.