Makefiles for Workflow

The make utility is typically used to make it easier to maintain source code that needs to be compiled, but it also helps streamline workflows in general.

A makefile is made up of rules, each consisting of a target, its dependencies, and the commands that build it. The target is typically a compiled binary, the dependencies are source files, and the commands are the instructions that compile the source code into the binary. If you modify any of the dependencies, running make (which builds the first target in the file) or make target-name reruns the commands to update the target. make decides this by comparing file modification times, so a target that is already newer than its dependencies is left alone.

target: dependencies
	commands
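
The rebuild decision described above comes down to comparing file modification times. Here is a minimal Python sketch of that idea. It is not how make is actually implemented (real make also handles chained rules, missing dependencies, and more), and the helper name needs_rebuild and the file names are made up for illustration.

```python
import os
import tempfile
import time


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


# Demonstrate with throwaway files (hypothetical names, created in a temp dir).
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "main.c")
    binary = os.path.join(tmp, "main")

    open(source, "w").close()
    missing_target = needs_rebuild(binary, [source])  # target not built yet

    open(binary, "w").close()
    now = time.time()
    os.utime(source, (now - 10, now - 10))  # source older than the target
    os.utime(binary, (now, now))
    up_to_date = needs_rebuild(binary, [source])

    os.utime(source, (now + 10, now + 10))  # source "edited" after the build
    stale = needs_rebuild(binary, [source])

print(missing_target, up_to_date, stale)  # True False True
```

A target with no dependencies, like the task-style targets used later in this post, never gets the "up to date" shortcut unless a file with that name happens to exist.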

We can still use makefiles when we’re not compiling code. Below is a simplified example of how I’m using one in a data-science-style workflow.

We have some Python code that generates data.

# generate.py

import pandas as pd
import numpy as np

N = 10
data = pd.DataFrame({'x': np.random.normal(3, 1, N)})

data.to_csv('./data.csv', index=False)

And some code that computes the maximum likelihood estimate (MLE) from that data (assuming that the data is normally distributed).

# estimate.py

from datetime import datetime
import pandas as pd
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

print('Running simulation...')

data = pd.read_csv('data.csv')
x = data['x']

# MLE (a maximization problem turned into a minimization problem
# by negating the log-likelihood)
def objective_function(params, x):
    mean, sd = params
    # logpdf avoids the underflow that log(pdf) can hit for tiny densities
    log_likelihood = np.sum(norm.logpdf(x, loc=mean, scale=sd))
    return -log_likelihood

# mean is unbounded; sd is bounded below at 0.5 so it stays positive
bnds = ((None, None), (0.5, None))
result = minimize(fun=objective_function, x0=[0.5, 0.5], bounds=bnds, args=(x,))
result = dict(result)

with open('./mle_result.txt', "w") as f:
    f.write('Timestamp: {0}\n'.format(datetime.now()))
    for key, value in result.items():
        f.write('{0}: {1}\n'.format(key, value))

print('Data: mean [{0}], sd [{1}]'.format(np.mean(x), np.std(x)))
print('MLE: mean [{0}], sd [{1}]'.format(result['x'][0], result['x'][1]))
print('... complete')
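
As a sanity check on the optimizer, the MLE for normally distributed data has a closed form: the sample mean and the population standard deviation (dividing by N, not N - 1). The sketch below uses only the standard library and its own made-up sample rather than data.csv, so the numbers will differ from a real run. Also note that estimate.py bounds sd below at 0.5, so its result should only match the closed form when the sample sd is at least 0.5.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the example is repeatable
x = [random.gauss(3, 1) for _ in range(10)]  # stand-in for the 'x' column

# Closed-form MLE for a normal sample
mle_mean = statistics.fmean(x)
mle_sd = statistics.pstdev(x)  # population sd: divides by N, not N - 1


def negative_log_likelihood(mean, sd, values):
    """Negative normal log-likelihood, written out by hand."""
    return sum(
        0.5 * math.log(2 * math.pi * sd ** 2) + (v - mean) ** 2 / (2 * sd ** 2)
        for v in values
    )


best = negative_log_likelihood(mle_mean, mle_sd, x)
# Nudging either parameter away from the closed-form values makes the fit worse.
worse_mean = negative_log_likelihood(mle_mean + 0.1, mle_sd, x)
worse_sd = negative_log_likelihood(mle_mean, mle_sd * 1.1, x)

print('closed-form MLE: mean [{0}], sd [{1}]'.format(mle_mean, mle_sd))
print(best < worse_mean, best < worse_sd)  # True True
```

This is why the script's "Data:" and "MLE:" lines are expected to agree closely: np.std uses the same divide-by-N definition by default.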

In this case the workflow might involve:

Simulating the model (generating data and finding the MLE).
Querying the data that we generated.
Cleaning up all the files (i.e. the output from the Python scripts).

We can wrap all of this up in a makefile.

# Makefile

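# These targets are task names, not files, so mark them as phony
.PHONY: simulate preview clean

# Default number of lines to preview; override with `make preview nrow=6`
nrow ?= 10

simulate:
	@python generate.py
	@python estimate.py

preview:
	@head -n "$(nrow)" data.csv

clean:
	@rm -f ./data.csv ./mle_result.txt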

Running make simulate runs the Python scripts, creating the data file and the MLE result. Running make preview nrow=6 prints the first 6 lines of the data file (the header plus five rows of data). Running make clean removes the data file and the MLE result. (The @ prefix is specific to make and prevents the command itself from being echoed to standard output.)