Makefiles for Workflow
2020-09-12
The make utility is typically used to make it easier to maintain source code that needs to be compiled, but it can help streamline almost any workflow built from repeatable commands.
A makefile is made up of targets, each of which consists of dependencies and commands. Your target is typically a compiled binary, the dependencies are source files, and the commands are the instructions that compile the source code into the binary. If you modify any of the dependencies, running make or make target-name will rerun the commands to update the target.
target: dependencies
	commands
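The rebuild rule can be sketched in Python: a target is treated as stale when it does not exist or when any dependency has a newer modification time. The function name needs_rebuild is hypothetical, and this is only a rough model of the timestamp comparison, not how make is actually implemented.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency.

    Hypothetical helper: a simplified model of make's timestamp check.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any dependency modified after the target makes the target stale.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

When the function returns False, the target is already up to date and the commands would be skipped.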
We can still use makefiles in situations where we’re not compiling code. Below is a simplified example of how I use one in a data science workflow.
We have some Python code that generates data.
# generate.py
import pandas as pd
import numpy as np
N = 10
data = pd.DataFrame({'x': np.random.normal(3, 1, N)})
data.to_csv('./data.csv', index=False)
And some code that computes the maximum likelihood estimate (MLE) from that data (assuming that the data is normally distributed).
# estimate.py
from datetime import datetime
import pandas as pd
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
print('Running simulation...')
data = pd.read_csv('data.csv')
x = data['x']
# MLE: maximizing the likelihood is recast as minimizing the negative log-likelihood
def objective_function(params, x):
    # params[0] is the mean, params[1] is the standard deviation
    log_likelihood = 0.0
    for value in x:
        log_likelihood += np.log(norm.pdf(value, params[0], params[1]))
    return -log_likelihood
# The mean is unbounded; the standard deviation is bounded below at 0.5
bnds = ((None, None), (0.5, None))
result = minimize(fun=objective_function, x0=[0.5, 0.5], bounds=bnds, args=(x,))
result = dict(result)
with open('./mle_result.txt', "w") as f:
    f.write('Timestamp: {0}\n'.format(datetime.now()))
    for key, value in result.items():
        f.write('{0}: {1}\n'.format(key, value))
print('Data: mean [{0}], sd [{1}]'.format(np.mean(x), np.std(x)))
print('MLE: mean [{0}], sd [{1}]'.format(result['x'][0], result['x'][1]))
print('... complete')
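For a normal model the MLE also has a closed form: the sample mean and the population standard deviation (dividing by N rather than N - 1). That gives a quick sanity check on the optimizer's output. The sketch below uses only the standard library; the function names and the sample values are illustrative assumptions, not part of estimate.py.

```python
import math
from statistics import fmean, pstdev


def normal_mle(x):
    """Closed-form MLE for a normal model: (sample mean, population sd)."""
    return fmean(x), pstdev(x)


def negative_log_likelihood(params, x):
    """Negative log-likelihood of x under Normal(mean, sd)."""
    mean, sd = params
    n = len(x)
    squared_error = sum((value - mean) ** 2 for value in x)
    return (n * math.log(sd)
            + 0.5 * n * math.log(2 * math.pi)
            + squared_error / (2 * sd ** 2))


sample = [2.1, 3.4, 2.9, 3.8, 2.6]  # made-up values for illustration
mean_hat, sd_hat = normal_mle(sample)
```

These are the same quantities that estimate.py prints on its Data line, so the MLE line should agree with them closely as long as the estimated standard deviation stays above the 0.5 lower bound.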
In this case the workflow might involve:
- Simulating the model (generating data and finding the MLE).
- Querying the data that we generated.
- Cleaning up all the files (i.e. the output from the Python scripts).
We can wrap all of this up in a makefile.
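# Makefile
# These targets are task names rather than files, so declare them phony
.PHONY: simulate preview clean
simulate:
	@python generate.py
	@python estimate.py
preview:
	@head -n "$(nrow)" data.csv
clean:
	@rm ./data.csv
	@rm ./mle_result.txt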
Running make simulate will run the Python scripts and create the data and the MLE result. Running make preview nrow=6 will print the first 6 lines of the data file (the header plus 5 rows of data). Running make clean will remove the data file and the MLE result. (The @ prefix is specific to the make utility and prevents the command itself from being echoed to standard output.)
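Because head counts lines rather than records, the CSV header is included in the count passed as nrow. A minimal Python sketch of the same behavior, with a hypothetical preview function standing in for the make target:

```python
from itertools import islice


def preview(path, nrow):
    """Return the first `nrow` lines of a text file, like `head -n nrow`."""
    with open(path) as f:
        # islice stops after nrow lines without reading the rest of the file.
        return [line.rstrip("\n") for line in islice(f, nrow)]
```

Calling preview('data.csv', 6) on the generated file would return the header line followed by the first five values.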