Notebook Automation in the Cloud


Purpose: In this post, we discuss how to set up a custom Python environment in AWS with the specific permissions needed to run Jupyter notebooks in an automated, headless manner on a compute instance and schedule of your choosing. In my experience, this is a very flexible, cost-efficient, and reliable way to automate repetitive notebook runs.

This post assumes:

  1. Basic familiarity with cloud setup
  2. You already have a working notebook that you want to automate. Notebook development and testing precede this setup.

Without further ado, let’s dive in!


Term Glossary

EventBridge: AWS service for scheduling events and triggers
ECR: Elastic Container Registry, the AWS service that stores the container images you build from a Dockerfile
AWS Lambda: Serverless service that runs a function (here, Python) in the cloud
SageMaker: AWS cloud service for Jupyter notebooks and machine learning workloads
DLQ: Dead letter queue, a queue that collects events that failed to be delivered or processed

Contents

  1. Background

  2. Birds Eye View

  3. Setup

    1. Configure Run Environment
    2. Add Permissions
    3. Create AWS Setup
    4. Parameterize & Test Notebook
    5. Schedule Notebook
    6. Bonus: Debugging
    7. Bonus: Updates
  4. Conclusion

Background

The beautiful simplicity of UNIX cron jobs: something that just works in the background without you having to think about it. Can we do something like that for our notebooks in the cloud?

This post covers the steps needed to set up scheduled Jupyter notebook runs in the cloud, on hardware and a schedule of your choosing. Why? It might be useful for several reasons:

  1. Generating a report based on a daily data feed, possibly with some parameters
  2. Retraining a shadow-mode model every day and getting the performance report by email
  3. Automating a repetitive analysis.

Having a reliable setup in the cloud means you don’t have to keep your machine running or worry about the network, something going wrong with your computer, and so on. With this setup, using SageMaker Processing Jobs, you only pay for the time your notebook actually runs. Let’s see how it looks!

The entire setup shouldn’t take more than 30 minutes.

Birds Eye View

The system uses EventBridge to trigger a Lambda function, which spins up a SageMaker Processing Job using a container image stored in ECR.
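
To make the flow concrete, here is a minimal sketch of what such a Lambda handler could look like, using boto3’s create_processing_job. This is not the library’s actual implementation; the image URI, role ARN, instance type, and the PAPERMILL_PARAMS environment variable are placeholders for illustration only.

import time
import boto3

sagemaker = boto3.client("sagemaker")

def handler(event, context):
    # Triggered by an EventBridge rule; starts a SageMaker Processing Job
    # that runs the notebook inside the container image stored in ECR.
    job_name = "notebook-run-{}".format(int(time.time()))
    sagemaker.create_processing_job(
        ProcessingJobName=job_name,
        RoleArn="arn:aws:iam::123456789012:role/ExecutionRole",  # placeholder
        AppSpecification={
            # Placeholder ECR image built from your customized Dockerfile
            "ImageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/notebook-runner:latest",
        },
        ProcessingResources={
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",
                "VolumeSizeInGB": 30,
            }
        },
        # Hypothetical way of passing notebook parameters into the container
        Environment={"PAPERMILL_PARAMS": str(event.get("parameters", {}))},
    )
    return {"ProcessingJobName": job_name}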

Setup

This is typically done from a terminal in SageMaker Studio or on a running notebook instance.

Configure Run Environment

There’s a handy library called sagemaker-run-notebook that helps you set this up. Clone it and then make the specific changes you need for your project. I had a custom requirement for my project, so I made modifications to the original library.

git clone https://github.com/aws-samples/sagemaker-run-notebook.git
# Edit for custom OS installs
# sagemaker_run_notebook/container/Dockerfile
 
# Edit for custom Python library installs
# sagemaker_run_notebook/container/requirements.txt

Check these commits for reference.

Add Permissions

Refer to the note on permissions. The package creates a very basic and minimal role. If your notebook accesses databases, Glue connections, etc., then you need to attach the corresponding policies to this role. Please configure the policies accordingly.

I would consider adding AWSLambdaBasicExecutionRole to the permissions here. The author set the role up without CloudWatch logging permissions, which caused issues for me while debugging.

Another thing to consider is adding a dead-letter queue (DLQ) to the EventBridge target so that failed invocations can be inspected later.
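
For example, attaching the basic Lambda execution policy to the role created by the package could look like the sketch below; the role name is a placeholder for whatever name the infrastructure step created in your account.

import boto3

iam = boto3.client("iam")
# Attach CloudWatch logging permissions to the notebook execution role.
iam.attach_role_policy(
    RoleName="BasicExecuteNotebookRole",  # placeholder: use your role's actual name
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
)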

Create AWS Setup

Once you have tweaked the library code to your liking, install it by running pip install . in the library folder. Now run these commands to set everything up on your AWS account:

run-notebook create-infrastructure
run-notebook create-container
# Another way to add Python libraries to the run environment
# run-notebook create-container --requirements requirements.txt
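
If you want to sanity-check what was created, a quick look with boto3 helps. This sketch assumes nothing about resource names and simply lists what exists; inspect the output for the repository and Lambda function the tool created.

import boto3

# List ECR repositories and Lambda functions to confirm the container
# image repository and the execution Lambda were created.
ecr = boto3.client("ecr")
print([r["repositoryName"] for r in ecr.describe_repositories()["repositories"]])

lam = boto3.client("lambda")
print([f["FunctionName"] for f in lam.list_functions()["Functions"]])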

Parameterize & Test Notebook

Now you can test the notebook! The cool thing is that you can specify parameters for each schedule that the notebook should use at runtime. This library uses Papermill under the hood, so it is very important that the notebook has one cell containing your parameters, with a tag named parameters added to that cell. See the steps on how to do that here. Once you do this, the value provided via -p "python_variable=value" below will be injected into the tagged cell at runtime. Run the command below and verify the CloudWatch logs to ensure everything ran smoothly.
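
For example, a tagged parameters cell could look like the sketch below; the variable names are purely illustrative. Papermill injects a new cell right after the tagged one containing the values you pass via -p.

# Notebook cell tagged "parameters" (add the tag via the cell's tag editor).
# These are defaults; values passed with -p at run time override them.
python_variable = "default_value"
report_date = "2024-01-01"  # illustrative second parameter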

# Edit settings
run-notebook run notebook_path \
-p "python_variable=value" \
--instance "instance_type" \
--extra '{"NetworkConfig":{"VpcConfig":{"SecurityGroupIds":["id"], "Subnets":["subnet_id"]}}}' # Custom VPC configuration etc.

Schedule Notebook

If everything went smoothly, you can now schedule your notebook. Please refer to the EventBridge cron expression reference. The example below, cron(0 0 * * ? *), runs the notebook daily at midnight UTC.

# Edit settings
run-notebook schedule --at "cron(0 0 * * ? *)" \
--name schedule_name notebook_path \
-p "python_variable=value" \
--instance "instance_type" \
--extra '{"NetworkConfig":{"VpcConfig":{"SecurityGroupIds":["id"], "Subnets":["subnet_id"]}}}' # Custom VPC configuration etc.

To unschedule,

run-notebook unschedule schedule_name

Bonus: Debugging

If something went wrong, you can download the exact output notebook produced by the failed run.

run-notebook list-runs --rule schedule_name # Note failed jobname
run-notebook download jobname # Use failed jobname to download the notebook output
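
If the Lambda failed before a processing job was even started, the CloudWatch logs are the place to look. Below is a sketch for pulling the most recent log stream; the log group name assumes the library's Lambda is called RunNotebook, so adjust it to the function name that actually exists in your account.

import boto3

logs = boto3.client("logs")
group = "/aws/lambda/RunNotebook"  # assumed function name; adjust as needed

# Fetch the newest log stream and print its events
streams = logs.describe_log_streams(
    logGroupName=group, orderBy="LastEventTime", descending=True, limit=1
)
for stream in streams["logStreams"]:
    for event in logs.get_log_events(
        logGroupName=group, logStreamName=stream["logStreamName"], limit=50
    )["events"]:
        print(event["message"], end="")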

Bonus: Updates

If you change the notebook code and want to reschedule it, simply unschedule and then reschedule with the same name schedule_name.

Conclusion

We learned how to set up a custom Python environment in AWS with the specific permissions needed to run your Jupyter notebooks in an automated, headless manner on a compute instance and schedule of your choosing. I hope this setup unblocks you in automating your work and gives you a powerful framework for building custom, frugal machine learning workflow automation. Happy experimenting!