Notebooks are HOT and are popping up everywhere. They are a product of the Literate Programming paradigm that combines code, tables, charts, and documentation into one rich, shareable document. They are perfect for prototyping, building reports, and analyzing data, but as the technology has matured, they have become part of many other use cases.
Under the covers, a notebook is essentially a Read-Evaluate-Print Loop (REPL) that creates an interactive environment for quick, easy development. There is no compiling, no building, and most importantly, no waiting. Jupyter and Zeppelin are examples of open source notebooks that are widely used.
Notebooks have traditionally been used for scientific development. They allow for fast failure and quick iteration, and they enable reproducibility, since they store all of the implementation details. They've become the tool of choice for data scientists, thanks to their support for languages like Python and R and their ability to render data visuals.
At Talroo we have increasingly been using notebooks for both data science and data engineering work. Additionally, taking the lead from Netflix, instead of pulling our code out of notebooks for production use, we have kept the notebooks themselves as part of our production systems.
As a data engineer, I am going to focus on the data engineering side of things a little bit. The bread-and-butter of any data engineer is the Extract-Transform-Load (ETL) process. There are a lot of open source products (such as Spotify’s Luigi or Apache Airflow) on the market to make the ETL process smoother. However, even though they make writing/orchestrating the ETL process much easier, the additional infrastructure and maintenance cost may not be worth it.
At Talroo, we deal with a LOT of data and we use Apache Spark as our primary data processing platform. We partner with Databricks to manage our Spark ecosystem. They essentially provide a managed Spark backend fronted by notebooks, which makes developing Spark applications more interactive. Databricks allows you to write your code, attach a schedule to it, and boom – you have a production job. Additionally, through some of Databricks’ add-ons, you can call and run other notebooks (with parameters) from your current notebook. We use this to consolidate common logic into a single notebook and run it with different parameters. All these features combined allow us to do the complex ETL orchestration that Luigi and Airflow simplify, with none of the additional management.
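To give a concrete sense of how calling one notebook from another looks, here is a minimal sketch using Databricks’ `dbutils.notebook` utilities. The notebook path, table names, and parameter names are hypothetical:

```python
# Run a shared notebook with different parameters from the calling notebook.
# The path "./shared/deduplicate_events" and the parameters are hypothetical.
result = dbutils.notebook.run(
    "./shared/deduplicate_events",   # path to the common-logic notebook
    3600,                            # timeout in seconds
    {"source_table": "raw.clicks", "target_table": "clean.clicks"},
)

# The called notebook can hand a value back via dbutils.notebook.exit(...).
print(f"Shared notebook returned: {result}")
```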
Notebooks (specifically from Databricks) allow us to focus on the task at hand instead of some of the more tedious aspects of software development. For example, when a notebook is run in Databricks, the output of that notebook is stored forever, including all of your print statements, charts, etc. Print statements and displayed output are kept, and they become essential for debugging and logging. This provides rich, out-of-the-box logging, all handled by Databricks.
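As a small illustration, anything a scheduled notebook prints or displays becomes part of its run record. The table name below is hypothetical; `spark` and `display()` are available in Databricks notebooks:

```python
# Summarize a fact table; both the print and the display are preserved with the run.
daily_counts = (
    spark.table("warehouse.fact_clicks")   # hypothetical table name
         .groupBy("event_date")
         .count()
)

print(f"Days processed: {daily_counts.count()}")
display(daily_counts)  # rendered as a table/chart and kept in the run output
```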
Our basic data warehouse ETL workflow looks like this:
A Master notebook defines the main flow of our ETL and runs multiple instances of Slave notebooks, in this case, notebooks that create dimension and fact tables. Only the Master notebook needs to be scheduled to run in Databricks.
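Below is a minimal sketch of what such a Master notebook might look like in Python. The Slave notebook paths, widget name, and dimension names are hypothetical, and the real flow is more involved:

```python
# Master notebook: orchestrate Slave notebooks that build dimensions, then facts.
from concurrent.futures import ThreadPoolExecutor

# Parameter passed into the Master notebook via a (hypothetical) "run_date" widget.
run_date = dbutils.widgets.get("run_date")

dimensions = ["advertiser", "campaign", "job"]   # hypothetical dimension names

def build_dimension(name):
    # dbutils.notebook.run(path, timeout_seconds, arguments) runs a Slave notebook
    # and returns whatever that notebook passes to dbutils.notebook.exit().
    return dbutils.notebook.run(
        "./build_dimension",
        3600,
        {"dimension": name, "run_date": run_date},
    )

# Dimension builds are independent of each other, so run them in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(build_dimension, dimensions))

print(f"Dimension runs returned: {results}")

# Fact tables join against the dimensions, so build them only after all dimensions succeed.
dbutils.notebook.run("./build_fact", 3600, {"run_date": run_date})
```

Running the dimension notebooks from a thread pool keeps the Master notebook as the single place where ordering and parallelism are decided, while each Slave notebook stays focused on building one table.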
With this approach, there is no additional infrastructure to set up, as there would be with Luigi or Airflow. Databricks saves a copy of all the notebooks after running, including all the links between them. This allows us to revisit a past run and see details, logs, timing, and charts related to the processed data. It is also painless to move between development and production since the same assets – the notebooks – are used in both.
Notebooks are indeed a very powerful tool to add to a developer’s toolbox. Stay tuned for future posts on how we further mature our notebook-based ETL.