Real-Time Machine Learning
Presented at ODSC Europe 2023

General overview of the tradeoffs involved in designing a production-grade ML training/scoring system with event streaming. Stresses the importance of skills at the intersection of data engineering, streaming, and data science.

Text to Insights
Presented at Generative AI Conference 2023

Existing SOTA techniques for text2sql with metadata augmentation and their limitations, plus directions for ensuring data quality and freshness with data engineering and streaming techniques.

Real-Time Embedding Clustering
Published in ODSC Blog

Reference architecture and code for solving common outlier detection problems, such as fraud detection, using embeddings. Includes considerations for performing vector operations to analyze tabular data in a Spark Structured Stream.

To Improve Data Availability, Think Right-Time (Datanami)
Guest article published on Datanami.com

Clears up the hype by describing why event-driven data processing is important regardless of the old debate around "real-time" and "near real-time." I also go into detail on use case prioritization and how to think about data source characteristics.

Data Ingestion, Fast and Slow
Presented at Data & AI Summit 2023

Architectures that can move between batch and incremental processing without changing the storage and API allow us to solve common data trust problems, such as stale data, as well as production AI/ML risks, such as concept drift.

Delta Live Tables (E-book)
E-book, 2021

End-to-end overview of Delta Live Tables core concepts.

How to Perform ETL on 1 Billion EDW Records for Under $1
Published in Databricks Blog

We present record-breaking results for the TPC-DI benchmark and the performance techniques used to achieve them.

Real-Time Data Warehousing
Presented at Data & AI Summit, 2022

Performance optimization techniques for processing data for a relational data warehouse. I go over the tradeoffs between cost, latency, and accuracy that all data/AI problems involve.

SmartSQL Queries using Delta Engine
Data Lab Podcast, 2020

Basic performance optimization and design patterns for data processing and ML training.

Other

Course: Data and AI from First Principles
Free course curriculum

Course I created and teach at Databricks

Databricks Infrastructure Automation
Published in Databricks Blog, 2019

Built a tool that automated cloud deployments; introduced Terraform to customers when IaC was still an early concept.

async-recurse (JS library)
Open-source contribution

Asynchronously traverse a tree or graph in JavaScript with Promises