I am a Lead Solutions Architect at Databricks, where I've spent the last five years advising customers ranging from startups to Fortune 500 enterprises. I also help lead a team of field ambassadors for streaming products, and I'm interested in improving industry awareness of effective streaming patterns for data integration and production machine learning. I previously worked as a software engineer on networking automation.

Here are some of my perspectives on data architecture, data processing, and data science:

Real-Time Machine Learning

Real-Time Machine Learning
Presented at ODSC Europe 2023

General overview of the tradeoffs involved in designing a production-grade ML training/scoring system with event streaming. Stresses the importance of skills at the intersection of data engineering, streaming, and data science.

Text to Insights
Presented at Generative AI Conference 2023

Existing state-of-the-art techniques for, and limitations of, text-to-SQL with metadata augmentation, plus directions for ensuring data quality and freshness with data engineering and streaming techniques.

Real-Time Embedding Clustering
Published in ODSC Blog

Reference architecture and code for solving common outlier detection problems like fraud detection using embeddings. Some considerations for performing vector operations to analyze tabular data in a Spark Structured Stream.

Data Architecture and Use Cases

To Improve Data Availability, Think Right-Time (Datanami)
Guest article published on Datanami.com

Clears up the hype by describing why event-driven data processing is important regardless of the old debate around "real-time" and "near real-time." I also go into detail on use case prioritization and how to think about data source characteristics.

Data Ingestion, Fast and Slow
Presented at Data & AI Summit 2023

Architectures that can move between batch and incremental processing without changing the storage and API allow us to solve common data trust problems, such as stale data, as well as production AI/ML risks, such as concept drift.

Delta Live Tables (E-book)
E-book, 2021

End-to-end overview of Delta Live Tables core concepts.

Data Architecture

How to Perform ETL on 1 Billion EDW Records for Under $1
Published in Databricks Blog

We present record-breaking results for the TPC-DI benchmark and the performance techniques used to achieve them.

Real-Time Data Warehousing
Presented at Data & AI Summit, 2022

Performance optimization techniques for processing data for a relational data warehouse. I go over the tradeoffs between cost, latency, and accuracy that all data/AI problems involve.

SmartSQL Queries using Delta Engine
Data Lab Podcast, 2020

Basic performance optimization and design patterns for data processing and ML training.

Other

Course: Data and AI from First Principles
Free course curriculum

Course I created and teach at Databricks

Databricks Infrastructure Automation
Published in Databricks Blog, 2019

Built a tool that automated cloud deployments; introduced Terraform to customers when infrastructure as code (IaC) was still an early concept.

async-recurse (JS library)
Open Source contribution

Asynchronously traverse a tree or graph in JavaScript with Promises