Machine Learning

Projects I have led or consulted on.

ClinicalBERT

I helped build one of the first language models applied at large scale to electronic health record data. I wrote and rewrote the paper to make it accessible to a broad audience, hoping to inspire more people to build on our work. It worked! The paper has been cited over 900 times, and the model parameters have been downloaded millions of times on HuggingFace. I have since helped hospitals and companies use the model to improve operations, efficiency, and the modeling of clinical and financial data.

Paper: https://arxiv.org/abs/1904.05342 (900+ citations)

One Fact Foundation

I raised $350,000+ to build a non-profit foundation. We received an initial grant for 2022-2023 from Columbia University and Stanford University, and raised $100,000+ through the New York City Five Boro Bike Tour. Using these funds, I led a team that collected prices from 4,000+ hospitals nationwide and built Payless Health. Further, for our first contract, I trained a resource allocator team that manages $1B AUM in the New York area; they used our open-source tools, such as ClinicalBERT and our hospital price data, to select their next insurance policy, saving upwards of $25M in premiums per year. We also ran an on-the-ground advertising campaign in New York City to publicize the price disparities in cesarean sections found using our data.

Websites: payless.health (hospital prices) & onefact.org (foundation)

Large language model and AI research

I have contributed to papers that:

detect pediatric fractures as accurately as commercial solutions
accurately predict molecular substructures from mass spectrometry data
rigorously evaluate pre-trained embeddings for efficacy on downstream tasks
compare pre-trained embeddings with my PhD thesis work for efficient recommendation of news articles (this work was commissioned by The Browser for use in their recommendation engine)
assess the efficacy of ClinicalBERT and other models in predicting race, ethnicity, and social determinants of health from clinical text

Technical Writing

I am proud of the hundreds of thousands of pageviews my writing on machine learning has received:

How does physics connect to machine learning? - tens of thousands of pageviews with an average read time of 10 minutes+.
Variational autoencoder tutorial - viewed hundreds of thousands of times with an average read time of 8 minutes+. This tutorial explains some of the core technology behind image generation models such as Midjourney and DALL-E. The GitHub repository I made has over 1,000 stars!