Research, writing, and projects at the intersection of machine learning, systems, and science.
We present a sparse mixture-of-experts architecture that achieves sub-quadratic inference cost on sequences exceeding 128k tokens. By introducing a locality-sensitive routing mechanism that exploits the low-rank structure of attention patterns, our method reduces peak memory by 3.8x while maintaining 98.2% of dense model quality across standard long-context benchmarks. We provide theoretical guarantees on routing stability and demonstrate wall-clock speedups on commodity hardware.
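The core routing idea can be sketched in a few lines: if tokens whose hidden states are nearby should share an expert, a sign-pattern LSH over random hyperplanes gives a routing decision whose cost is linear in sequence length. This is an illustrative sketch only — the hyperplane count, bucket-to-expert mapping, and all names here are assumptions, not the paper's actual mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts = 64, 8
n_bits = 3  # 2**3 = 8 buckets, one per expert (illustrative choice)

# Random hyperplanes define the LSH buckets: tokens on the same side of
# every hyperplane land in the same bucket and hence the same expert.
planes = rng.standard_normal((n_bits, d_model))

def route(tokens: np.ndarray) -> np.ndarray:
    """Map each token vector to an expert id via sign-pattern hashing."""
    bits = (tokens @ planes.T) > 0           # (seq, n_bits) sign pattern
    return bits @ (1 << np.arange(n_bits))   # pack bits into a bucket id

tokens = rng.standard_normal((16, d_model))
experts = route(tokens)
assert experts.shape == (16,) and experts.max() < n_experts
```

Because the hash depends only on the sign of each projection, positively scaled copies of a token vector route identically — a cheap form of the locality the abstract relies on.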
We develop a fully differentiable spectral Navier-Stokes solver that enables end-to-end training of neural surrogate models for turbulent flows. Our approach embeds physical conservation laws directly into the computational graph, allowing gradient-based optimization to respect divergence-free constraints without projection steps. On the Kolmogorov flow benchmark, the resulting surrogates achieve 12x speedup over classical solvers at Reynolds numbers up to 10,000 with bounded error accumulation over 500 rollout steps.
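One way to see how a conservation law can be "embedded in the computational graph" rather than enforced by projection: in 2-D, deriving the velocity from a streamfunction makes the field divergence-free by construction, so every gradient step automatically respects the constraint. The sketch below demonstrates this identity spectrally with NumPy; it is a minimal illustration of the principle, not the paper's solver, and the resolution and field here are arbitrary.

```python
import numpy as np

n = 64
k = np.fft.fftfreq(n, d=1.0 / n)          # integer wavenumbers
kx, ky = np.meshgrid(k, k, indexing="ij")

rng = np.random.default_rng(0)
psi_hat = np.fft.fft2(rng.standard_normal((n, n)))  # random streamfunction

# u = d(psi)/dy, v = -d(psi)/dx, taken as spectral derivatives.
u = np.real(np.fft.ifft2(1j * ky * psi_hat))
v = np.real(np.fft.ifft2(-1j * kx * psi_hat))

# div = du/dx + dv/dy = psi_xy - psi_xy = 0, up to round-off: the
# divergence-free constraint holds identically, with no projection step.
div = np.real(np.fft.ifft2(1j * kx * np.fft.fft2(u) + 1j * ky * np.fft.fft2(v)))
assert np.abs(div).max() < 1e-8
```

Since every operation above is differentiable, the same parameterization composes directly with gradient-based training of a neural surrogate.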
We propose a zero-copy distributed key-value cache architecture for serving large language models across disaggregated GPU clusters. By leveraging RDMA-based memory transfers and a novel page-table abstraction for attention state, our system eliminates serialization overhead during prefill-decode handoffs. Evaluations on a 64-GPU cluster show 2.1x improvement in time-to-first-token and 40% higher throughput compared to state-of-the-art serving frameworks under production trace workloads.
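The page-table abstraction can be illustrated with a toy allocator: KV state lives in fixed-size physical pages, a per-request table maps logical blocks to physical page ids, and a prefill-to-decode handoff transfers only that small id list rather than the KV bytes themselves. All class and method names below are invented for illustration; the real system's interfaces, page size, and RDMA transport are not shown here.

```python
class PagedKVCache:
    """Toy paged cache: attention state is addressed via a page table."""

    PAGE_TOKENS = 16  # tokens per physical page (illustrative size)

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))  # physical page allocator
        self.tables = {}    # request id -> list of physical page ids
        self.lengths = {}   # request id -> tokens stored

    def append(self, req: str, n_tokens: int) -> None:
        """Grow a request's cache, allocating physical pages on demand."""
        table = self.tables.setdefault(req, [])
        self.lengths[req] = self.lengths.get(req, 0) + n_tokens
        while len(table) * self.PAGE_TOKENS < self.lengths[req]:
            table.append(self.free.pop())

    def handoff(self, req: str) -> dict:
        """Prefill -> decode handoff: pass page ids only, no KV copy."""
        return {"pages": self.tables[req], "tokens": self.lengths[req]}

cache = PagedKVCache(num_pages=64)
cache.append("req-0", 40)            # prefill 40 tokens -> 3 pages
state = cache.handoff("req-0")
assert len(state["pages"]) == 3 and state["tokens"] == 40
```

In the disaggregated setting, the decode worker would map those physical pages over RDMA; because the handoff payload is just page metadata, no serialization of attention state is needed.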