The Sell Pitch: My Unconventional Path to Becoming a Performance Obsessive

Let me take you back to Beijing in 2014: a kid from a no-name university stood in Baidu's headquarters, holding a first-place hackathon trophy presented by Robin Li himself, surrounded by Tsinghua and Peking University graduates. That was my first lesson: raw engineering instinct can outshine pedigree.

Three pivotal moments define my technical DNA:

1️⃣ The POS Machine That Couldn't
At Meituan (China's leading food-delivery and local-services platform), I faced an Android POS device so underpowered it choked on QR scans. When Java optimizations failed, I went nuclear: ARM assembly hand-tuning → GPU shader offloading → real-time priority threading. The result? Latency dropped from 400 ms to 40 ms. This obsession with squeezing performance from constrained systems became my trademark.

2️⃣ The Firefox Reality Miracle
Fast-forward to 2020: a VR startup handed me a "hopeless" Firefox Reality build with glitching OpenGL contexts. While others debated architectural changes, I spelunked through 1.2M lines of C++ to find the single-threaded EGL binding mismatch in WebRender. Three days. One strategic mutex. Crisis averted.

3️⃣ The Architecture Whisperer
In UCR's high-performance computing course, I dissected Intel's microarchitectures like a circuit surgeon. My breakthrough came when I reimagined Gaussian elimination with partial pivoting (GEPP) through the lens of cache line dancing and speculative execution, roughly doubling the performance of MKL's optimized routines through:
• 4D tiling that played perfectly with the CPU's L3 slice partitioning
• AVX-512 register choreography that eliminated pipeline bubbles
• Novel partial pivoting that reduced branch mispredictions by 63%
• A hybrid recursion/blocking strategy
• A cache-oblivious permutation that outsmarted hardware prefetchers
• Page-aligned memory sculpting that reduced TLB misses

This wasn't just beating BLAS; it was understanding CPU souls.

Why This Matters for Matroid
My journey through Android's underbelly, system programming, and GPU microarchitecture gives me a unique lens for performance engineering. I see what others miss:

• How real-time scheduling principles could optimize vision pipelines
• Where tensor ops might benefit from VR-grade memory optimization tricks
• When to bypass frameworks and speak directly to the metal (hello, CUDA graphs!)

The ADHD brain that got me in trouble for hacking local websites at 14 (don't worry: I have a 4.0/4.0 GPA and am very friendly and helpful) is now hyper-focused on pattern-matching system bottlenecks. I don't just write code; I converse with compute architectures.

Let's discuss how my blend of scrappy systems experience and modern ML accelerator knowledge could help Matroid push the boundaries of efficient vision inference. I bring more than skills: I bring a compulsion to make silicon sing.