The Sell Pitch: My Unconventional Path to Becoming a Performance Obsessive
Let me take you back to Beijing in 2014: a kid from a no-name university stood in Baidu's headquarters, holding a first-place hackathon trophy presented by Robin Li himself, surrounded by Tsinghua and Peking University graduates. That was my first lesson: raw engineering instinct can outshine pedigree.
Three pivotal moments define my technical DNA:
1️⃣ The POS Machine That Couldn't
At Meituan (China's food-delivery giant), I faced an Android POS device so underpowered it choked on QR scans. When Java optimizations failed, I went nuclear: ARM assembly hand-tuning → GPU shader offloading → real-time priority threading. The result? Latency dropped from 400 ms to 40 ms. This obsession with squeezing performance from constrained systems became my trademark.
2️⃣ The Firefox Reality Miracle
Fast forward to 2020: A VR startup handed me a "hopeless" Firefox Reality build with glitching OpenGL contexts. While others debated architectural changes, I spelunked through 1.2M lines of C++ to find the single-threaded EGL binding mismatch in WebRender. Three days. One strategic mutex. Crisis averted.
3️⃣ The Architecture Whisperer
In UCR's high-performance computing course, I dissected Intel's microarchitectures like a circuit surgeon. My breakthrough came when I reimagined Gaussian elimination with partial pivoting (GEPP) through the lens of cache line dancing and speculative execution, doubling the performance of MKL's optimized routines through:
• 4D tiling that played perfectly with Haswell's L3 slice partitioning
• AVX-512 register choreography that eliminated pipeline bubbles
• Novel partial pivoting that reduced branch mispredictions by 63%
• A hybrid recursion/blocking strategy
• A cache-oblivious permutation that outsmarted hardware prefetchers
• Page-aligned memory sculpting that reduced TLB misses
This wasn't just beating BLAS - it was understanding CPU souls.
Why This Matters for Matroid
My journey through Android's underbelly, systems programming, and CPU and GPU microarchitecture gives me a unique lens for performance engineering. I see what others miss:
- How real-time scheduling principles could optimize vision pipelines
- Where tensor ops might benefit from VR-grade memory optimization tricks
- When to bypass frameworks and speak directly to the metal (hello CUDA graphs!)
The ADHD brain that got me in trouble for hacking local websites at 14 (don't worry, I have a 4.0/4.0 GPA and am very friendly and helpful) is now hyper-focused on pattern-matching system bottlenecks. I don't just write code - I converse with compute architectures.
Let's discuss how my blend of scrappy systems experience and modern ML accelerator knowledge could help Matroid push the boundaries of efficient vision inference. I bring more than skills - I bring a compulsion to make silicon sing.