
Staff Machine Learning Engineer - Computer Vision & Multi-Modal AI

Unity · San Francisco, CA, USA · AI · Permanent position

Period · Immediate start - Open-ended

Salary · On request

Description

The opportunity

We are building the next generation of AI-driven game experiences — generative world models, neural rendering, and multi-modal understanding that turn images, text, and 3D primitives into interactive worlds. As our Staff Machine Learning Engineer, you will be a core technical leader bringing state-of-the-art computer vision and multi-modal models — transformers, diffusion networks, vision-language models (VLMs), and JEPA-style architectures — from research into robust, production-grade systems.

This is a deeply hands-on, high-impact role. You will help define the modeling and deployment strategy, drive architectural decisions across the ML stack, and mentor a team of senior and mid-level engineers. Your work will directly shape the quality, capability, and performance of AI features experienced by billions of players — across cloud, server, and on-device targets.

What you'll be doing

Technical Leadership

  • Help set the technical vision and roadmap for computer vision and multi-modal AI models, spanning transformers, diffusion models, vision-language models, and JEPA-style predictive architectures.
  • Drive design and implementation of models for image and video understanding, generation, segmentation, detection, and dense prediction, as well as multi-modal reasoning over images, text, and 3D inputs.
  • Make sound decisions on model architecture, training strategy, data pipelines, and evaluation — balancing quality, capability, latency, and cost across deployment targets.
  • Own the path from research prototype to production: training, fine-tuning, distillation, export, and serving, with deployment spanning cloud GPUs through to efficient on-device inference where the product requires it.

Architecture & Research Translation

  • Collaborate directly with research scientists to translate novel CV and multi-modal model architectures into deployable, well-engineered implementations.
  • Design scalable systems for multi-modal inference that process diverse inputs (images, video, text, primitives, and metadata) and produce rich outputs, from semantic predictions to pixel-level generation.
  • Track and rapidly adopt breakthroughs across the field: vision-language pretraining and alignment, efficient diffusion (e.g., consistency models, flow matching), efficient attention (e.g., FlashAttention, linear-attention variants), and tokenization/representation learning for vision.
  • Where latency or device constraints demand it, apply compression, quantization, pruning, and knowledge distillation, and work with appropriate runtimes (e.g., TensorRT, ONNX Runtime, Core ML, TFLite) to meet performance budgets.

Team & Cross-Functional Leadership

  • Lead and mentor a team of ML engineers; define engineering best practices, code review standards, and rigorous benchmarking and evaluation methodology.
  • Partner with research, platform engineers, product managers, and runtime teams to align ML capabilities with product roadmaps and target-platform constraints.
  • Champion a culture of measurement: define KPIs for model quality, accuracy, latency, memory, and cost, and ensure the team tracks them rigorously.

What we're looking for

  • 6+ years in ML engineering, with significant depth in computer vision and/or multi-modal modeling.
  • Proven production experience with transformer-based and diffusion-based vision models (e.g., ViT, CLIP/SigLIP-style encoders, Stable Diffusion, DETR/SAM-style architectures).
  • Strong command of the full model lifecycle: data curation, training and fine-tuning, evaluation, and serving at scale.
  • Familiarity with efficient attention, diffusion samplers, multi-modal fusion, and vision-language alignment techniques.
  • Strong Python and modern deep-learning tooling (PyTorch); solid software engineering fundamentals.
  • Track record of technical leadership: setting direction, influencing cross-functional partners, and growing engineers.

You might also have

  • Experience with world-model, video-generation, or neural rendering pipelines (NeRF, 3DGS, or similar).
  • Experience deploying models to constrained or on-device targets, including quantization (INT8/INT4/FP16), pruning, distillation, and runtimes such as Core ML, TFLite, and ONNX Runtime.
  • Familiarity with mobile SoC accelerators (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali) or compiler stacks such as MLIR, TVM, or XLA.
  • Contributions to open-source ML frameworks or peer-reviewed CV/ML research publications.
  • Background in real-time graphics or game engine pipelines (Metal, Vulkan, OpenGL ES).