WizWorks

WizWorks Technology engineering | Solutions for Artificial Intelligence

12/03/2026

1 million token context windows in AI models.

While technically impressive, larger context sizes do not necessarily translate into better reasoning.

Given the computational costs and practical implications, the future of AI systems may depend more on smarter context management (retrieval, memory, agents) than on simply increasing token limits.

05/03/2026

"The Universal Runtime Vision: Why We're Not Targeting Mobile (And What We're Building Instead)"

NeuroBrix has a long-term goal: become the standard runtime for neural network inference. Any model. Any GPU architecture. One engine.

Here's our Phase 3 vision for 2027, and one deliberately honest decision.

6+ GPU architectures. NVIDIA, AMD, Intel, Apple Silicon, ARM (Jetson, Snapdragon), and RISC-V. The Prism solver already abstracts hardware into YAML profiles, so adding a new GPU family means writing a hardware profile and validating dtype support. The execution engine stays the same.

95% Triton kernel coverage. Today, NeuroBrix uses PyTorch ATen as the default dispatch layer, with optional Triton kernels. By 2027, the vast majority of operations will have custom Triton implementations, reducing PyTorch to a weight-loading utility, not a runtime dependency.

Graph debugger. Set breakpoints inside the computation graph. Inspect intermediate tensors at any point. Step through execution op-by-op. Today, we have NBX_TRACE_ZEROS, NBX_TRACE_NAN, and NBX_NAN_GUARD as environment variables for debugging. The graph debugger turns this into a proper interactive tool.

SDK for integrations. A stable Python API for embedding NeuroBrix in other applications: web services, batch pipelines, and orchestration platforms. Import, load, execute: three calls.

150+ models. Comprehensive coverage across diffusion, LLM, multimodal, audio, and video. Every model uses the same .nbx format, runtime, and CLI.

Now, the honest part.

We are not targeting mobile.

NeuroBrix is built on Python and Triton. These don't run on phones. We will not compile to WASM, ship an iOS framework, or pretend that mobile inference is around the corner for us.

Server-side and edge GPUs (Jetson, ARM servers) are real targets. Apple Silicon Macs are a real target. Phones and browsers are not.

If you need on-device mobile inference, Core ML, TensorFlow Lite, and ONNX Runtime Mobile are better tools. We'd rather point you to the right solution than ship a bad experience.

This is a deliberate technical decision. We believe doing fewer things exceptionally well is more valuable than doing everything poorly.

If Triton gains mobile support or WebGPU matures for real inference, we'll revisit. But we don't chase hype.

Follow: github.com/NeuroBrix/neurobrix

Open source. Apache 2.0. pip install neurobrix

Universal AI Runtime: Execute any model on any hardware - NeuroBrix/neurobrix

05/03/2026

What is NeuroBrix?
NeuroBrix is a universal deep learning inference engine that allows you to run any model, in any modality, on any hardware, all through a single engine without requiring model-specific code.

Why is it a game-changer?

Unified execution
No more switching between tools like Ollama, ComfyUI, or vLLM. With NeuroBrix, you can run LLMs, image generation models (such as FLUX), audio, video, and multimodal models from a single interface via the CLI.

Universal format (.nbx)
Models are packaged into .nbx containers that include everything required for deterministic execution.

Hardware intelligence (Prism)
The engine analyses your available hardware (NVIDIA, AMD, or Intel GPUs, or Apple Silicon) and automatically selects the optimal execution strategy, including multi-GPU distribution, parallelism, or CPU offloading.

Zero assumptions
Unlike other engines, NeuroBrix does not assume whether it is processing text, images, or audio. It only operates on tensors and computational graphs, which makes the system truly universal and highly robust.

pip install neurobrix

https://github.com/NeuroBrix

04/03/2026

"NeuroBrix Roadmap: What's Coming in 2026"

NeuroBrix launched with support for diffusion models, MoE LLMs, and multimodal architectures, all running on a single universal runtime with automatic hardware allocation.

Here's what's coming next.

Phase 1 - Foundation (Q1-Q2 2026)

LoRA support is our top priority. Loading and applying LoRA adapters at runtime is critical for the community: custom styles, fine-tuned behaviors, and domain-specific models, all through the same unified engine.

Multi-hardware validation: we're testing on AMD ROCm (MI100/MI250), Apple Silicon (MPS), and Intel Arc GPUs. Today, NeuroBrix runs on NVIDIA. By mid-2026, it should run on every major GPU vendor.

Audio models: full Whisper support across all model sizes. Same engine, same .nbx container format, same CLI. Import, serve, transcribe.

Built-in profiler: a --profile flag that measures time and memory per operation. See exactly where your model spends its compute.

Community hardware profiles program: submit your GPU config as YAML, help NeuroBrix run on more hardware. We especially need AMD Instinct, Apple M-series, and consumer NVIDIA (RTX 3090/4090) profiles.

Phase 2 - Performance (Q3-Q4 2026)

Quantization: INT4 and FP8, with support for community formats (AWQ, GPTQ). Smaller models, faster inference, same correctness guarantees.

Fused Triton kernels: LayerNorm+Linear, GELU+MatMul fused into single GPU calls. Less memory bandwidth, more compute throughput.

Video models: CogVideoX and other text-to-video architectures. The iterative execution flow already supports temporal denoising loops, so the plumbing is there.

KV cache quantization: INT8/FP8 KV cache for longer context windows on limited memory. Critical for running 262K-context models like Qwen3-30B on smaller GPU setups.

Graph visualizer: an interactive web tool that shows what's actually happening inside your model at each step. Every operation, every tensor, every data flow.

Target: 50+ models in the registry by the end of Phase 1, 100+ by the end of Phase 2.

Follow the progress: https://github.com/NeuroBrix/neurobrix

Open source. Apache 2.0.

pip install neurobrix

📦 https://pypi.org/project/neurobrix/
💻 https://github.com/NeuroBrix/neurobrix

03/03/2026

"Prism: How We Automatically Distribute a 105GB Model Across 4 GPUs"

You have 4x V100-32G GPUs connected via NVLink. You want to run FLUX.2-dev, a 32B-parameter, 105GB model. How do you split it?

Most people spend hours writing custom sharding configs. Prism does it in seconds.

Prism is our automatic hardware solver. It reads the model's memory footprint directly from safetensors headers, without loading a single weight into memory, and scores 11 execution strategies to find the optimal one.
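To make the "headers only" idea concrete: a safetensors file begins with an 8-byte little-endian header length, followed by a JSON header that lists each tensor's dtype, shape, and byte offsets. The sketch below sums those offsets without reading any weight data. The function name `safetensors_footprint` is hypothetical, and this is an illustration of the file format rather than Prism's actual code.

```python
import json
import struct


def safetensors_footprint(path: str) -> int:
    """Return the total tensor bytes declared in a .safetensors header.

    Only the header is read; no weight data is loaded.
    """
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))

    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        begin, end = info["data_offsets"]
        total += end - begin
    return total
```

For a sharded checkpoint, the same function could be called once per shard and the results summed.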

Here's the scoring cascade:

single_gpu (1000) → single_gpu_lifecycle (900) → pp_nvlink (800) → tp (780) → fgp_nvlink (750) → pp_pcie (700) → fgp_pcie (650) → pp_lazy_nvlink (500) → pp_lazy_pcie (400) → lazy_sequential (300) → zero3 (100)

If the best strategy doesn't fit your hardware, Prism automatically falls through to the next viable option, with no manual intervention.
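The fall-through behaviour can be sketched as a walk down the score table. The strategy names and scores below are the ones listed above; `pick_strategy` and the `toy_fits` predicate are hypothetical stand-ins for whatever fit check Prism really performs.

```python
# Strategy names and scores come from the cascade above.
STRATEGY_SCORES = {
    "single_gpu": 1000,
    "single_gpu_lifecycle": 900,
    "pp_nvlink": 800,
    "tp": 780,
    "fgp_nvlink": 750,
    "pp_pcie": 700,
    "fgp_pcie": 650,
    "pp_lazy_nvlink": 500,
    "pp_lazy_pcie": 400,
    "lazy_sequential": 300,
    "zero3": 100,
}


def pick_strategy(fits):
    """Return the highest-scoring strategy for which fits(name) is True."""
    for name in sorted(STRATEGY_SCORES, key=STRATEGY_SCORES.get, reverse=True):
        if fits(name):
            return name
    raise RuntimeError("no viable strategy for this hardware")


# Hypothetical scenario: the model is too large for one GPU and the
# machine has PCIe only, so NVLink strategies and tp are ruled out.
def toy_fits(name):
    return (
        not name.startswith("single_gpu")
        and "nvlink" not in name
        and name != "tp"
    )


print(pick_strategy(toy_fits))  # pp_pcie
```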

What makes this different from vLLM's parallelism or DeepSpeed's sharding?

1. Heterogeneous GPU support. Mixed GPU configs (for example, 2x V100-16GB + 2x V100-32GB) work out of the box. Block-level sharding is weighted by each device's available VRAM. Nobody else does this automatically.

2. Interconnect-aware strategy selection. NVLink at 300 GB/s behaves fundamentally differently from PCIe. Prism maintains separate strategies: pp_nvlink vs pp_pcie, fgp_nvlink vs fgp_pcie. Tensor parallelism (tp) is only selected when NVLink is available.

3. Fine-grained parallelism for MoE. Mixture-of-Experts models (DeepSeek-MoE-16B, Qwen3-30B-A3B) need expert-level distribution, not just layer-level. The fgp strategies distribute experts across GPUs based on memory constraints. This is purpose-built for MoE routing.

4. Lifecycle-aware memory. Components are classified as persistent (always in VRAM) or transient (loaded on demand). For diffusion: text encoder runs → unloads → denoiser loads → runs → unloads → VAE loads → decodes. Prism calculates the peak of this sequence, not the sum.

5. KV cache memory estimation. For LLMs, Prism computes: max_tokens × num_layers × 2 × num_kv_heads × head_dim × dtype_bytes. All from metadata. No trial-and-error OOM testing.
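The memory arithmetic in items 4 and 5 is simple enough to check by hand. The sketch below uses the KV cache formula exactly as written above and models the lifecycle peak as "persistent memory plus the largest transient stage", which assumes only one transient component is resident at a time. All sizes and model shapes are illustrative and are not taken from any model card or from Prism's code.

```python
GIB = 1024 ** 3


def peak_lifecycle_bytes(persistent: int, transient_stages: list[int]) -> int:
    """Peak VRAM when transient components are loaded one at a time."""
    return persistent + max(transient_stages, default=0)


def kv_cache_bytes(max_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return max_tokens * num_layers * 2 * num_kv_heads * head_dim * dtype_bytes


# Illustrative diffusion stages: text encoder, denoiser, VAE.
stages = [9 * GIB, 60 * GIB, 1 * GIB]
print(peak_lifecycle_bytes(0, stages) // GIB)  # 60, not the 70 GiB sum

# Illustrative LLM shape at a 262,144-token context.
fp16 = kv_cache_bytes(262_144, 48, 4, 128, 2)
int8 = kv_cache_bytes(262_144, 48, 4, 128, 1)
print(fp16 / GIB, int8 / GIB)  # 24.0 12.0
```

The second pair of numbers also shows why an 8-bit KV cache matters for long contexts: halving dtype_bytes halves the cache.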

One flag. That's all:

neurobrix serve --model flux2-dev --hardware c4140-4xv100-custom-nvlink

Open source. Apache 2.0.

pip install neurobrix
💻 https://github.com/NeuroBrix/neurobrix
🌐 https://neurobrix.es/models

Hocine Benkelaya

02/03/2026

"The Numerical Stability Bugs Nobody Tells You About (And How We Fixed Them)"

When we built the NeuroBrix DtypeEngine, we discovered stability issues that most inference engines silently ignore. Here are three that cost us weeks to track down.

1. bmm must run in FP32 (not FP16)

PyTorch's AMP classifies batched matrix multiplication (bmm) as an FP16 operation, the same category as standard matmul and conv2d. That makes sense for most architectures.

Except T5-XXL. T5's cross-attention produces intermediate values that overflow the fp16 range when batched. The result: subtle quality degradation that's invisible unless you diff against the fp32 reference output.

We intentionally deviate from PyTorch's AMP rules and classify bmm as FP32. This one decision fixed all T5-family text encoders, which power FLUX, PixArt-Sigma, and every model using T5 conditioning.
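A toy reproduction of the overflow, using only the standard library: Python's `struct` format `"e"` is IEEE 754 half precision, so it can stand in for fp16 rounding. This is a simplified model in which every product and partial sum is rounded to fp16 (real GPU kernels may accumulate in higher precision), and Python's float64 stands in for fp32. The helper names are ours, not NeuroBrix's.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest half-precision value; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def dot(a, b, cast):
    """Dot product where every product and partial sum passes through cast."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = cast(acc + cast(x * y))
    return acc


a = [200.0] * 4
b = [100.0] * 4  # each product is 20000; the true sum is 80000

print(dot(a, b, to_fp16))      # inf: 80000 exceeds the fp16 max of 65504
print(dot(a, b, lambda v: v))  # 80000.0 in higher precision
```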

2. RoPE needs FP32 complex arithmetic

Rotary Position Embeddings (RoPE) use polar() and view_as_complex() to compute rotations. In fp16, the complex number precision collapses at long sequences: positions beyond ~4K tokens start blurring together.

We force polar and view_as_complex to FP32. This is why DeepSeek-MoE-16B and Qwen3-30B-A3B run correctly through NeuroBrix on 262K context: the position encoding stays precise end-to-end.
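One plausible mechanism behind the blurring, shown with a standard-library fp16 emulation: half precision has an 11-bit significand, so between 4096 and 8192 it can only represent multiples of 4, and neighbouring positions round to the same value before any rotation is computed. This is our illustration of the effect, not a trace of the exact failure described above; `rotation` is a hypothetical single-frequency stand-in for polar().

```python
import cmath
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest half-precision value; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def rotation(pos, inv_freq, cast):
    """Unit rotation for one RoPE frequency; cast models the angle's dtype."""
    angle = cast(cast(float(pos)) * cast(inv_freq))
    return cmath.rect(1.0, angle)


# Above 4096, fp16 can only represent every fourth integer.
print([to_fp16(float(p)) for p in (4096, 4097, 4099, 4100)])
# [4096.0, 4096.0, 4100.0, 4100.0]

same_fp16 = rotation(4096, 1.0, to_fp16) == rotation(4097, 1.0, to_fp16)
same_hi = rotation(4096, 1.0, float) == rotation(4097, 1.0, float)
print(same_fp16, same_hi)  # True False
```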

3. Residual overflow in deep networks

In a 32-block transformer, the residual stream accumulates through repeated add operations. In fp16, values can silently exceed ±65504 (the fp16 max), producing Inf, which can turn into NaN in a later operation (for example Inf - Inf). One NaN propagates through the entire forward pass.

Our DtypeEngine applies post-computation overflow clamping on add and sub operations in fp16. Values are clamped to the representable range before they can produce Inf. This stops the NaN cascade before it starts.
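A minimal sketch of post-computation clamping, again emulating fp16 with the standard library. The per-block output of 3000.0 is an arbitrary illustrative value, and `add_fp16` and `residual_stream` are hypothetical helpers rather than DtypeEngine code.

```python
import math
import struct

FP16_MAX = 65504.0


def to_fp16(x: float) -> float:
    """Round x to the nearest half-precision value; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def add_fp16(a: float, b: float, clamp: bool) -> float:
    """fp16 add; with clamp=True the result is saturated to the finite range."""
    result = to_fp16(a + b)
    if clamp:
        result = max(-FP16_MAX, min(FP16_MAX, result))
    return result


def residual_stream(num_blocks: int, block_output: float, clamp: bool) -> float:
    """Accumulate the same block output num_blocks times, as a residual add."""
    x = 0.0
    for _ in range(num_blocks):
        x = add_fp16(x, block_output, clamp)
    return x


print(residual_stream(32, 3000.0, clamp=False))  # inf
print(residual_stream(32, 3000.0, clamp=True))   # 65504.0
```

Saturation changes the numerical result, so it trades a hard failure for a bounded error; whether that is acceptable depends on how often the clamp actually fires.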

These aren't theoretical problems. They're bugs we hit running real models on V100 hardware. Every inference engine that runs in mixed precision will encounter them; most just don't surface them.

We built NeuroBrix's DtypeEngine to encode these rules explicitly, per-operation, with no silent defaults. The full AMP rule table: FP32 for pow/rsqrt/softmax/layer_norm/bmm/polar, FP16 for mm/conv2d, promote-to-highest for add/mul/cat.
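The rule table above maps naturally onto a lookup with no fallback branch. The op lists are taken from the sentence above; the function name, the string dtype labels, and the choice to raise on unknown ops are our assumptions about what "no silent defaults" could look like.

```python
# Op classification as listed in the post.
FP32_OPS = {"pow", "rsqrt", "softmax", "layer_norm", "bmm", "polar"}
FP16_OPS = {"mm", "conv2d"}
PROMOTE_OPS = {"add", "mul", "cat"}

# Higher rank means a wider dtype.
_RANK = {"fp16": 0, "fp32": 1}


def compute_dtype(op: str, input_dtypes: list[str]) -> str:
    """Return the dtype an op should run in under the rule table."""
    if op in FP32_OPS:
        return "fp32"
    if op in FP16_OPS:
        return "fp16"
    if op in PROMOTE_OPS:
        # Promote to the widest dtype among the inputs.
        return max(input_dtypes, key=_RANK.__getitem__)
    # No silent default: an unlisted op is an error, not a guess.
    raise KeyError(f"no explicit dtype rule for op {op!r}")


print(compute_dtype("bmm", ["fp16", "fp16"]))  # fp32
print(compute_dtype("add", ["fp16", "fp32"]))  # fp32
```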

Open source. Apache 2.0.

pip install neurobrix
💻 https://github.com/NeuroBrix/neurobrix

Hocine Benkelaya
Vladimir WizWorks

27/02/2026

The future of AI is already here!

Introducing NeuroBrix, the native AI execution engine developed by WizWorks OÜ.

Designed to be lightweight, modular, and ultra-efficient, NeuroBrix lets you orchestrate complex language models with full data sovereignty.

The system is optimized for building the next generation of intelligent agents.

Visit us at: https://neurobrix.es



https://github.com/NeuroBrix/neurobrix


14/07/2025

Log in and launch your first campaign today

07/07/2025

Is your message being read?

With open rates of up to 98%, SMS remains one of the most effective channels for reaching your customers. While emails get lost in spam and social networks fight for attention, SMS lands directly in the pocket and is read in under 3 minutes.

Our platform lets you harness that power with easy-to-launch campaigns, real-time statistics, and support in French.
Connect, build loyalty, and get results. It's as simple as that.

07/07/2025

Is your message getting read?

With open rates of up to 98%, SMS remains one of the most effective ways to reach your customers. While emails get lost in spam folders and social media fights for attention, SMS goes straight to the pocket and is read in under 3 minutes.

Our platform helps you tap into that power with easy-to-launch campaigns, real-time analytics, and support in Spanish.
Connect, build loyalty, and drive results. It's that simple.

Address

Harju Maakond, Kesklinna Linnaosa, Tornimäe Tn 5
Tallinn
10145

Telephone

+34600778153

Website

https://nordsms.es/, https://wizworks.io/, https://artune.ai/
