Improving Bash Generation in Small Language Models with Grammar-Constrained Decoding
Bash is one of the most flexible and powerful interfaces exposed to AI agents. In the right system, a model that emits `grep`, `curl`, `tar`, or a shell pipeline is producing an executable action that can read files, mutate a workspace, open network connections, and chain tools together. For the NVIDIA AI Red Team, this makes command generation a useful research target. If smaller language models can be guided into valid, policy-aware command structures, they become more reliable components for agentic workflows that can be deployed into a wider range of environments.

Constrained decoding is a technique that modifies the sampling process in autoregressive language model generation. At each generation step, the model produces logits as normal, but before a token is selected, a grammar is applied to change the distribution (often by effectively blocking certain tokens). PICARD used this technique to improve SQL generation, for example. The AI Red Team applied the same concept to Bash to improve the ability of small models to successfully achieve command-line tasks.

This post describes an experimental pipeline for generating Bash command grammars and applying them during decoding. We ran 13 small language models against 299 tasks and improved the average pass rate from 62.5% to 75.2%. The strongest result was on Qwen3-0.6B, where the pass rate increased from 16.7% to 59.2%.

Why Bash

Agentic systems increasingly use language models to generate code and commands that are executed by tools, shells, notebooks, build systems, and CI jobs. The security challenge isn't only whether the model "understands" a task. It is whether the model can generate a syntactically valid action, scoped to the intended environment, and constrained away from unsafe forms. Bash is a compact example of that problem:

- Syntax errors are unforgiving, and risk scales with task complexity.
- A valid command can still be operationally dangerous, such as a network command without a timeout or a destructive command with an overbroad path.
- Shell composition multiplies the state space. Pipes, redirects, command substitution, heredocs, loops, and conditionals all change what the model must emit and how a grammar would be applied.
- Small models often know the root binary to call but fail on exact syntax, argument order, quoting, control operators, or termination.
- Bash's expressiveness and power might make it the only tool an agent needs, if the model can wield it reliably.

The core research question was: Can constrained decoding improve small-model Bash command reliability enough to make them useful for agentic workflows?

Generating grammars

Handwriting a grammar for every command is brittle. Bash commands have many flags, aliases, optional values, positional arguments, and syntax variations. Instead, grammargen turns structured command evidence into Lark grammars. The intermediate representation captures the pieces needed for constrained decoding, like:

- Command names and aliases.
- Boolean short flags and long flags.
- Valued flags, such as `-A 3` or `--max-count=10`.
- Positional arguments such as paths, patterns, words, and integers.
- Bounded repetition to keep the decoding state finite.

For example, a generated…
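As a purely illustrative sketch (not grammargen's actual output), an intermediate representation covering those pieces for a simplified `grep` might compile to a Lark grammar along these lines:

```lark
// Hypothetical sketch only: a simplified grep grammar in Lark EBNF.
// The flag names are real grep flags; the structure is an illustration.
start: "grep" (WS flag)~0..3 WS pattern WS path

flag: "-i"
    | "-n"
    | "-v"
    | "-A" WS INT              // valued short flag
    | "--max-count=" INT       // valued long flag

pattern: ESCAPED_STRING
path: /[A-Za-z0-9._\/-]+/      // bounded character class

%import common.ESCAPED_STRING
%import common.INT
%import common.WS
```

Note the `~0..3` bounded repetition on flags: an unbounded `flag*` would be legal Lark, but capping repetition is one way to keep the decoding state finite, as the list above calls out.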
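To make the masking step concrete, here is a minimal, self-contained sketch of grammar-constrained greedy decoding. Everything in it is hypothetical: the toy vocabulary, the `VALID` sequence table standing in for a compiled grammar, and the `fake_logits` function standing in for a real model's forward pass.

```python
# Minimal sketch of grammar-constrained decoding (hypothetical components).
import math
import random

VOCAB = ["grep", "-i", "-n", "pattern", "file.txt", "<eos>"]

# Toy "grammar": the set of token sequences a compiled grammar would accept.
# A real system would walk a parser state machine instead of a lookup table.
VALID = [
    ["grep", "pattern", "file.txt", "<eos>"],
    ["grep", "-i", "pattern", "file.txt", "<eos>"],
    ["grep", "-n", "pattern", "file.txt", "<eos>"],
    ["grep", "-i", "-n", "pattern", "file.txt", "<eos>"],
]

def allowed_next(prefix):
    """Return the tokens the grammar permits after the current prefix."""
    out = set()
    for seq in VALID:
        if seq[:len(prefix)] == prefix and len(seq) > len(prefix):
            out.add(seq[len(prefix)])
    return out

def fake_logits(prefix):
    """Stand-in for model logits: deterministic pseudo-random scores."""
    rng = random.Random(len(prefix))
    return {tok: rng.uniform(-1.0, 1.0) for tok in VOCAB}

def constrained_greedy_decode():
    prefix = []
    while True:
        logits = fake_logits(prefix)
        legal = allowed_next(prefix)
        # The key step: send every grammar-illegal token to -inf
        # before selection, so only valid continuations can be chosen.
        masked = {t: (s if t in legal else -math.inf) for t, s in logits.items()}
        tok = max(masked, key=masked.get)
        if tok == "<eos>":
            return prefix
        prefix.append(tok)

cmd = constrained_greedy_decode()
print(" ".join(cmd))
```

Whatever scores the model assigns, the masking guarantees the decoded command is one of the grammar's valid sequences; the logits only choose among legal continuations.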

