Replicating Bash Argument Splitting

Sometimes it really is just a Small Matter of Programming

December 28, 2021

A few weeks ago I was writing a macro loader for my JSON Processor, and I ran into an odd case where code I entered in the terminal was splitting differently to code I loaded from a file. It turns out that Bash handles quotes when it parses the arguments of a command, but not when it splits the result of a variable expansion.

When processing args Bash will not split quoted words that contain a delimiter:

printf "%s\n" foo 'bar baz'
foo
bar baz

But if those args are stored in a string, quoting the expansion yields a single word with the inner quotes left in place:

input='foo "bar baz"'
printf "%s\n" "$input"
foo "bar baz"

Or unquoted, where Bash splits on every space and treats the quote characters as ordinary text:

printf "%s\n" $input
foo
"bar
baz"
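Counting the arguments a command actually receives makes the three cases easy to compare. This is only a sketch: count_args is a hypothetical helper introduced for illustration, not part of the original code.

```bash
# count_args (hypothetical helper) prints how many arguments it received.
count_args () { printf '%s\n' "$#"; }

input='foo "bar baz"'

count_args foo 'bar baz'   # quotes handled by the parser: 2 arguments
count_args "$input"        # quoted expansion: 1 argument
count_args $input          # unquoted expansion, split on IFS only: 3 arguments
```

The unquoted expansion never sees the double quotes as quotes, which is why it produces three words rather than two.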

What I really want is to split the variable into an array of words. Then printf works as I want it to:

input=(foo "bar baz")
printf "%s\n" "${input[@]}"
foo
bar baz

I can’t use read -a because it splits the input on IFS characters and ignores quotes. And handing the string to another program (awk, xargs) means launching a new process, which is too slow for my use case.
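A quick check of the read -a behaviour, as a minimal sketch (the variable name words is just for illustration):

```bash
input='foo "bar baz"'

# read -a splits on IFS characters; the double quotes are ordinary text to it.
read -r -a words <<< "$input"

printf '%s\n' "${#words[@]}"   # number of elements
printf '%s\n' "${words[@]}"    # one element per line
```

This gives three elements, with the quote characters still attached to the second and third, the same result as the unquoted expansion.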

Bash’s quote handling for arguments is hardcoded in its parser, so modifying IFS has no effect on it. That is also a hint for the first solution, which is to send the string back through the parser with eval:

input='foo "bar baz"'
eval "printf \"%s\n\" $input"
foo
bar baz

But using eval invites a whole set of complications I’d rather avoid.
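One such complication, sketched with a deliberately harmless payload: eval re-parses the whole string as shell code, so anything after a semicolon runs as a command. The variable marker and the input value are made up for this illustration.

```bash
marker=unset

# Assumed hostile input: the trailing assignment stands in for any command.
input='foo "bar baz"; marker=changed'

eval "printf '%s\n' $input"
printf 'marker is now: %s\n' "$marker"
```

The printf still prints foo and bar baz, but the assignment hidden in the input has also been executed. With untrusted input that could be any command at all.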

Bash Quoting

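Bash splits arguments on unescaped spaces, horizontal tabs and newlines. So to determine when to split a word we need to understand Bash’s escape rules. There are four(ish) kinds:

  1. Backslash: escapes the single character that follows it.
  2. Single quotes: everything up to the closing quote is literal, with no escapes at all.
  3. Double quotes: mostly literal, except that a backslash can still escape a handful of characters.
  4. ANSI-C quoting ($'...'): like single quotes, but backslash sequences such as \n are decoded.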
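A quick way to see these side by side: the sketch below writes the same word using a backslash escape, single quotes, double quotes and ANSI-C quoting. The array name kinds is only for illustration.

```bash
kinds=(
  bar\ baz        # backslash escape
  'bar baz'       # single quotes
  "bar baz"       # double quotes
  $'bar\x20baz'   # ANSI-C quoting: \x20 is decoded to a space
)

printf '%s\n' "${#kinds[@]}"    # element count
printf '[%s]\n' "${kinds[@]}"   # each element in brackets
```

Each form produces one array element containing a space, so the array holds four identical words.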

Splitting words like arguments

Decoding and interpreting ANSI-C escape sequences seems difficult, and I don’t need to support it for my use case. To handle the other three kinds, I need code that scans a string, applies each set of escape rules, and keeps track of which kind of quoting is currently in effect.

Here’s what I came up with:

wordsplit () {
  WORDS=()
  WORDC=0
  WORDERR=
  OPTIND=1
  local quo= word= esc=
  while getopts ":" opt "-$1";do
    if [ -z "$quo" ];then
      if [ "$OPTARG" = '\' ]&&[ -z "$esc" ];then
        esc=1
        continue
      elif [[ "$OPTARG" == [$' \t\n'] ]]&&[ -z "$esc" ];then
        if [ -n "$word" ];then
          WORDS+=("$word")
          word=
          (( WORDC++ ))
        fi
        continue
      elif { [ "$OPTARG" = "'" ]||[ "$OPTARG" = '"' ]; }&&[ -z "$esc" ];then
        quo="$OPTARG"
        continue
      fi
    elif [ "$quo" = '"' ];then
      if [ -n "$esc" ];then
        ! [[ "$OPTARG" == [$'$\\`"\n'] ]] && word+='\'
      elif [ "$OPTARG" = '\' ];then
        esc=1
        continue
      elif [ "$OPTARG" = '"' ];then
        quo=
        continue
      fi
    elif [ "$OPTARG" = "$quo" ];then # single quote term
      quo=
      continue
    fi
    word+="$OPTARG"
    esc=
  done
  if [ -n "$quo" ];then
    WORDERR="found unterminated string"
    return 1
  elif [ -n "$word" ];then
    WORDS+=("$word")
    (( WORDC++ ))
  fi
  return 0
}

This function accepts a single string argument which it splits into the WORDS array (Bash functions can only return an exit status, so the result goes in global variables). It inspects the string one character at a time using the getopts string split trick. I like this trick because it’s fast and avoids maintaining an index and taking substrings.
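The getopts trick on its own, as a minimal sketch. each_char is a hypothetical helper written for this illustration; it relies on getopts with the optstring ":" rejecting every character as an invalid option and, because of the leading colon, storing the rejected character in OPTARG.

```bash
# each_char (hypothetical helper) prints every character of $1 on its own line.
each_char () {
  local OPTIND=1 OPTARG opt
  # The "-" prefix makes the string look like one bundle of short options.
  while getopts ":" opt "-$1"; do
    printf '%s\n' "$OPTARG"
  done
}

each_char 'a b"c'
```

This prints five lines: a, a space, b, a double quote and c. Making OPTIND local keeps the helper from disturbing any getopts loop in the caller.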

The first conditional block, which tests whether quo is empty, handles unquoted text; all word splitting happens outside of quotes, so this is the longest block.

The second top-level block [ "$quo" = '"' ] handles escapes inside double-quoted strings. One complication here is that only five characters can be escaped (dollar sign, backtick, double quote, backslash and newline). If we’re in escape mode and the current character is not one of those, the code appends the backslash that was skipped in the previous iteration, because Bash would have kept it.
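Bash’s own behaviour inside double quotes can be checked directly. In this small sketch (the variable names escaped and kept are just for illustration) the first string uses backslashes that Bash removes, and the second uses backslashes that it leaves in place:

```bash
# The backslash is removed before $, backtick, double quote and backslash.
escaped="a\$b \`c \"d \\e"

# Before any other character the backslash is kept as literal text.
kept="\d \n \t"

printf '%s\n' "$escaped"   # a$b `c "d \e
printf '%s\n' "$kept"      # \d \n \t
```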

The last top-level block [ "$OPTARG" = "$quo" ] just catches the case of a single-quoted string terminator.

Once the loop ends, wordsplit checks for an unterminated string, which is an error, and otherwise appends any dangling word.

Here it is in action:

wordsplit ' foo bar\ baz	"a b\"c" \d';printf "%s\n" "${WORDS[@]}"
foo
bar baz
a b"c
d

I’ve uploaded the code to GitHub. It has a test suite, but shell quoting is a treacherous business, so if you find any bugs please let me know!

Notes

  1. Greg’s wiki page on Bash quoting has advice and examples (the whole wiki is pretty great).

Tags: bash word-splitting awk xargs eval getopts