[Figure: input audio → network A → network B → output text]
Figure derived from Chan et al., 2015