If using a raw bag of words or n-grams, why not hash the strings into, say, 2^13 - 1 slots with something like MurmurHash3 (or with multiple hashes to mitigate collisions), and then use that sparse vector as input to a deep learning model?
So the parameter would be the number of slots (== number of input units of the deep NN).
And the transformation of the text into bag-of-words / n-grams would not count as feature engineering, or at most as 'low-level feature engineering'; the higher-level features would be learned by the deep network.
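As a sketch of that hashing step (using the standard-library `hashlib` as a stand-in for MurmurHash3; the slot count and the `num_hashes` parameter are just illustrative choices, not a fixed recipe):

```python
import hashlib

NUM_SLOTS = 2**13 - 1  # number of slots == number of input units of the NN


def hash_token(token, seed=0):
    """Deterministically map a token to a slot index."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % NUM_SLOTS


def hashed_bag_of_words(tokens, num_hashes=2):
    """Build a sparse count vector over NUM_SLOTS slots.

    Using several independent hashes spreads each token across
    multiple slots, which mitigates (but does not prevent) collisions.
    """
    vec = [0.0] * NUM_SLOTS
    for tok in tokens:
        for seed in range(num_hashes):
            vec[hash_token(tok, seed)] += 1.0
    return vec


vec = hashed_bag_of_words("the cat sat on the mat".split())
```

In a real pipeline you would likely use the `mmh3` package or scikit-learn's `FeatureHasher` instead of rolling your own, but the idea is the same.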
I guess one could go lower-level still and do away with bag-of-words / n-grams entirely: limit the text to, say, 20,000 characters, represent each character as a numerical value (e.g. its ASCII code point when dealing with mainly English text), and simply feed this vector of code points to the input layer of the deep network. Given enough training data, it should learn location-invariant representations like bag-of-words / n-grams (or even better ones) itself, right?
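That character-level encoding might look like this (a minimal sketch; the 20,000-character cap is from the question, and using 0 as the padding value is my own assumption):

```python
MAX_LEN = 20000  # cap on text length, as suggested above


def char_codes(text, max_len=MAX_LEN):
    """Represent text as a fixed-length vector of code points.

    The text is truncated to max_len characters and zero-padded
    on the right; 0 serves as the (assumed) padding value.
    """
    codes = [ord(c) for c in text[:max_len]]
    codes += [0] * (max_len - len(codes))
    return codes


x = char_codes("Hello, world!")
```

In practice most frameworks would treat these integers as indices into a learned character embedding rather than feeding raw code points directly, but either way the network sees only characters, with no hand-built n-gram features.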