TCP_NIO2 uses a single selector to service all connections.
This means that any operation (read, write) on one connection delays operations on all other connections. In particular, the de-serialization done during reads delays other reads and writes.
A quick test showed that having a reader thread per connection (so that reads don't delay other reads or writes) improved performance from 15'000 reqs/sec to 22'000.
The idea is to have a reader thread in NioConnection which reads and de-serializes as many messages as possible. When no more messages are ready to be read, it blocks for a maximum wait time and then terminates, unless new messages have arrived in the meantime. This way, idle connections have no threads allocated to them.
Investigate: we could also null the pre-allocated read buffer when the reader thread terminates, reducing memory usage even further.
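The reader-thread lifecycle described above could be sketched roughly as follows. This is a minimal standalone illustration, not JGroups code: all names (ReaderDemo, Connection, dataReady, readLoop) are hypothetical, and a ConcurrentLinkedQueue stands in for data becoming readable on the socket.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ReaderDemo {

    static class Connection {
        final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
        final AtomicBoolean running = new AtomicBoolean(false); // is a reader thread alive?
        final AtomicInteger processed = new AtomicInteger();
        final Object lock = new Object();
        final long maxIdleMs;

        Connection(long maxIdleMs) { this.maxIdleMs = maxIdleMs; }

        // Called when data is readable (simulated here by enqueueing a message).
        // Starts a reader thread only if none is currently running.
        void dataReady(String msg) {
            pending.add(msg);
            synchronized (lock) { lock.notify(); }
            if (running.compareAndSet(false, true))
                new Thread(this::readLoop, "conn-reader").start();
        }

        void readLoop() {
            try {
                for (;;) {
                    // Drain and "de-serialize" as many messages as possible
                    while (pending.poll() != null)
                        processed.incrementAndGet();
                    synchronized (lock) {
                        if (!pending.isEmpty())
                            continue;
                        try {
                            lock.wait(maxIdleMs);       // block for the max wait time
                        } catch (InterruptedException e) {
                            return;
                        }
                        if (pending.isEmpty())
                            return;                     // still idle: terminate the thread
                    }
                }
            } finally {
                // Connection is now idle: no thread allocated. The pre-allocated read
                // buffer could also be nulled here to cut memory usage (see note above).
                running.set(false);
                // A message may have arrived between the empty check and running.set(false);
                // if so, restart a reader so nothing is left unprocessed.
                if (!pending.isEmpty() && running.compareAndSet(false, true))
                    new Thread(this::readLoop, "conn-reader").start();
            }
        }
    }

    // Feed n messages, give the reader time to drain and terminate, return the count.
    static int demo(int n) {
        Connection c = new Connection(20);
        for (int i = 0; i < n; i++)
            c.dataReady("msg-" + i);
        try { Thread.sleep(500); } catch (InterruptedException e) { }
        return c.processed.get();
    }
}
```

The key design point is the restart re-check in the finally block: without it, a message arriving in the window between the idle check and clearing the running flag would sit in the queue with no thread to process it.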