PR types

Performance optimization

PR changes

OPs

Describe

  1. GetLengthLoD and GPUDistFpnProposalsHelper should run on the context stream
  2. Remove two unnecessary context waits (no data is transferred between host and device at those points)
  3. sub_lod_data can be copied with a single batched memcpy, which avoids synchronizing once per copy
  4. There is a ~1% e2e performance gain on trt-fp16/maskrcnn inference
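
To illustrate item 3, here is a minimal host-only sketch of the batching idea. It is not the code from this PR. `FakeDevice`, `PackedLoD`, `PackSubLoD` and `CopySubLoDBatched` are hypothetical names, and the device copy is replaced by a plain `std::memcpy` that counts its calls. In the real op the single copy would presumably be an async memcpy issued on the context stream.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for device memory plus a copy routine. In a real
// kernel this would be a device buffer and an async copy on the context
// stream; here it is host memory so the sketch runs anywhere.
struct FakeDevice {
  std::vector<std::size_t> memory;
  int copy_calls = 0;

  void Copy(std::size_t dst_offset, const std::size_t* src,
            std::size_t count) {
    std::memcpy(memory.data() + dst_offset, src,
                count * sizeof(std::size_t));
    ++copy_calls;
  }
};

// All per-level LoD vectors packed into one contiguous staging buffer.
struct PackedLoD {
  std::vector<std::size_t> data;
  std::vector<std::size_t> offsets;  // offsets[i] = start of level i in data
};

// Concatenate the levels and remember where each one begins.
PackedLoD PackSubLoD(const std::vector<std::vector<std::size_t>>& levels) {
  PackedLoD packed;
  for (const auto& level : levels) {
    packed.offsets.push_back(packed.data.size());
    packed.data.insert(packed.data.end(), level.begin(), level.end());
  }
  return packed;
}

// Issue one copy for all levels instead of one copy (and one
// synchronization) per level.
PackedLoD CopySubLoDBatched(
    const std::vector<std::vector<std::size_t>>& levels,
    FakeDevice* device) {
  PackedLoD packed = PackSubLoD(levels);
  device->memory.resize(packed.data.size());
  if (!packed.data.empty()) {
    device->Copy(0, packed.data.data(), packed.data.size());
  }
  return packed;
}
```

With three levels, `CopySubLoDBatched` performs one copy rather than three. Consumers locate level `i` through `offsets[i]` in the packed buffer.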