Llama2-70B Profiling and Pipeline Parallelism Investigation

1. Comparing vLLM and TRT-LLM profiles
2. NVIDIA NCU (Nsight Compute) profiling
3. NVIDIA NSYS (Nsight Systems) profiling
4. Compare gaps in Llama2-70B iteration characteristics?
5. Autotune Llama2-70B on H100 for maximum throughput?
6. Understanding pipeline parallelism?
7. Understand the PR from vLLM?
8. How to measure bubbles in PP?
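
For item 8, a rough starting point might be the following sketch. It assumes a GPipe-style synchronous schedule in which every stage takes the same time per micro-batch, so the ideal bubble fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches. The function names and the busy-time/wall-time inputs are hypothetical; in practice those numbers would have to be extracted from a profile (for example an NSYS timeline), and the schedule used by vLLM may differ from this model.

```python
def ideal_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Ideal idle fraction per stage for a GPipe-style schedule.

    Assumption: equal per-stage time per micro-batch, no communication cost.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def measured_bubble_fraction(busy_time_s: float, wall_time_s: float) -> float:
    """Idle fraction of one stage over one iteration.

    busy_time_s: summed GPU kernel time on that stage (hypothetical input,
                 e.g. taken from a profiler timeline).
    wall_time_s: end-to-end wall-clock time of the same iteration.
    """
    if wall_time_s <= 0 or not 0 <= busy_time_s <= wall_time_s:
        raise ValueError("expected 0 <= busy_time_s <= wall_time_s, wall_time_s > 0")
    return 1.0 - busy_time_s / wall_time_s


if __name__ == "__main__":
    # 4 pipeline stages, 8 micro-batches in flight
    print(f"ideal:    {ideal_bubble_fraction(4, 8):.3f}")
    # Illustrative numbers only, not measurements
    print(f"measured: {measured_bubble_fraction(7.0, 10.0):.3f}")
```

Comparing the measured value against the ideal one for the same p and m would indicate how much idle time comes from the schedule itself and how much from other sources such as stage imbalance or communication.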