Hello, we are trying to use SpecInfer to accelerate model inference, but we have run into a performance issue. Specifically, as the batch size increases from 1 to 16, system throughput improves steadily; however, at batch size 32 throughput drops sharply, which is confusing. Our execution configuration is as follows:
Environment Setup
We use the provided Docker image (ghcr.io/flexflow/flexflow-cuda-11.8:latest) and build from source following the docs (https://flexflow.readthedocs.io/en/latest/).
We test two supported models, Llama2-70B and OPT-13B, on this dataset (https://huggingface.co/datasets/gbharti/finance-alpaca).
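For reference, a prompts file in the format the script below expects can be produced from the dataset along these lines. This is only a sketch: the Hugging Face `datasets` library and the alpaca-style `instruction` field are assumptions, so adjust the field name if the schema differs.

```python
# Sketch: build a JSON list of prompt strings for --prompts_file.
# Assumes the Hugging Face `datasets` library and the alpaca-style
# "instruction" field; adjust the field name if the schema differs.
import json
from datasets import load_dataset

dataset = load_dataset("gbharti/finance-alpaca", split="train")
prompts = [row["instruction"] for row in dataset.select(range(128))]
with open("prompts.json", "w") as f:
    json.dump(prompts, f)
```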
Test Script
We run model inference following the Quickstart guide in the repo:
```python
import flexflow.serve as ff
import argparse
import json

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_gpus', type=int)
    parser.add_argument('--memory_per_gpu', type=int)
    parser.add_argument('--zero_copy_memory_per_node', type=int)
    parser.add_argument('--tensor_parallelism_degree', type=int)
    parser.add_argument('--pipeline_parallelism_degree', type=int)
    parser.add_argument('--llm', type=str)
    parser.add_argument('--ssm', type=str, default='')
    parser.add_argument('--prompts_file', type=str)
    parser.add_argument('--max_requests_per_batch', type=int)
    parser.add_argument('--max_seq_length', type=int)
    parser.add_argument('--max_tokens_per_batch', type=int)
    args = parser.parse_args()

    ff.init(num_gpus=args.num_gpus,
            memory_per_gpu=args.memory_per_gpu,
            zero_copy_memory_per_node=args.zero_copy_memory_per_node,
            tensor_parallelism_degree=args.tensor_parallelism_degree,
            pipeline_parallelism_degree=args.pipeline_parallelism_degree)

    # Specify the LLM
    llm = ff.LLM(args.llm)

    # Specify a list of SSMs (comma-separated; may be empty)
    ssms = []
    if args.ssm:
        for ssm_name in args.ssm.split(','):
            ssms.append(ff.SSM(ssm_name))

    # Create the sampling config (greedy decoding)
    generation_config = ff.GenerationConfig(
        do_sample=False, temperature=0, topp=1, topk=1
    )

    # Compile the SSMs for inference and load the weights into memory
    for ssm in ssms:
        ssm.compile(generation_config,
                    max_requests_per_batch=args.max_requests_per_batch,
                    max_seq_length=args.max_seq_length,
                    max_tokens_per_batch=args.max_tokens_per_batch)

    # Compile the LLM for inference and load the weights into memory
    llm.compile(generation_config,
                ssms=ssms,
                max_requests_per_batch=args.max_requests_per_batch,
                max_seq_length=args.max_seq_length,
                max_tokens_per_batch=args.max_tokens_per_batch)

    # Load prompts and run generation
    with open(args.prompts_file, 'r') as f:
        prompts = json.load(f)
    llm.start_server()
    result = llm.generate(prompts=prompts)
    llm.stop_server()
```
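We record throughput as total generated tokens divided by wall-clock generation time. Below is a minimal sketch of that measurement; counting tokens via an `output_tokens` attribute on the returned results is our assumption about FlexFlow's GenerationResult and may need adjusting:

```python
import time

# Sketch: time the generate() call and count produced tokens.
# `output_tokens` is assumed to exist on each result; adjust to the
# actual GenerationResult fields if they differ.
start = time.perf_counter()
results = llm.generate(prompts=prompts)
elapsed = time.perf_counter() - start

total_tokens = sum(len(r.output_tokens) for r in results)
print(f"throughput: {total_tokens / elapsed:.2f} tokens/s")
```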
Test Results
We ran the evaluation on 4 NVIDIA A100 80GB GPUs connected via NVLink and recorded the throughput as the batch size increased from 1 to 32. The results are as follows:
| Throughput (tokens/s) | Llama2-70B | OPT-13B |
| --- | --- | --- |
| BS=1 | 28.709671931 | 97.12122162 |
| BS=2 | 52.22124339 | 189.1327599 |
| BS=4 | 106.9214668 | 362.0640686 |
| BS=8 | 182.9473744 | 680.4388029 |
| BS=16 | 322.7966769 | 1188.828348 |
| BS=32 | 298.8251763 | 437.7545888 |
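For reproducibility, each batch-size point corresponds to a separate run of the script; a hypothetical driver for such a sweep looks like the sketch below. The script name, model names, and all memory/flag values here are illustrative placeholders, not our exact settings:

```python
# Hypothetical sweep driver; all paths, model names, and values below
# are illustrative placeholders, not the exact settings of our runs.
import subprocess

for bs in [1, 2, 4, 8, 16, 32]:
    subprocess.run([
        "python", "specinfer_benchmark.py",
        "--num_gpus", "4",
        "--memory_per_gpu", "70000",
        "--zero_copy_memory_per_node", "200000",
        "--tensor_parallelism_degree", "4",
        "--pipeline_parallelism_degree", "1",
        "--llm", "meta-llama/Llama-2-70b-hf",
        "--ssm", "JackFram/llama-160m",
        "--prompts_file", "prompts.json",
        "--max_requests_per_batch", str(bs),
        "--max_seq_length", "256",
        "--max_tokens_per_batch", str(bs * 64),
    ], check=True)
```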
The drop from BS=16 to BS=32 is roughly 7% for Llama2-70B and 63% for OPT-13B. Any help diagnosing this issue would be appreciated!