
[Performance] model inference in onnxruntime is toooooo slow #23282

Open
Tian14267 opened this issue Jan 8, 2025 · 1 comment

Labels
model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. performance issues related to performance regressions

Comments


Tian14267 commented Jan 8, 2025

Describe the issue

I converted the bge-reranker-v2-m3 model to ONNX and ran it on the GPU, but inference with ONNX Runtime is far slower than expected.
Running the model in torch takes about 4 minutes for 10,000 sentence pairs; running the ONNX model on the same data and the same server takes almost 1 hour.
[screenshot]

Here is the device utilization while running the ONNX model:
CPU: [screenshot]

GPU: [screenshot]

My GPU is an NVIDIA GeForce RTX 4090.

Here are the versions:

python  3.10

onnx                              1.17.0
onnx-graphsurgeon                 0.5.2
onnx-simplifier                   0.4.36
onnxruntime-gpu                   1.19.2
torch                             2.5.1

Why is the ONNX model so slow?

To reproduce

Here is my inference code:

class OnnxInference():
    def __init__(self):
        import onnxruntime

        self.max_length = 4096
        device = 'gpu'
        model_path = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
        if device == "cpu":
            self.onnx_model = onnxruntime.InferenceSession(model_path)
        elif "gpu" in device:
            # Request the CUDA execution provider for GPU inference.
            providers = ['CUDAExecutionProvider']
            self.onnx_model = onnxruntime.InferenceSession(model_path, providers=providers)
        ###########
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("../bge-reranker-v2-m3")

    def inference(self, input_data):
        # Tokenize to NumPy arrays so they can be fed to ONNX Runtime directly.
        inputs = self.tokenizer(input_data,
                                padding=True, truncation=True, return_tensors='np',
                                max_length=self.max_length)

        def get_input_feed(input_ids, attention_mask):
            input_feed = {}
            input_feed["input_ids"] = input_ids
            input_feed["attention_mask"] = attention_mask
            return input_feed

        input_feed = get_input_feed(inputs["input_ids"], inputs["attention_mask"])
        outs = self.onnx_model.run(["logits"], input_feed)

        return outs
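
As a sanity check (a minimal sketch, reusing the model path from above; not part of the original timing runs), this is how one can confirm that the session really runs on the CUDA provider instead of silently falling back to the CPU provider:

import onnxruntime

# List the providers compiled into this onnxruntime-gpu build.
print(onnxruntime.get_available_providers())

# Re-open the session and ask which providers it actually ended up with.
# If CUDAExecutionProvider fails to load (e.g. missing CUDA/cuDNN libraries),
# ONNX Runtime falls back to CPUExecutionProvider, which would explain the slowdown.
session = onnxruntime.InferenceSession(
    "./onnx_model/onnx_fp32/reranker_onnx.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # CUDAExecutionProvider should be listed first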

Here is my conversion code (torch to ONNX):


import numpy as np
import torch


def convert_to_onnx():
    from transformers import AutoModelForSequenceClassification
    model_name_or_path = "/data/fffan/01_experiment/03_Bge/bge-reranker-v2-m3"

    device = 'cuda:0'
    # Dummy int64 inputs at the maximum sequence length, used to trace the export.
    input_ids_np = torch.from_numpy(np.zeros([1, 4096], dtype=np.int64))
    attention_mask_np = torch.from_numpy(np.zeros([1, 4096], dtype=np.int64))

    input_ids_tf = input_ids_np.type(torch.int64).to(device)
    attention_mask_tf = attention_mask_np.type(torch.int64).to(device)

    model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)
    model = model.to(device)
    model.eval()

    onnx_name = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
    torch.onnx.export(model,  # model being run
                      (input_ids_tf, attention_mask_tf),  # model inputs (a tuple for multiple inputs)
                      onnx_name,  # where to save the exported model
                      opset_version=14,  # the ONNX opset version to export to
                      input_names=['input_ids', 'attention_mask'],
                      output_names=['logits'],
                      dynamic_axes={"input_ids": {0: "batch_size", 1: "max_length"},  # dynamic batch and sequence axes
                                    "attention_mask": {0: "batch_size", 1: "max_length"}
                                    }
                      )

    print("####  Conversion finished")

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu 1.19.2

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.3

Model File

No response

Is this a quantized model?

No

@Tian14267 Tian14267 added the performance issues related to performance regressions label Jan 8, 2025
@github-actions github-actions bot added the model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. label Jan 8, 2025

IDEA-V commented Jan 8, 2025

Try setting the execution_mode in SessionOptions to ort.ExecutionMode.ORT_SEQUENTIAL. I don't know why, but ort.ExecutionMode.ORT_PARALLEL is very slow for me.
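
A minimal sketch of that (model path and provider taken from the code in the issue):

import onnxruntime as ort

sess_options = ort.SessionOptions()
# Force sequential execution of the graph instead of parallel execution.
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

session = ort.InferenceSession(
    "./onnx_model/onnx_fp32/reranker_onnx.onnx",
    sess_options,
    providers=["CUDAExecutionProvider"],
)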
