
[Performance] model inference in onnxruntime is toooooo slow #23282

Open
Tian14267 opened this issue Jan 8, 2025 · 1 comment

Labels
model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. performance issues related to performance regressions

Comments


Tian14267 commented Jan 8, 2025

Describe the issue

I converted the bge-reranker-v2-m3 model to ONNX and ran it on the GPU, but inference with ONNX Runtime is far slower than expected.
Running the model in torch takes about 4 minutes for 10,000 sentence pairs; running the ONNX model on the same data and the same server takes almost 1 hour.
[screenshot]

Here is the device utilization while running the ONNX model:
CPU: [screenshot]

GPU: [screenshot]

My GPU is an NVIDIA GeForce RTX 4090.

Here are the versions:

python  3.10

onnx                              1.17.0
onnx-graphsurgeon                 0.5.2
onnx-simplifier                   0.4.36
onnxruntime-gpu                   1.19.2
torch                             2.5.1

Why is the ONNX model so slow?

To reproduce

Here is my inference code:

class OnnxInference():
    def __init__(self):
        import onnxruntime

        self.max_length = 4096
        device = 'gpu'
        model_path = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
        if device == "cpu":
            self.onnx_model = onnxruntime.InferenceSession(model_path)
        elif "gpu" in device:
            # Request the CUDA execution provider for GPU inference.
            providers = ['CUDAExecutionProvider']
            self.onnx_model = onnxruntime.InferenceSession(model_path, providers=providers)
        ###########
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("../bge-reranker-v2-m3")

    def inference(self, input_data):
        # Tokenize to NumPy arrays so they can be fed to ONNX Runtime directly.
        inputs = self.tokenizer(input_data,
                                padding=True, truncation=True, return_tensors='np',
                                max_length=self.max_length)

        def get_input_feed(input_ids, attention_mask):
            input_feed = {}
            input_feed["input_ids"] = input_ids
            input_feed["attention_mask"] = attention_mask
            return input_feed

        input_feed = get_input_feed(inputs["input_ids"], inputs["attention_mask"])
        outs = self.onnx_model.run(["logits"], input_feed)

        return outs
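
As a sanity check (a minimal sketch, reusing the model path from above; not part of the original timing runs), this is how one can confirm that the session really runs on the CUDA provider instead of silently falling back to the CPU provider:

import onnxruntime

# List the providers compiled into this onnxruntime-gpu build.
print(onnxruntime.get_available_providers())

# Re-open the session and ask which providers it actually ended up with.
# If CUDAExecutionProvider fails to load (e.g. missing CUDA/cuDNN libraries),
# ONNX Runtime falls back to CPUExecutionProvider, which would explain the slowdown.
session = onnxruntime.InferenceSession(
    "./onnx_model/onnx_fp32/reranker_onnx.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # CUDAExecutionProvider should be listed first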

Here is my conversion code (torch to ONNX):


import numpy as np
import torch


def convert_to_onnx():
    from transformers import AutoModelForSequenceClassification
    model_name_or_path = "/data/fffan/01_experiment/03_Bge/bge-reranker-v2-m3"

    device = 'cuda:0'
    # Dummy int64 inputs at the maximum sequence length, used to trace the export.
    input_ids_np = torch.from_numpy(np.zeros([1, 4096], dtype=np.int64))
    attention_mask_np = torch.from_numpy(np.zeros([1, 4096], dtype=np.int64))

    input_ids_tf = input_ids_np.type(torch.int64).to(device)
    attention_mask_tf = attention_mask_np.type(torch.int64).to(device)

    model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)
    model = model.to(device)
    model.eval()

    onnx_name = "./onnx_model/onnx_fp32/reranker_onnx.onnx"
    torch.onnx.export(model,  # model being run
                      (input_ids_tf, attention_mask_tf),  # model inputs (a tuple for multiple inputs)
                      onnx_name,  # where to save the exported model
                      opset_version=14,  # the ONNX opset version to export to
                      input_names=['input_ids', 'attention_mask'],
                      output_names=['logits'],
                      dynamic_axes={"input_ids": {0: "batch_size", 1: "max_length"},  # dynamic batch and sequence axes
                                    "attention_mask": {0: "batch_size", 1: "max_length"}
                                    }
                      )

    print("####  Conversion finished")

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu 1.19.2

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.3

Model File

No response

Is this a quantized model?

No

@Tian14267 Tian14267 added the performance issues related to performance regressions label Jan 8, 2025
@github-actions github-actions bot added the model:transformer issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc. label Jan 8, 2025

IDEA-V commented Jan 8, 2025

Try setting the execution_mode in SessionOptions to ort.ExecutionMode.ORT_SEQUENTIAL. I don't know why, but ort.ExecutionMode.ORT_PARALLEL is very slow for me.
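
A minimal sketch of that (model path and provider taken from the code in the issue):

import onnxruntime as ort

sess_options = ort.SessionOptions()
# Force sequential execution of the graph instead of parallel execution.
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

session = ort.InferenceSession(
    "./onnx_model/onnx_fp32/reranker_onnx.onnx",
    sess_options,
    providers=["CUDAExecutionProvider"],
)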
