Microsoft Trained Small Language Models To Process, Understand Search Query Responses


Microsoft improved Bing with new large language models (LLMs) and small language models (SLMs), which the company says helped to reduce latency and cost associated with hosting and running search.

“Leveraging both Large Language Models (LLMs) and Small Language Models (SLMs) marks a significant milestone in enhancing our search capabilities,” the company wrote in a blog post. “While transformer models have served us well, the growing complexity of search queries necessitated more powerful models.”

Microsoft trained SLMs to process and understand search queries more precisely. One of the key challenges with large models, however, is managing latency and cost, so Microsoft integrated Nvidia TensorRT-LLM into its workflow to optimize SLM inference performance.


One product in which Microsoft uses TensorRT-LLM is “Deep Search,” which aims to provide the best possible web results to Bing users.

Microsoft bought nearly half a million graphics processing units (GPUs) this year to build artificial intelligence (AI) systems, the Financial Times reported. The FT cited analysts at Omdia, a technology consultancy, who estimate that Microsoft bought 485,000 of Nvidia’s “Hopper” chips this year, outbuying even Meta, Nvidia’s next-biggest customer, which bought 224,000.

The original Transformer model, Microsoft explains, “had a 95th percentile latency of 4.76 seconds per batch and a throughput of 4.2 queries per second per instance.”

Each batch consisted of 20 queries, meaning 20 questions or requests processed together by the model.
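To illustrate what batching means in practice, here is a minimal sketch of grouping incoming queries into fixed-size batches of 20. This is purely illustrative and is not Microsoft's pipeline; the function name and data are made up:

```python
def batches(queries, batch_size=20):
    """Group incoming queries into fixed-size batches, each processed in one model call."""
    return [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]

# Hypothetical stream of 45 queries: yields two full batches and one partial batch
qs = [f"query {i}" for i in range(45)]
grouped = batches(qs)
print([len(b) for b in grouped])  # → [20, 20, 5]
```

Processing queries in batches lets the model amortize fixed per-call overhead across many requests, which is why the reported metrics are given per batch.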

After integrating TensorRT-LLM, Microsoft managed to achieve a 95th percentile latency reduction to 3.03 seconds per batch and increased throughput to 6.6 queries per second per instance.
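The relative gains can be computed directly from the figures quoted above; a quick arithmetic sketch (the numbers are the reported ones, nothing else is assumed):

```python
# Reported Bing metrics before and after TensorRT-LLM integration
baseline_p95_s = 4.76   # 95th percentile latency per batch, seconds
optimized_p95_s = 3.03
baseline_qps = 4.2      # throughput: queries per second per instance
optimized_qps = 6.6

latency_reduction = (baseline_p95_s - optimized_p95_s) / baseline_p95_s
throughput_gain = (optimized_qps - baseline_qps) / baseline_qps

print(f"p95 latency reduced by {latency_reduction:.0%}")   # → 36%
print(f"throughput increased by {throughput_gain:.0%}")    # → 57%
```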

Latency is the time it takes for the model to process a request and return a response. The 95th percentile figure means that 95% of batches were processed in 3.03 seconds or less, a significant speed-up in response time.
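As an illustration of how such a percentile figure is derived, here is a minimal sketch using the nearest-rank method on hypothetical per-batch timings (the data is invented, not Microsoft's):

```python
import math

def p95_latency(latencies_s):
    """Return the 95th-percentile latency: 95% of samples are at or below this value."""
    ordered = sorted(latencies_s)
    # Nearest-rank method: smallest value that covers 95% of samples
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

# Hypothetical per-batch processing times in seconds
timings = [2.8, 3.0, 2.9, 3.1, 2.7, 3.0, 2.9, 3.2, 2.8, 3.0,
           2.9, 3.1, 3.0, 2.8, 2.9, 3.0, 3.1, 2.9, 3.0, 4.5]
print(p95_latency(timings))  # → 3.2
```

Note that the single 4.5-second outlier barely moves the p95 figure, which is why tail percentiles are preferred over averages for reporting user-facing latency.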

This optimization enhances the user experience by delivering quicker search results and reduces the operational cost of running these large models by 57%.

Microsoft said the transition to SLMs and the integration of TensorRT-LLM brought:

  • Faster search results: with optimized inference, users enjoy quicker response times, making their search experience more seamless and efficient
  • Improved accuracy: the enhanced capabilities of SLMs allow Microsoft to deliver more accurate and contextualized search results, helping users find the information they need more effectively
  • Cost efficiency: by reducing the cost of hosting and running large models, Microsoft can continue to invest in further innovations and improvements, ensuring that Bing remains at the forefront of search technology


    MediaPost.com: Search & Performance Marketing Daily
