Microsoft Trained Small Language Models To Process, Better Understand Search Queries
- by Laurie Sullivan, Staff Writer @lauriesullivan, December 18, 2024
Microsoft improved Bing with new large language models (LLMs) and small language models (SLMs), which the company says helped reduce the latency and cost of hosting and running search.
“Leveraging both Large Language Models (LLMs) and Small Language Models (SLMs) marks a significant milestone in enhancing our search capabilities,” the company wrote in a blog post. “While transformer models have served us well, the growing complexity of search queries necessitated more powerful models.”
Microsoft trained SLMs to process and understand search queries more precisely. But one of the key challenges with large models is managing latency and cost, so Microsoft integrated Nvidia TensorRT-LLM into its workflow to optimize SLM inference performance.
One product in which Microsoft uses TensorRT-LLM is "Deep search," which aims to provide the best possible web results to Bing users.
The original Transformer model, Microsoft explains, “had a 95th percentile latency of 4.76 seconds per batch and a throughput of 4.2 queries per second per instance.”
Each batch consisted of 20 queries, meaning 20 questions or requests processed together by the model.
After integrating TensorRT-LLM, Microsoft managed to achieve a 95th percentile latency reduction to 3.03 seconds per batch and increased throughput to 6.6 queries per second per instance.
Latency is the time it takes for the model to process a request and provide a response. The 95th percentile means that 95% of the batches were processed in 3.03 seconds or less, significantly speeding response times.
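To make the "95th percentile" metric concrete, here is a minimal sketch of how a p95 latency is computed from per-batch timings using the nearest-rank method. The function name and timing values are illustrative, not Microsoft's data or method.

```python
import math

def p95_latency(batch_times_s):
    """Return the 95th-percentile latency: the smallest measured value
    that 95% of the batches finish at or below (nearest-rank method)."""
    ordered = sorted(batch_times_s)
    rank = math.ceil(0.95 * len(ordered))  # how many samples p95 must cover
    return ordered[rank - 1]

# Example: for batch times of 1..20 seconds, 95% of batches (19 of 20)
# complete within 19 seconds.
print(p95_latency(list(range(1, 21))))  # 19
```

Tracking the 95th percentile rather than the average keeps the metric honest about tail latency, which is what users actually feel on slow requests.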
This optimization enhanced the user experience by delivering quicker search results and reduced the operational costs of running these large models by 57%.
Microsoft said the transition to SLM models and the integration of TensorRT-LLM brought:
- Faster search results: With optimized inference, users enjoy quicker response times, making their search experience more seamless and efficient.
- Improved accuracy: The enhanced capabilities of SLM models allow Microsoft to deliver more accurate and contextualized search results, helping users find the information they need more effectively.
- Cost efficiency: By reducing the cost of hosting and running large models, Microsoft can continue to invest in further innovations and improvements, ensuring that Bing remains at the forefront of search technology.
MediaPost.com: Search & Performance Marketing Daily