In addition to its ongoing collaboration with OpenAI on artificial intelligence models, Microsoft continues to update its Phi series of small language models. The newly released Phi-4-multimodal adds multimodal processing capabilities that support speech, images, and text, and is available through platforms such as Azure AI Foundry, Hugging Face, and the Nvidia API Catalog.
Compared to the previously launched Phi-4, the new release mainly adds multimodal processing and strengthens speech recognition, visual analysis, and text reasoning, thereby improving the performance of multitasking artificial intelligence applications on the device side.
Because it handles multiple modalities natively, it differs from earlier models: it does not need to convert speech into text first, nor rely on a separate vision model for image analysis. Both of those steps add significant latency to overall execution and can consume more memory and other resources on the device.
The newly introduced Phi-4-multimodal processes speech, images, and text directly through a unified neural network architecture, thereby improving data processing efficiency. Phi-4-multimodal has 5.6 billion parameters and supports a context length of 128,000 tokens. It also incorporates preference optimization, feedback-based reinforcement learning, and safety measures.
Phi-4-multimodal supports more than 20 languages for text, including major languages such as English, Chinese, Japanese, Korean, German, and French. Speech support covers major languages such as English, Chinese, Spanish, and Japanese. Image understanding currently supports English only.
Alongside Phi-4-multimodal, Microsoft also launched the smaller Phi-4-mini, which has only 3.8 billion parameters and focuses on text processing. It supports code generation, mathematical reasoning, long-text processing, and more. It can handle a context of 128,000 tokens, and is billed as offering stronger reasoning and instruction-following performance than other small language models of similar size.
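To give a rough sense of why parameter count matters for on-device use, the sketch below estimates the memory needed just to hold the model weights, taking the published sizes of 5.6 billion parameters for Phi-4-multimodal and 3.8 billion for Phi-4-mini. The precision table and the helper name `weight_memory_gb` are illustrative assumptions, not figures from Microsoft, and the estimate ignores activations, the attention cache, and runtime overhead.

```python
# Back-of-the-envelope estimate: weight memory = parameters x bytes per parameter.
# The precisions listed here are common storage formats, assumed for illustration.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

# Published parameter counts for the two models discussed in the article.
MODELS = {
    "Phi-4-multimodal": 5.6e9,
    "Phi-4-mini": 3.8e9,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, params in MODELS.items():
        row = ", ".join(
            f"{p}: {weight_memory_gb(params, p):.1f} GB" for p in BYTES_PER_PARAM
        )
        print(f"{name} -> {row}")
```

Under these assumptions, the weights of the smaller model take roughly two thirds of the space the multimodal model needs at any given precision, which is one reason a text-only variant can be attractive on memory-constrained devices.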



