In the wave of generative AI, most language learning apps are still stuck at the stage of "connecting to a large language model API." Speak, a language learning service backed by the OpenAI Startup Fund, is clearly taking a different path.
Recently, Speak's technical team explained two major evolutions in its underlying architecture: first, a full embrace of an agentic engineering process; second, "Matching v2," a voice-matching technology that combines automatic speech recognition (ASR) with a phoneme model. This time, rather than discussing how user-friendly the product interface is, we take a technical look at how Speak is redefining the software development process in the AI era and how it is overcoming the challenges of speech recognition in learning scenarios.
Agentic Engineering: A Paradigm Shift in Development Thinking
Speak's concept of "agentic engineering" is not merely having engineers write code with the AI code editor Cursor; it treats AI agents as the core collaborative unit of the development process.
Task-oriented AI system design
Speak believes the era of traditional software development, in which engineers hand-write every line of code, is over. In their practice, development shifts toward orchestration: complex system functionality is decomposed across multiple AI agents, each with a specific task capability.
For example, a new course feature is not built by a single engineer but by "agent teams" working in parallel: some agents handle front-end components while others handle logic verification, with coordination carried out in natural language (see the sketch below).
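As a purely illustrative sketch of this shape of orchestration (not Speak's internal tooling): one feature is decomposed into role-specific subtasks that run concurrently, with a stubbed run_agent standing in for a real LLM call. The roles and prompts below are assumptions.

```python
# Illustrative sketch of "agent teams" running in parallel. run_agent is a stub
# standing in for a real LLM call; roles and prompts are hypothetical.
import asyncio

async def run_agent(role: str, task: str) -> str:
    """Stand-in for one agent: in practice this would call a model with role-specific context."""
    await asyncio.sleep(0.1)  # simulate model latency
    return f"[{role}] draft for: {task}"

async def build_feature(feature: str) -> list[str]:
    # Decompose the feature into role-specific subtasks and run them concurrently;
    # an orchestrator (human or another agent) reconciles the drafts afterwards.
    subtasks = {
        "frontend": f"implement the UI components for {feature}",
        "verification": f"write logic checks for {feature}",
        "docs": f"draft the API notes for {feature}",
    }
    return list(await asyncio.gather(*(run_agent(r, t) for r, t in subtasks.items())))

if __name__ == "__main__":
    for draft in asyncio.run(build_feature("new course flow")):
        print(draft)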
"Contextual engineering" becomes a core competency
In Speak's engineering philosophy, the upper limit of an AI agent's capability is set by the quality of the context it is given. Their practical focus is therefore on "repo readiness": building an AI-friendly repository with automated documentation indexing, standardized API declarations, and a sandboxed execution environment.
This context-first development logic lets agents fix bugs or generate prototypes more accurately and autonomously, significantly shortening the overall cycle from conception to deployment. One ingredient of repo readiness is sketched below.
Matching v2: Addressing the inherent limitations of speech recognition
If "proxy engineering" is a powerful tool for backend development, then "Matching v2" is the technological cornerstone of Speak's core product strength.
A dual-track system: automatic speech recognition plus a phoneme model
Traditional ASR has a fatal flaw in language learning: it is designed to "understand semantics," not to "correct pronunciation." When a learner mispronounces a word (e.g., pronouncing "They" as "Day"), a strong ASR model's built-in language model often "auto-corrects" the error and outputs the intended word, making it impossible for the system to detect the user's pronunciation mistake.
Speak's solution is to introduce a phoneme model that converts audio directly into IPA (International Phonetic Alphabet) sequences:
• ASR handles the semantic layer: it determines what the user is trying to say.
• The phoneme model handles the physical layer: it records the sounds the user actually produced.
Through a forced alignment algorithm, the system computes a mathematically optimal match between the standard phonetic transcription of the target sentence and the phonemes the user actually produced. This resolves near-homophone confusions such as "Four candles" versus "Fork handles" (a simplified alignment is sketched below).
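Speak has not published its alignment implementation, but the core idea can be sketched with a standard edit-distance alignment over IPA sequences. The align function and example phonemes below are illustrative stand-ins, not Speak's algorithm.

```python
# Minimal sketch of phoneme-level matching: globally align a target IPA sequence
# with the recognized IPA sequence using plain edit-distance alignment, a
# standard stand-in for the forced-alignment step described above.

def align(target: list[str], spoken: list[str]) -> list[tuple[str | None, str | None]]:
    """Globally align two phoneme sequences; None marks an insertion/deletion."""
    n, m = len(target), len(spoken)
    # dp[i][j] = minimal edit cost aligning target[:i] with spoken[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (target[i - 1] != spoken[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Trace back to recover the optimal pairing
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (target[i - 1] != spoken[j - 1]):
            pairs.append((target[i - 1], spoken[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((target[i - 1], None)); i -= 1   # phoneme dropped by speaker
        else:
            pairs.append((None, spoken[j - 1])); j -= 1   # extra phoneme inserted
    return pairs[::-1]

# "They" /ðeɪ/ mispronounced as "Day" /deɪ/: the substitution ð -> d surfaces
# here instead of being silently corrected by a language model.
for tgt, spk in align(["ð", "eɪ"], ["d", "eɪ"]):
    print(f"{tgt!s:>4} vs {spk!s:<4} {'ok' if tgt == spk else 'mismatch'}")
```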
The Engineering Evolution from "Bag of Words" to "Sequence Matching"
In Matching v1, Speak used a simpler "bag of words" model: a match was triggered whenever a word the user spoke appeared anywhere in the target sentence. In Matching v2, the technical team switched to sequential matching.
This imposes stricter real-time requirements. Speak chose to optimize Transformer-architecture models such as Wav2vec2 to support streaming inference, with the system updating its matching state every 200-300 milliseconds. The approach not only enforces correct word order (e.g., distinguishing "Man bites dog" from "Dog bites man") but also significantly reduces false positives; the contrast between the two strategies is sketched below.
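As a toy illustration of the difference, the sketch below matches a growing streaming hypothesis against a target sentence both ways. The class names and the simulated hypothesis stream are ours, not Speak's API; the stream stands in for the 200-300 ms streaming updates.

```python
# Toy contrast between v1-style bag-of-words matching and v2-style sequential
# matching over a growing ASR hypothesis. All names here are illustrative.

class BagOfWordsMatcher:
    """v1 idea: a target word counts as matched wherever it appears."""
    def __init__(self, target: str):
        self.target_words = set(target.lower().split())
        self.matched: set[str] = set()

    def update(self, hypothesis: str) -> set[str]:
        self.matched |= self.target_words & set(hypothesis.lower().split())
        return self.matched

class SequentialMatcher:
    """v2 idea: target words must be matched in order, so word order matters."""
    def __init__(self, target: str):
        self.target = target.lower().split()

    def update(self, hypothesis: str) -> list[str]:
        # Re-match the full (growing) hypothesis against the target in order.
        matched, pos = [], 0
        for w in hypothesis.lower().split():
            if pos < len(self.target) and w == self.target[pos]:
                matched.append(w)
                pos += 1
        return matched

target = "dog bites man"
stream = ["man", "man bites", "man bites dog"]  # one hypothesis per streaming update
bow, seq = BagOfWordsMatcher(target), SequentialMatcher(target)
for hyp in stream:
    print(f"{hyp!r:17} bag-of-words={sorted(bow.update(hyp))} sequential={seq.update(hyp)}")
```

On the final update, the bag-of-words matcher credits all three words even though the user reversed the sentence, while the sequential matcher credits only "dog": exactly the word-order error that Matching v2 is designed to catch.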
The challenge in practice: balancing accuracy and tolerance for error
In its technical presentation, Speak pointed out that the biggest challenge for such AI systems lies in balancing false negatives against false positives: if matching is too strict, users get frustrated by wrongly flagged errors; if it is too lenient, real errors slip through and the learning value is lost.
Through the collaboration of ASR and the phoneme model, Speak reports reducing the false positive rate by about 40% while keeping the false negative rate unchanged. The system has, in effect, become smarter: it can detect subtle flaws in your pronunciation while still recognizing when you have communicated well enough (see the toy sketch below).
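To make the operating-point tradeoff concrete, here is a toy sketch that sweeps an accept threshold over per-utterance match scores. The scores and labels are fabricated for illustration; nothing here reflects Speak's data.

```python
# Toy sketch of the false positive / false negative tradeoff, assuming each
# utterance gets a match score in [0, 1] and a human label of whether the
# pronunciation was acceptable. All numbers below are fabricated.

def error_rates(scored: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) at a given accept threshold."""
    fp = sum(1 for s, ok in scored if s >= threshold and not ok)  # accepted bad attempts
    fn = sum(1 for s, ok in scored if s < threshold and ok)       # rejected good attempts
    bad = sum(1 for _, ok in scored if not ok)
    good = sum(1 for _, ok in scored if ok)
    return fp / bad, fn / good

# (match score, was the pronunciation actually acceptable)
attempts = [(0.95, True), (0.85, True), (0.80, False), (0.65, True),
            (0.60, False), (0.40, False), (0.35, True), (0.20, False)]
for t in (0.3, 0.5, 0.7):
    fpr, fnr = error_rates(attempts, t)
    print(f"threshold={t:.1f}  false positives={fpr:.0%}  false negatives={fnr:.0%}")
```

Moving the threshold alone only trades one error for the other; the point of the dual-track design is to improve the score itself, shifting the whole curve rather than sliding along it.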
Analysis
Speak's experience suggests that the differentiation of future AI services will no longer lie in who uses the stronger model (everyone may end up using Claude or GPT), but in the engineering of domain expertise.
Speak accelerates feature iteration through agentic engineering and establishes a formidable technical barrier through its dedicated voice-matching pipeline. This approach of embedding task orientation deep into both the development process and the core product algorithms may be a practical model for Taiwanese development teams to learn from in the AI era.