Since the start of 2019, a clear trend has emerged: machine vision companies that used to keep a relatively low profile, such as Shangtang Technology (SenseTime) and Megvii Technology, have become high-profile and have released new strategies or plans for the year one after another. Ge Ling Shentong (DeepGlint), which had “disappeared” for two years, has also recently come forward to talk about its products and strategy.
With security and finance as their two main application scenarios, machine vision companies are doing well among AI companies. Their commercialization is moving relatively fast, so they need to build brands for their products and technologies.
Speech recognition is an equally closely watched area of AI, yet its companies attract far less attention than machine vision companies. The leading company, iFLYTEK, was even reported to be laying off staff. In addition, Spiech (AISpeech) and Yunzhisheng (Unisound), the two major speech recognition companies that feuded publicly in 2018, are representative companies in the field, but their valuations are much lower than those of Shangtang and Megvii.
Why is this happening?
This week's report takes a closer look at the field of speech recognition.
An overview of the speech recognition industry
1) How is speech recognition implemented technically? Let us first look at the basic idea.
In a speech recognition system, the first component is voice activity detection (VAD). The user's voice input triggers the detector, which wakes up the downstream recognition system to carry out feature extraction, recognition modeling, model training, decoding, and output of results. Voice activity detection can also trim the silent parts at the beginning and end of the audio so that they do not interfere with recognition.
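The silence trimming described above can be sketched with a simple energy threshold. This is only a minimal illustration: the function name `trim_silence`, the frame length, and the threshold value are assumptions made for this example, and production detectors typically rely on more robust features than raw energy.

```python
def trim_silence(samples, frame_len=160, threshold=0.01):
    """Drop leading and trailing frames whose mean energy is below threshold.

    samples: list of floats in [-1.0, 1.0] (assumed mono audio).
    frame_len and threshold are illustrative values, not tuned settings.
    """
    # Cut the signal into consecutive, non-overlapping frames.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    # A frame counts as "active" when its mean squared amplitude is high enough.
    active = [sum(x * x for x in f) / len(f) > threshold for f in frames]
    if not any(active):
        return []  # nothing but silence
    first = active.index(True)
    last = len(active) - 1 - active[::-1].index(True)
    # Keep everything from the first active frame through the last one.
    return samples[first * frame_len:(last + 1) * frame_len]
```

Called on a signal with quiet padding on both sides, the function returns only the span between the first and last active frames; pauses in the middle of the speech are kept.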
In image recognition, each part of the image needs to be partitioned and clustered before feature extraction. Speech recognition is similar. The speech is divided into frames using a moving window function and similar techniques; each frame is very short, on the order of tens of milliseconds, so the speech is cut into many small segments. Acoustic feature extraction then transforms each frame of the waveform into a multi-dimensional vector that describes that frame of audio. A large number of these vectors together form an observation sequence, and this sequence matrix must then be recognized and converted into text.
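As a rough sketch of the framing and feature extraction steps, the example below slides a Hamming window over the signal and turns each frame into a small vector. The 25 ms frame length, 10 ms hop, and the two features (log energy and zero-crossing rate) are assumptions chosen to keep the example short; real systems usually extract richer features such as MFCCs.

```python
import math

def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split samples into overlapping, Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    # Hamming window: tapers the frame edges to reduce spectral leakage.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

def frame_features(frame):
    """Describe one frame as a 2-dimensional vector (illustrative features)."""
    energy = sum(x * x for x in frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [log_energy, crossings / (len(frame) - 1)]

def observation_sequence(samples, sample_rate=16000):
    """One feature vector per frame: the observation sequence."""
    return [frame_features(f) for f in frame_signal(samples, sample_rate)]
```

With these assumed settings, one second of 16 kHz audio yields 98 frames of 400 samples each, and the observation sequence is a 98 x 2 matrix.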
The text we use is made up of words. A word is a set of phonemes, and each phoneme can be subdivided into several states. For speech recognition, commonly used words must be broken down and described in terms of states. The biggest difficulty in speech recognition is recognizing each frame as a state; states are then combined into phonemes, and phonemes into words, to form readable text. To associate each frame with a state, the acoustic model must be trained on a large amount of speech data, with its parameters estimated under different conditions, so that the model, and with it the frame-to-state association, becomes increasingly accurate. This is the common supervised learning approach. Then, based on the trained acoustic model combined with the dictionary and the language model, the input sequence of speech frames is recognized; this is the decoding process.
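The frame-to-state, state-to-phoneme, and phoneme-to-word chain can be illustrated with a toy example. Everything here is hypothetical: the state naming scheme (`<phoneme>_<index>`), the two-entry `LEXICON`, and the assumption that each frame already carries a single hard state label. A real decoder instead searches over probabilistic scores from the acoustic and language models.

```python
from itertools import groupby

# Hypothetical pronunciation dictionary: phoneme sequence -> word.
LEXICON = {("HH", "AY"): "hi", ("N", "OW"): "no"}

def states_to_phonemes(frame_states):
    """Collapse per-frame state labels such as 'HH_0' into phonemes."""
    # Consecutive frames in the same state are merged into one entry.
    collapsed = [state for state, _ in groupby(frame_states)]
    phonemes = []
    for state in collapsed:
        phoneme, index = state.rsplit("_", 1)
        # Entering state 0 marks the start of a new phoneme.
        if index == "0":
            phonemes.append(phoneme)
    return phonemes

def phonemes_to_word(phonemes):
    """Look the phoneme sequence up in the dictionary (None if unknown)."""
    return LEXICON.get(tuple(phonemes))
```

For instance, the frame labels `HH_0, HH_0, HH_1, HH_2, AY_0, AY_1, AY_2` collapse to the phonemes `HH, AY`, which the toy dictionary maps to the word "hi".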
2) As the process above shows, a large amount of training data is the foundation of speech recognition technology, and algorithms become increasingly accurate as they are fed more data. The main speech recognition algorithms include dynamic time warping (DTW), vector quantization (VQ) methods based on nonparametric models, hidden Markov model (HMM) methods based on parametric models, artificial neural networks (ANN), and support vector machines. Because training involves a very large number of operations, high-performance speech recognition chips are also key to improving recognition performance. Using dedicated speech recognition AI chips to accelerate the large number of matrix operations in the recognition stage has become an important development direction for such chips.
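Of the algorithms listed above, dynamic time warping is the simplest to show in a few lines. The sketch below compares two one-dimensional feature sequences; using scalar features and an absolute-difference cost is an assumption made for brevity, since real recognizers compare multi-dimensional feature vectors.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance in a only, advance in b only,
            # or advance in both sequences.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

Because the alignment may stretch or compress time, a slowly spoken version of a template, such as `[1, 2, 2, 3]` against `[1, 2, 3]`, still scores a distance of 0, which is why DTW was used for matching an utterance against stored word templates.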
3) Many scenarios in our work and life rely on voice to convey information. With speech recognition technology, machines can understand what people say and then either carry out their commands or pass the result back for people to make decisions. The applicable industry scenarios are very broad, and almost every industry has some link where speech recognition can be applied. Specifically, the main applications of speech recognition at present include: intelligent customer service, customer service quality inspection, voice input in social tools, voice control in industries such as smart hardware and smart home, transcription of conferences and interviews, subtitle generation for videos, and identification and review of UGC voice content.
4) BAT (Baidu, Alibaba, and Tencent) have all made moves in speech recognition. Baidu's speech recognition is currently used mainly in transportation, application assistants, smart home, social networking, and gaming and entertainment. Alibaba Cloud packages its intelligent voice interaction technology, which has many applications in retail, urban management, and other areas, and promotes the adoption of speech recognition in IoT scenarios. Tencent Cloud's speech recognition has extensive customization experience in vertical fields such as social chat and gaming and entertainment.
5) The field of machine vision has produced the four AI dragons. Speech recognition also has several unicorn companies, such as the already listed iFLYTEK, as well as Yunzhisheng, Spiech, and Mobvoi. Applications of speech recognition are relatively concentrated, mainly in education, medical care, and smart home.
Founded in 1999, iFLYTEK is a representative company in the voice field. Its main C-side products include the iFLYTEK input method, iFLYTEK TV assistant, the smart voice assistant Migu Lingxi, iFLYTEK smart speakers, and iFLYTEK Hearing. On the B side, it offers voice engines, an open platform for voice technology, and solutions for industries such as education, telecommunications, public safety, consumer electronics, and construction. Its latest area of entry is the smart court: it has established cooperation with courts in many places to promote solutions such as an intelligent voice trial system. The field where iFLYTEK has gone deepest is the education industry, where it has launched products for teaching, examination, and practice, as well as children's smart hardware, and has launched smart education and smart campus solutions on the school side, which are currently an important