On August 12, AI Technology Review reported a key breakthrough on the VQA Leaderboard, the internationally authoritative benchmark for machine visual question answering: Alibaba DAMO Academy set a new record with an accuracy of 81.26%, marking the first time AI has surpassed the human benchmark at "reading and understanding" images. After AI exceeded human scores in visual recognition in 2015 and in text understanding in 2018, multimodal technology has now achieved a major advance as well.

Note: DAMO Academy's AliceMind set the first record to surpass humans on the VQA Leaderboard
Notably, it was also the DAMO Academy AI research team that, three years ago, made Chinese AI surpass humans in text understanding for the first time.
What is VQA?
Over the past decade, AI technology has developed rapidly, and AI models have surpassed human levels on many tasks and skills. In games, the reinforcement learning agent AlphaGo defeated the world's top Go player Lee Sedol in 2016; in visual understanding, convolutional neural networks (CNNs) surpassed human performance on the ImageNet classification task in 2015; and in text comprehension, Microsoft and Alibaba almost simultaneously pushed AI reading comprehension past the human benchmark on the Stanford SQuAD challenge in 2018.
However, in visual question answering (VQA), a high-level cognitive task that requires joint visual-text (multimodal) understanding, AI had never before surpassed the human level.
"Poetry is invisible painting, and painting is visible poetry." With these words the Song Dynasty poet Zhang Shunmin described the connection between language and vision. With the rapid development of deep learning, visual understanding, and text understanding, the integration of natural language processing and computer vision has gradually become an important frontier of multimodal research. Among these directions, VQA is a highly challenging core task, and solving it is of great significance for the development of general artificial intelligence.
To encourage progress on this problem, CVPR, the top global computer vision conference, has held the VQA Challenge for six consecutive years since 2015, attracting many leading institutions including Microsoft, Facebook, Stanford University, Alibaba, and Baidu, and producing the largest and most recognized VQA dataset internationally, with over 200,000 real photos and 1.1 million test questions.
VQA is one of the most difficult challenges in AI. In the test, the AI must generate a correct natural language answer given a picture and a natural language question. This means a single AI model has to integrate complex computer vision and natural language technologies: first scan all the image information, then combine it with an understanding of the text question, use multimodal techniques to learn the correlation between image and text and accurately locate the relevant image regions, and finally answer the question using common sense and reasoning.
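To make the task format concrete (image plus question in, natural-language answer out), here is a minimal sketch using a publicly available ViLT checkpoint on Hugging Face as an assumed stand-in; it illustrates the VQA setup only and is not DAMO Academy's AliceMind system. The image URL and question are purely illustrative.

```python
# Minimal VQA sketch: image + question -> answer, using a public ViLT model
# (assumption: the "dandelin/vilt-b32-finetuned-vqa" checkpoint; NOT DAMO's system).
from PIL import Image
import requests
from transformers import ViltProcessor, ViltForQuestionAnswering

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # illustrative test photo
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are in the picture?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Encode image and question jointly, then pick the highest-scoring answer
# from the model's fixed answer vocabulary.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)
```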
In June this year, Alibaba DAMO Academy won the VQA Challenge 2021 among the 55 teams that submitted entries, leading the runner-up by about one percentage point and last year's champion by 3.4 percentage points. Two months later, DAMO Academy set a new global record on the VQA Leaderboard with an accuracy of 81.26%, surpassing the human baseline of 80.83% for the first time.
This result means that, on closed datasets, AI's VQA performance is now comparable to that of humans.
In the more open real world, AI will inevitably face new challenges and will need more data and further model improvements. Still, as in the development of computer vision and other fields, this result is a milestone, and it is only a matter of time before VQA technology performs well in real-world settings.
How was the superhuman VQA score achieved?
The core difficulty of the VQA challenge is that, on top of accurate single-modal understanding, the model must integrate multimodal information for joint reasoning and cognition, and ultimately achieve cross-modal understanding, that is, perform semantic mapping and alignment of the different modalities within a unified model.
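As a generic illustration of what "semantic mapping and alignment" of modalities can mean in practice, the toy sketch below projects image and text features into a shared space and pulls matched pairs together with a contrastive loss. This is a common, simplified technique (CLIP-style), not DAMO Academy's specific method; all names and sizes are made up for illustration.

```python
# Toy cross-modal alignment: matched image/text pairs are pulled together in a
# shared embedding space via an InfoNCE-style contrastive loss (generic sketch).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_feats, text_feats, temperature=0.07):
    # image_feats, text_feats: (batch, dim); row i of each describes the same pair
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature       # pairwise image-text similarities
    targets = torch.arange(img.size(0))        # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Random features stand in for encoder outputs in this illustration.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```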
It is understood that, to tackle the VQA challenge, the DAMO Academy Language Technology Lab and Vision Lab systematically designed an AI visual-text reasoning system incorporating a large number of algorithmic innovations, including:
1. Diversified visual feature representation, depicting the local and global semantic information of an image from multiple aspects and using Region, Grid, Patch and other visual features for more accurate single-modal understanding;
2. Multimodal pre-training based on massive image-text data and multi-granularity visual features for better multimodal information fusion and semantic mapping, with the newly proposed pre-trained models SemVLP, Grid-VLP, E2E-VLP and Fusion-VLP;
3. Adaptive cross-modal semantic fusion and alignment technology, which adds a Learning to Attend mechanism to the multimodal pre-training model to fuse cross-modal information efficiently and deeply;
4. Knowledge-driven integration of multiple AI skills using Mixture of Experts (MoE) technology (a toy sketch of the MoE idea follows this list).
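As referenced in point 4, the sketch below shows the basic Mixture-of-Experts idea: a learned gate mixes the outputs of several expert networks. It is a toy illustration under assumed names and dimensions, not DAMO Academy's implementation.

```python
# Toy Mixture-of-Experts layer: a gate produces per-input weights that mix the
# outputs of several expert MLPs (illustrative only; not DAMO's implementation).
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim=256, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)  # routing weights per input

    def forward(self, x):                                        # x: (batch, dim)
        weights = torch.softmax(self.gate(x), dim=-1)            # (batch, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)   # weighted mixture

fused = ToyMoE()(torch.randn(2, 256))  # e.g., fused multimodal features
```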
Among them, the self-developed pre-training models E2E-VLP and StructuralLM have been accepted by ACL 2021, a top international conference.