The project has been financed with funds from the Language Technologies Plan of the Ministry of Economic Affairs and Digital Agenda and the Future Computing Center, an initiative of the BSC and IBM.
MarIA, which is the name of the system, is freely available so that any developer, company or entity can use it at no cost. Its possible applications range from language correctors or predictors to automatic summarization applications, chatbots, smart searches, translation engines and automatic subtitling, among others. The training data are the WARC files produced by crawling and archiving the Spanish web, which the BNE preserves as documentary heritage under the legal deposit law. The BSC has been able to use them to train the system thanks to the participation of both institutions in the Language Technologies Plan.
First massive AI model in the Spanish language
MarIA is a set of language models or, in other words, deep neural networks that have been trained to acquire an understanding of the language, its vocabulary and its mechanisms for expressing meaning and writing at an expert level. They manage to work with short and long interdependencies and are able to understand not only abstract concepts, but also their context.
The first step in creating a language model is to develop a corpus of words and phrases, which will be the basis on which the system will be trained.
59 terabytes (equivalent to 59,000 gigabytes) from the Biblioteca Nacional de España web archive were used to create the MarIA corpus. Subsequently, these files were processed to remove anything that was not well-formed text (such as page numbers, graphics, unfinished sentences, incorrect encodings, duplicate sentences, other languages, etc.), and only well-formed texts in the Spanish language, as it is actually used, were saved. This screening and the subsequent compilation required 6,910,000 processor hours on the MareNostrum supercomputer and produced 201,080,084 clean documents, occupying a total of 570 gigabytes of clean text without duplications.
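The kind of screening described above can be illustrated with a minimal Python sketch. It is not the BSC pipeline: the function names, the regular expressions and the three-word minimum are hypothetical choices made for this example, and language identification is left out entirely.

```python
import hashlib
import re

# Hypothetical heuristics for illustration only; the actual MarIA filters
# are not described in this article.
SENTENCE_END = re.compile(r'[.!?…"»)]$')
PAGE_NUMBER = re.compile(r"^\s*(página|page)?\s*\d+\s*$", re.IGNORECASE)


def is_well_formed(sentence: str) -> bool:
    """Reject page numbers, encoding debris and unfinished sentences."""
    text = sentence.strip()
    if not text or PAGE_NUMBER.match(text):
        return False
    if "\ufffd" in text:  # replacement character left by a decoding error
        return False
    if len(text.split()) < 3:  # too short to be a sentence (assumed threshold)
        return False
    return bool(SENTENCE_END.search(text))


def clean_document(sentences, seen_hashes):
    """Keep well-formed sentences not already seen anywhere in the corpus."""
    kept = []
    for sentence in sentences:
        if not is_well_formed(sentence):
            continue
        # Normalize case and whitespace so near-identical copies collide.
        normalized = " ".join(sentence.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(sentence.strip())
    return kept
```

Sharing one `seen_hashes` set across all documents is what removes duplicates corpus-wide rather than only within a single page. At the scale reported here, that set would have to be sharded or replaced by a more compact structure.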
This corpus exceeds the size and quality of the corpora available today by several orders of magnitude. It is a corpus that will enrich the digital heritage of Spanish and the BNE's own archive, and that may be used for multiple applications in the future, such as having a temporal snapshot that allows analyzing the evolution of the language, understanding the digital society as a whole and, of course, training new models.
Once the corpus was created, the BSC researchers used neural network technology (based on the Transformer architecture), which has shown excellent results in English, and trained it to learn to use the language. Multilayer neural networks are an artificial intelligence technology, and the training consists, among other techniques, in presenting texts with hidden words to the network, so that it learns to work out which word is hidden given its context.
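The hidden-word exercise can be sketched in a few lines of Python. This is a simplified illustration rather than the BSC training code: it masks whole words instead of the subword tokens that RoBERTa-style models actually use, and the `<mask>` placeholder and the 15% masking rate are assumptions borrowed from common practice, not figures given in this article.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style placeholder; an assumption for this sketch


def mask_words(words, mask_prob=0.15, rng=None):
    """Hide a random subset of words and remember what was hidden.

    Returns the masked sequence and a dict mapping each masked position
    to the original word, which is what the network must learn to predict.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for position, word in enumerate(words):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = word
        else:
            masked.append(word)
    return masked, targets


# Example: a fixed seed makes the masking reproducible.
sentence = "La biblioteca conserva el patrimonio documental de la web española".split()
masked, targets = mask_words(sentence, mask_prob=0.3, rng=random.Random(7))
```

During training, the network sees `masked` as input and is scored on how well it recovers the words in `targets`, so the text itself provides the labels and no manual annotation is needed.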
This training required 184,000 processor hours and more than 18,000 GPU hours. The models released so far have 125 million and 355 million parameters respectively.
Marta Villegas, project director and BSC text mining group leader, explains the importance of being able to implement new Artificial Intelligence technologies, “which are completely transforming the field of natural language processing. With this project, we contribute to the country joining this scientific-technical revolution and positioning itself as a full-fledged actor in the computational treatment of the Spanish language”.
Alfonso Valencia, BSC Life Sciences department director, argues that “the BSC's High Performance Computing infrastructure has proven to be essential for this type of large project, which requires both a lot of computing and large amounts of data. For us, it is very satisfactory to put technical capacities and expert knowledge at the service of a project with so many repercussions for the position of Spanish in the digital society”.
The Biblioteca Nacional de España, as established by its regulatory law, has among its functions “to promote and support research programs aimed at generating knowledge about its collections, establishing spaces for dialogue with research centers”. With this project, framed in the Language Technologies Plan, the BNE explores new ways of exploiting the data and collections it conserves, and seeks to promote reuse and innovative research projects and to improve citizens' access to data.
After releasing the general models, the BSC text mining team is working on expanding the corpus with new file sources that will provide texts with different particularities from those found in web environments, such as scientific publications from the CSIC.
The generation of models trained with text from different languages is also planned: Spanish, Catalan, Galician, Euskera, Portuguese and Spanish from Latin America.
BSC and Plan-TL
The BSC is the technical office of the Language Technologies Plan (Plan-TL) of the Secretary of State for Digitalization and Artificial Intelligence (SEDIA). As such, its mission is to facilitate the development of more competitive language systems for society, companies and research groups, making both general and specific language models public (for domains such as biomedicine or legal) and releasing sets of text for training and evaluating new models.
Further information about the Plan-TL: https://plantl.mineco.gob.es/Paginas/index.aspx
RoBERTa-base model: https://huggingface.co/BSC-TeMU/roberta-base-bne
RoBERTa-large model: https://huggingface.co/BSC-TeMU/roberta-large-bne
Data repository: https://github.com/PlanTL-SANIDAD/lm-spanish
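For developers who want to try the released models, the following is a hedged sketch of how one of them might be queried with the Hugging Face `transformers` library. The model identifier is simply taken from the RoBERTa-base URL listed above and may have been moved or renamed since publication. `build_prompt` and `predict_hidden_word` are hypothetical helper names introduced for this example, and running the prediction requires `transformers` to be installed and the model to be downloadable.

```python
# Identifier copied from the model URL in this article; it is an assumption
# that the model is still published under this name.
MODEL_ID = "BSC-TeMU/roberta-base-bne"


def build_prompt(sentence: str, hidden_word: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of hidden_word with the mask token."""
    if hidden_word not in sentence:
        raise ValueError(f"{hidden_word!r} does not occur in the sentence")
    return sentence.replace(hidden_word, mask_token, 1)


def predict_hidden_word(prompt: str, model_id: str = MODEL_ID, top_k: int = 3):
    """Ask the model for its most likely fillers for the masked word."""
    # Imported lazily so the prompt helper works without the library installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    return [
        (candidate["token_str"].strip(), candidate["score"])
        for candidate in fill_mask(prompt, top_k=top_k)
    ]


# Usage (downloads the model on first call):
#   prompt = build_prompt("Madrid es la capital de España.", "capital")
#   for word, score in predict_hidden_word(prompt):
#       print(word, round(score, 3))
```

The prompt contains a single masked word, which mirrors the hidden-word objective the models were trained on.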
Source: BSC