The Lattes platform has hosted the curricula of Brazilian researchers since the late 1990s and contains more than 5 million curricula. The data from the Lattes curricula can be downloaded in XML format. The complexity of reading this format motivated the development of the getLattes package, which imports the information from the XML files into a list in R and then tabulates the Lattes data into a data.frame.
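The two steps described above (XML to list, then list to data.frame) can be sketched by hand. This is only an illustration of the idea, not the internals of getLattes: it assumes the xml2 package is installed, and the XML fragment is a hypothetical, heavily simplified stand-in for a real Lattes file (the element and attribute names are assumed, and the identifier and name are made up).

```r
library(xml2)

# Hypothetical, heavily simplified stand-in for a Lattes XML file.
sample_xml <- '<CURRICULO-VITAE NUMERO-IDENTIFICADOR="0000000000000000">
  <DADOS-GERAIS NOME-COMPLETO="Pesquisadora Exemplo" PAIS-DE-NASCIMENTO="Brasil"/>
</CURRICULO-VITAE>'

doc <- read_xml(sample_xml)

# Step 1: XML -> R list (the attributes of a single element).
general <- as.list(xml_attrs(xml_find_first(doc, "//DADOS-GERAIS")))

# Step 2: list -> one-row data.frame with R-friendly column names.
names(general) <- tolower(gsub("-", "_", names(general), fixed = TRUE))
general_df <- as.data.frame(general, stringsAsFactors = FALSE)

# Keep the curriculum identifier so rows from different files can be joined.
general_df$id <- xml_attr(xml_root(doc), "NUMERO-IDENTIFICADOR")

print(general_df)
```

Real curricula contain many nested and repeated elements, which is where writing this by hand becomes tedious and a dedicated package pays off.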
The main information contained in the XML files can be imported with the following getLattes functions:
getApresentacaoTrabalho
getAreasAtuacao
getArtigosPublicados
getAtuacoesProfissionais
getBancasDoutorado
getBancasGraduacao
getBancasJulgadoras
getBancasMestrado
getCapitulosLivros
getCursoCurtaDuracao
getDadosGerais
getEnderecoProfissional
getEventosCongressos
getFormacao
getIdiomas
getJornaisRevistas
getLinhaPesquisa
getLivrosPublicados
getOrganizacaoEvento
getOrientacoesDoutorado
getOrientacoesMestrado
getOrientacoesOutras
getOrientacoesPosDoutorado
getOutrasProducoesBibliograficas
getOutrasProducoesTecnicas
getParticipacaoProjeto
getPrefacio
getPremiosTitulos
getProducaoTecnica
getProgramaRadioTV
getRelatorioPesquisa
getTrabalhosEventos
With the functionality provided by this package, the main remaining challenge in working with Lattes curriculum data is downloading it, because the site is protected by CAPTCHAs. To download a large number of curricula, I suggest using Captchas Negated by Python reQuests - CNPQ. The second barrier is managing and processing a large volume of data: the whole Lattes platform in XML files totals over 200 GB. This tutorial focuses on the features of the getLattes package; the reader is responsible for downloading and managing the files.
The following is an example of how to search for and download data from the Lattes website.