9 Jan. 2017

A two-sided academic landscape: snapshot of highly-cited documents in Google Scholar (1950-2013)

Alberto Martin-Martin, Enrique Orduna-Malea, Juan M. Ayllón, Emilio Delgado López-Cózar
Revista Española de Documentación Científica, 39(4): e149
DOI 10.3989/redc.2016.4.1405
Access to the Full Text

OBJECTIVES
The main objective of this paper is to identify the set of highly-cited documents in Google Scholar and define their core characteristics, in order to give an answer to the following research questions:
• Which are the most cited documents in Google Scholar?
• Which is the most frequent document type for these highly-cited documents?
• In what languages are the most cited documents written?
• How many highly-cited documents are freely accessible?
• What are the most common file formats to store these highly cited documents?
• Which are the main providers of these highly-cited full-text documents?
METHODOLOGY
Sample
64,000 documents published between 1950 and 2013 (1,000 per year)
Design
A longitudinal analysis was carried out by performing 64 keyword-free year queries from 1950 to 2013 (one query per year). All the records displayed (a maximum of 1,000 per query) were extracted, obtaining a final set of 64,000 records.
This process was carried out twice (on 28 May and on 2 June 2014).
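The size of the sample follows directly from the design described above. The sketch below only models that arithmetic; it does not query Google Scholar, and the names `build_query_plan`, `FIRST_YEAR`, `LAST_YEAR` and `MAX_RESULTS_PER_QUERY` are illustrative assumptions, not part of the study.

```python
# Illustrative only: models the sampling plan, does not contact Google Scholar.
FIRST_YEAR = 1950
LAST_YEAR = 2013
MAX_RESULTS_PER_QUERY = 1000  # maximum number of records displayed per query


def build_query_plan(first_year, last_year, cap):
    """Return one (year, cap) entry per publication year, inclusive."""
    return [(year, cap) for year in range(first_year, last_year + 1)]


plan = build_query_plan(FIRST_YEAR, LAST_YEAR, MAX_RESULTS_PER_QUERY)
total_records = sum(cap for _, cap in plan)

# Number of year queries and the upper bound on extracted records.
print(len(plan), total_records)
```

With one query per year and every query reaching the cap, the plan yields the 64 queries and 64,000 records reported in the study.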
Period analyzed
1950-2013
RESULTS
Which are the most cited documents in Google Scholar?
The most cited document according to Google Scholar is the article by Lowry et al., with 253,671 citations (as of May 2014), followed by Laemmli's article, with 221,680 citations.
Although the ranking is dominated by studies from the natural sciences (especially the life sciences), it also contains many works from the social sciences (especially from economics, psychology and sociology), and also from the humanities (philosophy and history). 
- Many of the works in this ranking are methodological in nature: they describe the steps of a certain procedure or how to handle basic tools to process and analyse data. This is exemplified by the presence of manuals (statistical, laboratory, research methodology) and of works that have become a de facto standard in professional practice.
In fact, books are the most common category among the top 1% most cited documents, constituting 62% (395) of this subsample, followed by journal articles with 36.01% (231). Moreover, the citation average of books (2,700) is higher than that of journal articles (1,700).
Which is the most frequent document type for these highly-cited documents?
- The document type was identified for 71% (45,440) of the documents sampled, whereas the typology of the other 29% (18,590) remained unknown.
- Journal articles (including reviews, letters and notes) predominate, representing 51% of the total 64,000 documents (72.3% of the documents with a defined document type). Books and book chapters together also make up a large part of the sample (18%; 11,240 items), while the presence of conference proceedings and other typologies (meeting abstracts, corrections, editorial material, etc.) is marginal (1% each) (Fig. 1).
In what languages are the most cited documents written?
English dominates over the rest of the languages as the most widely used language for scientific communication in Google Scholar, accounting for 92.5% of all the documents. The second and third places are occupied by Spanish and Portuguese respectively, but neither of them reaches even 2% of the total (Fig. 2).
How many highly-cited documents are freely accessible?
A free full-text link is provided for 40% (25,849) of all the highly-cited documents retrieved (Figure 3; top). We can also observe a positive trend through the analyzed period (from 25.93% of documents with free full-text links in the period 1950-1959, to 66.84% in 2000-2009).
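The per-decade trend described above can be reproduced from any list of records that carries a publication year and a flag for the presence of a free full-text link. This is a minimal sketch under that assumption; the function name `free_share_by_decade` and the toy records are invented for illustration and are not the study's data.

```python
from collections import defaultdict


def free_share_by_decade(records):
    """records: iterable of (year, has_free_link). Returns {decade: percent}."""
    totals = defaultdict(int)
    free = defaultdict(int)
    for year, has_free_link in records:
        decade = year - year % 10  # e.g. 1957 -> 1950
        totals[decade] += 1
        if has_free_link:
            free[decade] += 1
    return {d: 100.0 * free[d] / totals[d] for d in sorted(totals)}


# Invented toy records, NOT the study's data.
toy_records = [
    (1950, True), (1953, False), (1957, False), (1959, False),
    (2001, True), (2004, True), (2008, False),
]
decade_shares = free_share_by_decade(toy_records)
for decade, share in decade_shares.items():
    print(f"{decade}s: {share:.2f}% with a free full-text link")
```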
What are the most common file formats to store these highly cited documents?
The most common one is the pdf format (86.0% of all full-text documents), followed by the html format (12.1%). The remaining identified file formats (doc, ps, txt, rtf, xls, ppt) together represent only 1.9% of the freely available documents. The predominance of the pdf format is evident throughout the entire range of years (Fig. 4).
Which are the main providers of these highly cited full text documents?
A total of 5,715 different providers of free full-text links to highly cited documents were found in the sample. However, a group of 35 providers (18 universities; 5 scientific societies; 4 publishers; 2 companies; 2 public administrations; 1 journal; 1 digital library; 1 repository; 1 academic social network) accounts for more than a third of all the links (37%). If we analyse the top-level domains of the 25,849 links to freely available full-text documents, the most frequent are academic institutions (.edu; 23.74%) and organizations (.org; 21.39%).
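A top-level-domain breakdown like the one above can be obtained by parsing the host of each full-text link. The following is a rough sketch, not the authors' procedure; `top_level_domain`, `tld_shares` and the example URLs are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def top_level_domain(url):
    """Return the last label of the host name, e.g. 'edu' for www.example.edu."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def tld_shares(urls):
    """Return {tld: percent of links}, most frequent first."""
    counts = Counter(top_level_domain(u) for u in urls)
    total = sum(counts.values())
    return {tld: 100.0 * n / total for tld, n in counts.most_common()}


# Invented example links, NOT taken from the study's sample.
toy_links = [
    "http://www.example.edu/paper.pdf",
    "https://repo.example.edu/item/1",
    "http://example.org/doc.html",
    "https://example.com/a.pdf",
]
tld_share = tld_shares(toy_links)
print(tld_share)
```

Taking only the last host label is a simplification: it treats country-code hosts such as `example.ac.uk` as `uk`, so an academic/non-academic split would need extra rules.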
Versions
83.17% (53,229) of the documents analyzed have more than one version (Table IV).

CONCLUSIONS

In light of the results obtained, we can conclude that Google Scholar offers an original and different vision of the most influential documents in the academic/scientific environment (measured from the perspective of their citation count). These results are a faithful reflection of the all-encompassing indexing policies that enable Google Scholar to retrieve a larger and more diverse number of citations, since they come from a wider range of document types, different geographical environments, and languages.
Therefore, Google Scholar covers not only seminal research works across the entire spectrum of scientific fields, but also the highly influential works that scientists, teachers and professionals who are training to become practitioners use in their respective fields. This phenomenon is particularly true for works that deal with new data collecting and processing techniques.


What this study adds

Thanks to the wide and diverse list of sources from which Google Scholar feeds, this search engine covers academic documents in a broader sense, enabling the measurement of impact stemming not only from the scientific side of the academic landscape, but also from the educational side (doctoral dissertations, handbooks) and from the professional side (working papers, technical reports, patents), the last two being areas that haven’t been explored as much as the first one.

4 Jan. 2017

Can we use Google Scholar to identify highly-cited documents?

Alberto Martin-Martin, Enrique Orduna-Malea, Anne-Wil Harzing, Emilio Delgado López-Cózar
Journal of Informetrics, 2017, 11(1), 152-163
DOI 10.1016/j.joi.2016.11.008
Access to the Full Text

OBJECTIVES
This paper has two main objectives:
1. Verify whether it is possible to reliably identify the most highly-cited papers in Google Scholar, and indirectly
2. Empirically validate whether citations are the primary result-ordering criterion in Google Scholar for generic queries or whether other factors substantially influence the rank order
METHODOLOGY
Sample
64,000 documents published between 1950 and 2013 (1,000 per year)
Design
A generic query was run by conducting a null query (the search box is left blank), filtering only by publication year using Google Scholar's advanced search function. In this way, we avoided the sampling bias caused by the keywords of a specific query and by other academic search engine optimisation issues. In order to work with a sufficiently large data sample, a longitudinal analysis was carried out by performing 64 generic null queries from 1950 to 2013 (one query per year). Whereas 2013 was the last complete available year when our data collection was carried out, 1950 was selected because this particular year reflected an increase in coverage in comparison to the preceding years.
Period analyzed
1950-2013
RESULTS
The overall correlation between the number of citations received by the 64,000 documents and the position they occupied on the results page of Google Scholar at the time of the query is r = −0.67 (p < 0.05). The average annual value of the correlation coefficient is very high (the correlation values are negative because position 1 is better than position 1,000) (Fig. 1).
- The correlation for the results placed amongst the top 900 positions is r = 0.97 (p < 0.01). However, the correlation obtained for results in the last 100 positions is only r = 0.61 (p < 0.01). In Fig. 2, the results located in the first 900 positions of each search are displayed in green, while the results in the last 100 positions are shown in red. In this way we can see clearly how, until approximately the 900th position, the Google Scholar sorting criteria are based largely on the number of citations received by each result. However, after approximately the 900th position, the data show erratic results in terms of the correlation between citations and position (Fig. 2).
- The correlation between the position of a document and the number of versions is low, but significant (r = −0.30; p < 0.01). The average correlation per year is slightly higher (r = −0.33; p = 0.04). Despite the wide dispersion of data, there is a slight concentration of documents with between 100 and 300 versions amongst the first 100 rank positions (Fig. 3).
The annual average share of documents in English among the results within the first 100 positions is 99.5%. Therefore, the presence of documents in other languages within this range is exceptional. When analysing this same percentage for the documents in the last 100 positions, the results change significantly: the annual average drops to 34.2% (Fig. 4).
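The rank-citation correlations reported above, split at approximately the 900th position, can be illustrated as follows. The summary does not state which correlation coefficient was used, so Pearson's r is an assumption here; `pearson_r`, `correlation_by_band` and the toy result list are likewise invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_band(results, split_rank):
    """results: list of (rank, citations). Returns (r_top, r_tail),
    where the top band holds ranks <= split_rank and the tail the rest."""
    top = [(r, c) for r, c in results if r <= split_rank]
    tail = [(r, c) for r, c in results if r > split_rank]
    return pearson_r(*zip(*top)), pearson_r(*zip(*tail))


# Invented toy list of 10 results (the study used 1,000 per year, split near 900).
toy_results = [
    (1, 100), (2, 90), (3, 80), (4, 70), (5, 60), (6, 50), (7, 40),  # ordered by citations
    (8, 5), (9, 50), (10, 20),                                       # erratic tail
]
r_top, r_tail = correlation_by_band(toy_results, split_rank=7)
print(f"top band r = {r_top:.2f}, tail band r = {r_tail:.2f}")
```

In the toy data the top band is perfectly ordered by citations, so its coefficient is strongly negative (rank 1 is best), while the tail shows a much weaker relationship, mirroring the pattern the study describes.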



CONCLUSIONS

Significant and high correlation between the number of citations and the ranking of the documents retrieved by Google Scholar was obtained for a generic query filtered only by year. The fact that we minimised the effects of academic search engine optimisation, together with the size of the sample analysed (64,000 documents), leads us to conclude that the number of citations is a key factor in the ranking of the results and, therefore, that Google Scholar is able to identify highly-cited papers effectively. Given the unique coverage of Google Scholar (no restrictions on document type and source), this makes it an invaluable tool for bibliometric analysis.



What this study adds

Google Scholar can be used to reliably identify the most highly-cited academic documents. Given its wide and varied coverage, Google Scholar has become a useful complementary tool for bibliometric research concerned with the identification of the most influential scientific works.