21 Dec. 2016

Does Dirty Data Affect Google Scholar Citations? The case of the academic profiles of 11 Turkish researchers

G Doğan, I Şencan, Y Tonta
Does dirty data affect google scholar citations?
ASIST '16. Proceedings of the 79th ASIS&T Annual Meeting: Creating Knowledge, Enhancing Lives through Information & Technology. Copenhagen, Denmark, October 14-18, 2016


The main goal of this study is to find out whether Google Scholar citation metrics fluctuate because of duplicate publications and citations in the database. The following research questions are addressed:

- Does the Google Scholar database include duplicate publications and citations in researchers’ profiles?

- If yes, what is the impact of this practice on citation counts and Google Scholar Citations metrics such as h- and i10-index values? 
Answering these questions will shed some light on the size of the problem and help us better interpret the rankings and metrics based on GS data.

The 11 researchers based at Hacettepe University’s Department of Information Management with public GS profiles (as of January 27, 2016) were selected. Data were collected and cleaned between January 27 and March 18, 2016.
The Google Scholar profiles of the 11 researchers were checked to identify duplicate records for the same publication. Next, the number of different records for each publication and the citations thereto were identified, along with singular publication counts for each researcher and combined citation counts for each publication. The h- and i10-indexes were then re-calculated for each researcher using the new publication and combined citation counts, and compared with the Google Scholar Citations metrics.
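
The re-calculation step described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the record layout (a publication key paired with a set of citing-document identifiers), the function names, and the sample data are all hypothetical. Duplicate records are combined by taking the union of their citing documents, which is one plausible way to drop duplicate citations at the same time.

```python
from collections import defaultdict


def h_index(citations):
    """Largest h such that at least h publications have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of publications with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


def combine_duplicates(records):
    """Combine duplicate records of the same publication.

    `records` is a list of (publication_key, citing_ids) pairs. This layout
    is an assumption made for illustration. Citing documents are merged as
    a set, so a citation listed under two duplicate records counts once.
    """
    merged = defaultdict(set)
    for key, citing_ids in records:
        merged[key] |= set(citing_ids)
    return {key: len(citing) for key, citing in merged.items()}


# Hypothetical profile: "paper-a" appears twice and shares citation "c3".
records = [
    ("paper-a", {"c1", "c2", "c3"}),
    ("paper-a", {"c3", "c4"}),
    ("paper-b", {"c5"}),
]
counts = combine_duplicates(records)
new_h = h_index(counts.values())
new_i10 = i10_index(counts.values())
```

Comparing `new_h` and `new_i10` with the values reported on the profile page would then show whether cleaning the data changes the metrics.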


Duplicate Publications
- 14% (n=69) of publications (N=499) were represented by more than one record (mostly 2, max. 5)
- Excluding duplicate records barely reduced the number of citations (only 4 out of 69 publications were affected)
- None of the researchers’ re-calculated h-indexes changed, and only one researcher’s i10-index increased by 1
Duplicate Citations
- 135 publications (55%) received a total of 364 duplicate citations: 12% of all citations (3,079)
- When duplicate citations were removed, citation counts of half of the 135 publications decreased by at least two citations
- Citation counts of almost all researchers decreased, some as much as by 20%
- h-indexes of more than half the researchers decreased by at least 1
- i10-indexes of four researchers decreased by 2 and 4, although one researcher’s i10-index increased by 1

Confirming our hypothesis

We cannot generalize. The sample is small and skewed (11 Turkish researchers), and national, linguistic and disciplinary peculiarities may play a role.
Further studies with larger and more representative samples are needed.


20 Dec. 2016

H-index manipulation by merging articles in Google Scholar Profiles: Models, theory, and experiments

R van Bevern, C Komusiewicz, R Niedermeier, M Sorge, T Walsh
H-index manipulation by merging articles: Models, theory, and experiments
Artificial Intelligence 2016, 240: 9–35

The H-index is a widely used measure for estimating the productivity and impact of researchers, journals, and institutions. Several publicly accessible databases such as AMiner, Google Scholar, Scopus, and Web of Science compute the H-index of researchers. Such metrics are therefore visible to hiring committees and funding agencies when comparing researchers and proposals. 

Although the H-index of Google Scholar profiles is computed automatically, profile owners can still affect their H-index by merging articles in their profile. The merge option is intended to let researchers identify different versions of the same article. Merging may decrease a researcher’s H-index if both articles counted towards it before merging, or increase it, since the merged article may have more citations than each of the individual articles. Since the Google Scholar interface permits merging arbitrary pairs of articles, the H-index of Google Scholar profiles is vulnerable to manipulation by insincere authors.
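
A small Python sketch can show both effects of merging on the H-index. It rests on a simplifying assumption: the merged article receives the sum of the two citation counts. This is only one possible measure, not necessarily the one Google Scholar applies. The `merge` helper and the citation counts are hypothetical.

```python
def h_index(citations):
    """Largest h such that at least h articles have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def merge(citations, i, j):
    """Return a new list with articles i and j replaced by one merged article.

    Assumption for illustration: the merged article's citation count is the
    sum of the two original counts.
    """
    rest = [c for k, c in enumerate(citations) if k not in (i, j)]
    return rest + [citations[i] + citations[j]]


# Merging two weakly cited articles can raise the H-index ...
profile_up = [3, 3, 2, 1]
before_up = h_index(profile_up)                # 2
after_up = h_index(merge(profile_up, 2, 3))    # 3, counts are now [3, 3, 3]

# ... while merging two articles that both counted towards it lowers it.
profile_down = [2, 2]
before_down = h_index(profile_down)                # 2
after_down = h_index(merge(profile_down, 0, 1))    # 1, counts are now [4]
```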


1. We propose two further ways of measuring the number of citations of a merged article. One of them seems to be the measure used by Google Scholar.

2. We propose a model for restricting the set of allowed merge operations. Although Google Scholar allows merges between arbitrary articles, such a restriction is well motivated: An insincere author may try to merge only similar articles in order to conceal the manipulation.

3. We consider the variant of H-index manipulation in which only a limited number of merges may be applied in order to achieve a desired H-index. This is again motivated by the fact that an insincere author may try to conceal the manipulation by performing only a few changes to her or his own profile.

4. We analyze each problem variant presented here within the framework of parameterized computational complexity. That is, we identify parameters p (properties of the input measured in integers) and aim to design fixed-parameter algorithms, which have running time f(p) · n^O(1) for a computable function f independent of the input size n. In some cases, this allows us to give efficient algorithms for realistic problem instances despite the NP-hardness of the problems in general. We also show parameters that presumably cannot lead to fixed-parameter algorithms by showing some problem variants to be W[1]-hard for these parameters.

5. We evaluate our theoretical findings by performing experiments with real-world data based on the publication profiles of AI researchers. In particular, we use profiles of some young and up-and-coming researchers from the 2011 and 2013 editions of the IEEE “AI’s 10 to watch” list.
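
The bounded-merge variant in item 3 can be made concrete with a brute-force Python sketch. This is not one of the paper's algorithms. It tries every pair of articles exhaustively, takes exponential time, assumes that any two articles may be merged, and assumes that a merged article receives the sum of the citation counts. The function name `can_reach` and the sample profiles are hypothetical.

```python
from itertools import combinations


def h_index(citations):
    """Largest h such that at least h articles have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def can_reach(citations, target, max_merges):
    """Check whether H-index >= target is reachable with at most max_merges merges.

    Exhaustive search, usable only on tiny profiles. Assumptions: any pair of
    articles may be merged, and merged counts are summed.
    """
    if h_index(citations) >= target:
        return True
    if max_merges == 0 or len(citations) < 2:
        return False
    for i, j in combinations(range(len(citations)), 2):
        rest = [c for k, c in enumerate(citations) if k != i and k != j]
        merged_profile = rest + [citations[i] + citations[j]]
        if can_reach(merged_profile, target, max_merges - 1):
            return True
    return False


one_merge_enough = can_reach([3, 3, 2, 1], target=3, max_merges=1)
one_merge_too_few = can_reach([1, 1, 1, 1], target=2, max_merges=1)
two_merges_enough = can_reach([1, 1, 1, 1], target=2, max_merges=2)
```

In the second profile a single merge yields counts [1, 1, 2], which still has H-index 1, while two merges can yield [2, 2]. The search space grows quickly with the number of articles, which is why the parameterized analysis in item 4 matters for realistic profiles.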