Authors
Bridges, Michael
Heron, Elizabeth A
O'Dushlaine, Colm
Segurado, Ricardo
Morris, Derek
Corvin, Aiden
Gill, Michael
Pinto, Carlos
Affiliation
Astrophysics Group, Cavendish Laboratory, Cambridge, United Kingdom.
Issue Date
2011
MeSH
Bulgaria
Case-Control Studies
Genetics, Population
Genome-Wide Association Study
Humans
Learning
Nerve Net
Polymorphism, Single Nucleotide
Principal Component Analysis
Scotland
Metadata
Citation
Genetic classification of populations using supervised learning. 2011, 6 (5):e14802 PLoS ONE
Journal
PLoS ONE
DOI
10.1371/journal.pone.0014802
PubMed ID
21589856
Abstract
There are many instances in genetics in which we wish to determine whether two candidate populations are distinguishable on the basis of their genetic structure. Examples include populations which are geographically separated, case-control studies and quality control (when participants in a study have been genotyped at different laboratories). This latter application is of particular importance in the era of large-scale genome-wide association studies, when collections of individuals genotyped at different locations are being merged to provide increased power. The traditional method for detecting structure within a population is some form of exploratory technique such as principal components analysis. Such methods, which do not utilise our prior knowledge of the membership of the candidate populations, are termed unsupervised. Supervised methods, on the other hand, are able to utilise this prior knowledge when it is available.
In this paper we demonstrate that in such cases modern supervised approaches are a more appropriate tool for detecting genetic differences between populations. We apply two such methods (neural networks and support vector machines) to the classification of three populations (two from Scotland and one from Bulgaria). The sensitivity exhibited by both these methods is considerably higher than that attained by principal components analysis and in fact comfortably exceeds a recently conjectured theoretical limit on the sensitivity of unsupervised methods. In particular, our methods can distinguish between the two Scottish populations, where principal components analysis cannot. We suggest, on the basis of our results, that a supervised learning approach should be the method of choice when classifying individuals into pre-defined populations, particularly in quality control for large-scale genome-wide association studies.
Item Type
Article
Language
en
ISSN
1932-6203
Related articles
- KLFDAPC: a supervised machine learning approach for spatial genetic structure analysis.
  - Authors: Qin X, Chiang CWK, Gaggiotti OE
  - Issue date: 2022 Jul 18
- Searching for disease susceptibility variants in structured populations.
  - Authors: Roeder K, Luca D
  - Issue date: 2009 Jan
- A Hybrid Supervised Approach to Human Population Identification Using Genomics Data.
  - Authors: Araghi S, Nguyen T
  - Issue date: 2021 Mar-Apr
- Choice of population structure informative principal components for adjustment in a case-control study.
  - Authors: Peloso GM, Lunetta KL
  - Issue date: 2011 Jul 19
- Assessing the power of principal components and wright's fixation index analyzes applied to reveal the genome-wide genetic differences between herds of Holstein cows.
  - Authors: Smaragdov MG, Kudinov AA
  - Issue date: 2020 Apr 28