Astrostatistics: The Role of Statistics in Astronomical Research
Eric Feigelson
Center for Astrostatistics
Penn State University
BigSkyEarth, DLR Germany, April 2016
  
1	
  
The underlying situation
Astronomers are well-trained in the mathematics underlying
physics, but not in applied fields associated with statistical
methodology.

Consequently, many astronomers use a narrow suite of familiar
statistical methods that are often non-optimal, and sometimes
incorrectly applied, for a wide range of data and science analysis
challenges.

This talk highlights some common problems in recent astronomical
studies, and encourages use of improved methodology.
  
2	
  
Outline of this talk
Ø  Astrostatistics = Astronomy + Statistics: not so simple
Ø  History of astronomy & statistics: good → bad
Ø  Astrostatistics today: improving
Ø  R: The premier statistical computing environment
Ø  Common statistical problems in astronomical research
  
	
  
3	
  
What is astronomy?
Astronomy is the observational study of matter beyond Earth:
planets in the Solar System, stars in the Milky Way Galaxy,
galaxies in the Universe, and diffuse matter between these
concentrations.
Astrophysics is the study of the intrinsic nature of astronomical
bodies and the processes by which they interact and evolve.
This is an indirect, inferential intellectual effort based on the
assumption that the laws of physics (gravity, electromagnetism,
quantum mechanics, etc.) apply universally to distant cosmic
phenomena.
4	
  
What is statistics? (No consensus !!)
–  “… briefly, and in its most concrete form, the object of statistical
methods is the reduction of data”
(R. A. Fisher, 1922)
–  “Statistics is the mathematical body of science that pertains to the
collection, analysis, interpretation or explanation, and presentation
of data.”
(Wikipedia, 2014.0)
–  “Statistics is the study of the collection, analysis, interpretation,
presentation and organization of data.”
(Wikipedia, 2014.7)
–  “A statistical inference carries us from observations to conclusions
about the populations sampled”
(D. R. Cox, 1958)
5	
  
Does statistics relate to scientific models?
The pessimists …
“Essentially, all models are wrong, but some are useful.”
(Box & Draper 1987)
“There is no need for these hypotheses to be true, or even to be at
all like the truth; rather … they should yield calculations which
agree with observations” (Osiander’s Preface to Copernicus’
De Revolutionibus, quoted by C. R. Rao)
"The object [of statistical inference] is to provide ideas and
methods for the critical analysis and, as far as feasible, the
interpretation of empirical data ... The extremely challenging
issues of scientific inference may be regarded as those of
synthesising very different kinds of conclusions if possible into a
coherent whole or theory ... The use, if any, in the process of
simple quantitative notions of probability and their numerical
assessment is unclear."
(D. R. Cox, 2006)
6	
  
The positivists …
“The goal of science is to unlock nature’s secrets. … Our
understanding comes through the development of theoretical
models which are capable of explaining the existing
observations as well as making testable predictions. …
“Fortunately, a variety of sophisticated mathematical and
computational approaches have been developed to help us
through this interface, these go under the general heading of
statistical inference.”
(P. C. Gregory, Bayesian Logical Data Analysis for the
Physical Sciences, 2005)
7	
  
Recommended steps in the
statistical analysis of scientific data
The application of statistics can reliably quantify information
embedded in scientific data and help adjudicate the relevance
of theoretical models. But this is not a straightforward,
mechanical enterprise. It requires:
Ø exploration of the data
Ø careful statement of the scientific problem
Ø model formulation in mathematical form
Ø choice of statistical method(s)
Ø calculation of statistical quantities
Ø judicious scientific evaluation of the results
Astronomers often do not adequately pursue each step
8	
  
•  Modern statistics is vast in its scope and methodology. It is difficult
to find what may be useful (jargon problem!), and there are usually
several ways to proceed. Very confusing.
•  Some statistical procedures are based on mathematical proofs
which determine the applicability of established results. It is perilous
to violate mathematical truths! Some issues are debated among
statisticians, or have no known solution.
•  Scientific inferences should not depend on arbitrary choices in
methodology & variable scale. Prefer nonparametric & scale-
invariant methods. Try multiple methods.
•  It can be difficult to interpret the meaning of a statistical result with
respect to the scientific goal. Statistics is only a tool towards
understanding nature from incomplete information.
We should be knowledgeable in our use of statistics
and judicious in its interpretation	
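
To make the advice on nonparametric and scale-invariant methods concrete, the following Python sketch compares Pearson's r with Kendall's tau before and after a logarithmic change of variable. The data values and function names are illustrative assumptions, not taken from the talk. The rank-based statistic is unchanged by the monotone transformation, while the linear correlation coefficient is not.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs / all pairs (no tie correction)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)   # +1 concordant, -1 discordant
    return s / (n * (n - 1) / 2)

# Hypothetical measurements spanning several decades in one variable.
flux = [1.2, 3.5, 10.0, 42.0, 150.0, 900.0, 5000.0]
size = [0.8, 1.1, 0.9, 2.5, 2.0, 4.1, 3.9]
log_flux = [math.log10(f) for f in flux]

r_linear = pearson_r(flux, size)
r_log = pearson_r(log_flux, size)
tau_linear = kendall_tau(flux, size)
tau_log = kendall_tau(log_flux, size)

print(f"Pearson r:   linear scale {r_linear:.3f}, log scale {r_log:.3f}")
print(f"Kendall tau: linear scale {tau_linear:.3f}, log scale {tau_log:.3f}")
```

With a parametric statistic, the choice between flux and log flux would change the quoted correlation. A rank-based statistic removes that arbitrary choice from the inference.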
   9	
  
Astronomy & Statistics: A glorious past
For most of western history,
the astronomers were the statisticians!
Ancient Greeks to 18th century
Best estimate of the length of a year from discrepant data?
•  Middle of range: Hipparchus (2nd century B.C.)
•  Observe only once! (medieval)
•  Mean: Brahe (16th c), Galileo (17th c), Simpson (18th c)
•  Median (20th c)
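
The three historical estimators listed above can be compared in a few lines of Python. The year-length values below are invented for illustration (the last one mimics a gross error). The sketch only shows how differently the estimators react to a discrepant observation.

```python
import statistics

# Hypothetical, deliberately discrepant estimates of the year length (days).
year_lengths = [365.20, 365.24, 365.25, 365.26, 365.27, 366.10]

midrange = (min(year_lengths) + max(year_lengths)) / 2   # middle of range
mean = statistics.mean(year_lengths)                      # arithmetic mean
median = statistics.median(year_lengths)                  # sample median

print(f"midrange = {midrange:.3f}")
print(f"mean     = {mean:.3f}")
print(f"median   = {median:.3f}")
```

The midrange depends only on the two extreme values, so it is pulled furthest by the outlier. The mean is pulled less and the median least.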
19th century
Discrepant observations of planets/moons/comets used to estimate
orbital parameters using Newtonian celestial mechanics
•  Legendre, Laplace & Gauss develop least-squares regression
and normal error theory (c.1800-1820)
•  Prominent astronomers contribute to least-squares theory
(c.1850-1900)
10	
  
The lost century of astrostatistics….
In the late-19th and 20th centuries, statistics moved towards
human sciences (demography, economics, psychology,
medicine, politics) and industrial applications (agriculture,
mining, manufacturing).
During this time, astronomy recognized the power of
modern physics: electromagnetism, thermodynamics,
quantum mechanics, relativity. Astronomy & physics were
wedded into astrophysics.
Thus, astronomers and statisticians substantially broke contact;
e.g. the curriculum of astronomers heavily involved physics
but little statistics. Statisticians today know little modern
astronomy.
11	
  
The state of astrostatistics today
(not good!)
Many astronomical studies are confined to a narrow suite
of familiar statistical methods:
–  Fourier transform for temporal analysis (Fourier 1807)
–  Least squares regression for model fits
(Legendre 1805, Pearson 1901)
–  Kolmogorov-Smirnov goodness-of-fit test (Kolmogorov, 1933)
–  Principal components analysis for tables (Hotelling 1936)
Even traditional methods are often misused: final lecture on Friday
12	
  
Under-utilized methodology:
•  modeling (MLE, EM Algorithm, BIC, bootstrap)
•  multivariate classification (LDA, SVM, CART, RFs)
•  time series (autoregressive models, state space models)
•  spatial point processes (Ripley’s K, kriging)
•  nondetections (survival analysis)
•  image analysis (computer vision methods, False Discovery Rate)
•  statistical computing (R)
Advertisement …
Modern Statistical Methods for Astronomy
with R Applications
E. D. Feigelson & G. J. Babu,
Cambridge Univ Press, 2012
Winner 2012 PROSE Award for Best Astronomy & Cosmology Book
13	
  
An astrostatistics lexicon …
Cosmology                        Statistics
Galaxy clustering                Spatial point processes, clustering
Galaxy morphology                Regression, mixture models
Galaxy luminosity fn             Gamma distribution
Power law relationships          Pareto distribution
Weak lensing morphology          Geostatistics, density estimation
Strong lensing morphology        Shape statistics
Strong lensing timing            Time series with lag
Faint source detection           False Discovery Rate
Multiepoch survey lightcurves    Multivariate classification
CMB spatial analysis             Markov fields, ICA, etc
ΛCDM parameters                  Bayesian inference & model selection
Comparing data & simulation      under development
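
The lexicon pairs faint source detection with the False Discovery Rate. As a minimal sketch of what that involves, here is the Benjamini-Hochberg step-up procedure in Python, assuming independent tests. The function name and the p-values are hypothetical, and a real source-detection pipeline would need a method suited to its own noise model.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected,
    controlling the false discovery rate at level q
    (Benjamini-Hochberg step-up procedure; independent tests assumed)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for ten candidate sources.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.9]
detections = benjamini_hochberg(p_vals, q=0.05)
print(detections)
```

Unlike a fixed per-test threshold, the cutoff adapts to the number of tests and to the distribution of p-values, which matters when millions of pixels or candidates are tested at once.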
14	
  
Recent resurgence in astrostatistics
• Improved access to statistical software. R/CRAN public-domain statistical
software environment with thousands of functions. Increasing capability in Python.
• Papers in astronomical literature doubled to ~500/yr in past decade (“Methods:
statistical” papers in NASA-Smithsonian Astrophysics Data System)
• Short training courses (Penn State, India, Brazil, Spain, Greece, China, Italy, France,
ESO, ESA, conferences)
• Cross-disciplinary research collaborations (Harvard/ICHASC, Carnegie-Mellon, Penn
State, NASA-Ames/Stanford, CEA-Saclay/Stanford, Cornell, UC-Berkeley, Michigan, Imperial
College London, LSST Statistics & Informatics Science Collaboration, …)
• Cross-disciplinary conferences (Statistical Challenges in Modern Astronomy 1991--,
Astronomical Data Analysis, PhysStat, SAMSI programs 2012/16, Astroinformatics
2012--, CosmoStat 2014/16, IAU/WSC/JSM, …)
• Scholarly society working groups and a new integrated Web portal
asaip.psu.edu serving: Int’l Astrostatistical Assn (~ Int’l Statistical Institute), Int’l Astro
Union Working Group, Amer Astro Soc Working Group, Amer Stat Assn Interest Group,
IEEE Task Force, LSST Science Collaboration
•  Increased review of statistical methodology by journals (Nature, Science, ApJ)
15
  
Textbooks
Bayesian Logical Data Analysis for the Physical Sciences: A
Comparative Approach with Mathematica Support, Gregory, 2005
Practical Statistics for Astronomers, Wall & Jenkins, 2nd ed 2012
Modern Statistical Methods for Astronomy with R Applications,
Feigelson & Babu, 2012
Statistics, Data Mining, and Machine Learning in Astronomy: A
Practical Python Guide for the Analysis of Survey Data,
Ivezić, Connolly, VanderPlas & Gray, 2014
	
  
16	
  
A new imperative: Large-scale surveys & megadatasets
Huge imaging, spectroscopic & multivariate datasets are emerging
from specialized survey projects & telescopes:
–  10^9-object photometric catalogs from USNO, 2MASS, SDSS, …
–  10^6-10^8-object spectroscopic catalogs from SDSS, LAMOST, …
–  10^6-10^7-source radio/infrared/X-ray catalogs from WISE, eROSITA, …
–  Spectral-image datacubes from VLA, ALMA, IFUs, …
–  10^9-object x 10^2-epoch (3D) surveys (PTF, CRTS, SNF, VVV,
Pan-STARRS, Stripe 82, DES, …, LSST)
The Virtual Observatory is an international effort to federate
many distributed on-line astronomical databases.
Powerful statistical tools are needed to derive
scientific insights from terabyte-, petabyte- and exabyte-scale databases
17	
  
To treat massive data streams and databases …
Rapid rise of astroinformatics
Statistics guides the scientist on what to compute
Informatics helps the scientist perform the computation
Methodology: Computationally intensive astronomy, data mining,
multivariate regression & classification, machine learning, Monte Carlo
methods, NlogN algorithms, etc.
Software & hardware: Parallel processing on multi-processor
machines, cloud computing, CUDA & GPU computing, database
management & promulgation, etc.
Workshops & training schools emerging. IAU Symposium #325
Astroinformatics in Sorrento IT, October 2016. Growing perception that
more community training is needed.
18	
  
Join a Working Group and the
Astrostatistics and Astroinformatics Portal
http://asaip.psu.edu
Recent papers, meetings, jobs, blogs, courses, forums, …
19	
  
A vision of astrostatistics in 2025 …
•  Astronomy graduate curriculum has 1 year of statistical and
computational methodology
•  Some astronomers have M.S. in statistics and computer science
•  Astrostatistics and astroinformatics is a well-funded, cross-
disciplinary research field involving a few percent of astronomers
(cf. astrochemists) pushing the frontiers of methodology.
•  Astronomers regularly use many methods coded in R.
•  Statistical Challenges in Modern Astronomy meetings are held
annually with ~400 participants
20	
  
Prelude to R ….
A brief history of statistical computing
1960s – c2000: Statistical analysis developed by academic
statisticians, but implementation relegated to commercial
companies (SAS, BMDP, Statistica, Stata, Minitab, etc).
1980s: John Chambers (AT&T, USA) develops S system, C-like
command line interface.
1990s: Ross Ihaka & Robert Gentleman (Univ Auckland NZ) mimic S
in an open source system, R. R Core Development Team expands,
GNU GPL release.
Early-2000s: Comprehensive R Archive Network (CRAN) for
user-provided specialized packages grows exponentially. Important
packages incorporated into base-R.
21	
  
Growth of CRAN contributed packages
[Figure: number of CRAN contributed packages over time; 2-year doubling time]
4 April 2016: 8206 packages (~6/day), ~150,000 functions
See The Popularity of Data Analysis Software, R. A. Muenchen, http://r4stats.com
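
As a rough consistency check on the slide's numbers, the short Python sketch below converts the quoted 2-year doubling time into an implied daily growth rate. It assumes pure exponential growth, which is an idealization, and the projection function is a naive extrapolation rather than a forecast.

```python
import math

n_packages = 8206          # CRAN packages on 4 April 2016 (from the slide)
doubling_time_years = 2.0  # doubling time quoted on the slide

# Exponential growth N(t) = N0 * 2**(t / T) has instantaneous rate N * ln(2) / T.
rate_per_day = n_packages * math.log(2) / (doubling_time_years * 365.25)
print(f"implied growth rate: {rate_per_day:.1f} packages/day")

def projected_packages(years_ahead):
    """Naive extrapolation assuming the doubling time stays constant."""
    return n_packages * 2 ** (years_ahead / doubling_time_years)

print(f"projected after 2 years: {projected_packages(2.0):.0f}")
```

The implied rate of roughly 8 packages per day is of the same order as the ~6/day quoted on the slide, so the two figures are broadly compatible.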
  
22	
  
R’s growing importance in data science
[Figures: Rexer Analytics Data Miner Survey 2013; posts on software forums 2013; job trends from Indeed.com; labeled curves: R, SPSS]
See R vs. Python debates on ASAIP Software Forum
23	
  
The R statistical computing environment
•  R integrates data manipulation, graphics and extensive statistical
analysis. Uniform documentation and coding standards. But quality
control is limited for community-provided CRAN packages.
•  Fully programmable C-like language, similar to IDL & Matlab.
Specializes in vector/matrix inputs.
•  Easy download from http://www.r-project.org for Windows, Mac or
Linux. On-the-fly installation of CRAN packages. Quick communication
with C, Fortran, Python. Emulator of Matlab.
•  >8000 user-provided add-on CRAN packages, ~150,000 statistical
functions
  
	
  
24	
  
•  Many resources: R help files (3500p for base R), CRAN Task Views
and vignette files, on-line tutorials, >150 books, >400 blogs, Use R!
conferences, galleries, companies, The R Journal & J. Stat. Software,
etc.
Principal steps for using R in astronomical research:
–  Knowing what you want         [education, consulting, thought]
–  Finding what you want         [Google, Rseek, Rdocumentation]
–  Writing R scripts             [R Help files, StackOverflow, books]
–  Understanding what you find   [education, consulting, thought]
  
	
  
25	
  
Some functionalities of base R
arithmetic & linear algebra
bootstrap resampling
empirical distribution tests
exploratory data analysis
generalized linear modeling
graphics
robust statistics
linear programming
local and ridge regression
max likelihood estimation
multivariate analysis
multivariate clustering
neural networks
smoothing
spatial point processes
statistical distributions
statistical tests
survival analysis
time series analysis
  
26	
  
Selected methods in Comprehensive R Archive Network (CRAN)
Bayesian computation & MCMC, classification & regression trees, genetic
algorithms, geostatistical modeling, hidden Markov models, irregular
time series, kernel-based machine learning, least-angle & lasso
regression, likelihood ratios, map projections, mixture models & model-
based clustering, nonlinear least squares, multidimensional analysis,
multimodality test, multivariate time series, multivariate outlier
detection, neural networks, non-linear time series analysis,
nonparametric multiple comparisons, omnibus tests for normality,
orientation data, parallel coordinates plots, partial least squares,
periodic autoregression analysis, principal curve fits, projection pursuit,
quantile regression, random fields, Random Forest classification, ridge
regression, robust regression, Self-Organizing Maps, shape analysis,
space-time ecological analysis, spatial analysis & kriging, spline
regressions, tessellations, three-dimensional visualization, wavelet
toolbox
27	
  
CRAN Task Views
(http://cran.r-project.org/web/views)
CRAN Task Views provide brief overviews of CRAN packages by topic &
functionality. Maintained by expert volunteers.
Partial list:
•  Bayesian            ~110 packages
•  Chem/Phys           ~75 packages (incl. 20 for astronomy)
•  Cluster/Mixture     ~100 packages
•  Graphics            ~40 packages
•  HighPerfComp        ~75 packages
•  Machine Learning    ~70 packages
•  Medical imaging     ~20 packages
•  Robust              ~50 packages
•  Spatial             ~135 packages
•  Survival            ~200 packages
•  TimeSeries          ~170 packages
  
	
  
	
   28	
  
Since c.2005, R has been the
world’s premier
public-domain
statistical computing package
Data scientists recommend both Python and R
(https://asaip.psu.edu/forums/software-forum/195790576)
29	
  
Some common statistical problems in astronomical papers
o  Overuse of Kolmogorov-Smirnov test: incorrect significance levels,
less sensitive than Anderson-Darling test
o  Overuse of histograms for inference
o  Overuse of heuristic parametric regression (e.g. linear, powerlaw).
Use new local regression methods (splines, LOESS, Gaussian Processes
regression)
o  Overuse of `minimum chi-squared’ regression, assuming scatter is
due to measurement errors
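
To illustrate the first item, the Python sketch below computes the one-sample Kolmogorov-Smirnov and Anderson-Darling statistics against a fully specified model. The sample values, the standard normal model and the function names are assumptions made for this example. The sketch stops at the statistics: converting them into significance levels needs the appropriate reference distribution or resampling, and standard tables do not apply when the model parameters were estimated from the same data.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution with known mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D = sup |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def ad_statistic(sample, cdf):
    """One-sample Anderson-Darling statistic A^2 for a fully specified model.
    Assumes 0 < cdf(x) < 1 for every sample value."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        f_lo = cdf(xs[i - 1])      # F(X_(i))
        f_hi = cdf(xs[n - i])      # F(X_(n+1-i))
        total += (2 * i - 1) * (math.log(f_lo) + math.log(1.0 - f_hi))
    return -n - total / n

# Hypothetical sample with one value far out in the upper tail.
sample = [-1.9, -1.1, -0.7, -0.3, 0.0, 0.2, 0.5, 0.9, 1.4, 3.2]
d = ks_statistic(sample, normal_cdf)
a2 = ad_statistic(sample, normal_cdf)
print(f"KS D = {d:.3f}, AD A^2 = {a2:.3f}")
```

The Anderson-Darling statistic weights discrepancies in the tails of the distribution more heavily than the Kolmogorov-Smirnov supremum does, which is one way to understand the sensitivity difference noted above.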
  
	
  
30	
  
o  Overuse of regression when response variable not specified by science
o  Underuse of Poisson & logistic regression
o  Insufficient examination of regression results: R^2, residual analysis
(test for normality, autocorrelation, outliers via Cook’s distance)
o  Overuse of Bayesian inference with uninformative priors
o  Overuse of `friends-of-friends’ algorithm or subjective evaluation for
unsupervised clustering
o  Underuse of machine learning methods for supervised classification
(CART/Random Forests, Support Vector Machines, neural networks, …)
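
The regression checks named above (R^2 and Cook's distance) are inexpensive to compute. The following Python sketch does so for an ordinary least-squares straight-line fit. The data and function names are hypothetical, and established statistical software would normally be used instead of hand-written formulas.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def diagnostics(x, y):
    """Return R^2 and the Cook's distance of each point for a straight-line fit."""
    n, p = len(x), 2                      # p = number of fitted parameters
    a, b = fit_line(x, y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)      # residual sum of squares
    sst = sum((yi - my) ** 2 for yi in y) # total sum of squares
    r2 = 1.0 - sse / sst
    s2 = sse / (n - p)                    # residual variance estimate
    cooks = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - mx) ** 2 / sxx   # leverage of this point
        cooks.append(e ** 2 / (p * s2) * h / (1.0 - h) ** 2)
    return r2, cooks

# Hypothetical data: a linear trend with one discrepant point at the end.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 25.0]

r2, cooks = diagnostics(x, y)
worst = max(range(len(x)), key=lambda i: cooks[i])
print(f"R^2 = {r2:.3f}; most influential point: index {worst}, "
      f"Cook's D = {cooks[worst]:.2f}")
```

A respectable R^2 can coexist with a single point that dominates the fit, which is why the residual and influence checks are worth reporting alongside the fitted parameters.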
  
	
  
31	
  
Conclusion
While a vanguard of astronomers use and develop advanced
methodologies for specific applications, many studies utilize a narrow
suite of familiar methods.

Astronomers need to become more informed and more involved in
statistical methodology, for both data analysis and for science
analysis.

Areas of common weakness of statistical analyses in astronomical
studies can be identified. Improvement is often not difficult. Highly
capable free software, such as R/CRAN, can be effective in bringing
new methodology to bear on astronomical problems.
  	
  
32	
  


05 astrostat feigelson

  • 1. Astrosta's'cs:  The  Role  of   Sta's'cs  in  Astronomical  Research   Eric  Feigelson   Center  for  Astrosta2s2cs   Penn  State  University   BigSkyEarth        DLR  Germany    April  2016   1  
  • 2. The  underlying  situa0on   Astronomers  are  well-­‐trained  in  the  mathema2cs  underlying   physics,  but  not  in  applied  fields  associated  with  sta2s2cal   methodology.         Consequently,  many  astronomers  use  a  narrow  suite  of  familiar   sta2s2cal  methods  that  are  oNen  non-­‐op2mal,  and  some2mes   incorrectly  applied,  for  a  wide  range  of  data  and  science  analysis   challenges.       This  talk  highlights  some  common  problems  in  recent  astronomical   studies,  and  encourages  use  of  improved  methodology.   2  
  • 3. Outline  of  this  talk   Ø  Astrosta2s2cs  =  Astronomy  +  Sta2s2cs:  not  so  simple   Ø  History  of  astronomy  &  sta2s2cs:    good  à  bad     Ø  Astrosta2s2cs  today:  improving   Ø  R:  The  premier  sta2s2cal  compu2ng  environment   Ø  Common  sta2s2cal  problems  in  astronomical  research     3  
  • 4. What is astronomy? Astronomy is the observational study of matter beyond Earth: planets in the Solar System, stars in the Milky Way Galaxy, galaxies in the Universe, and diffuse matter between these concentrations. Astrophysics is the study of the intrinsic nature of astronomical bodies and the processes by which they interact and evolve. This is an indirect, inferential intellectual effort based on the assumption that physics – gravity, electromagnetism, quantum mechanics, etc – apply universally to distant cosmic phenomena. 4  
  • 5. What is statistics? (No consensus !!) –  “… briefly, and in its most concrete form, the object of statistical methods is the reduction of data” (R. A. Fisher, 1922) –  “Statistics is the mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data.” (Wikipedia, 2014.0) –  “Statistics is the study of the collection, analysis, interpretation, presentation and organization of data.” (Wikipedia, 2014.7) –  “A statistical inference carries us from observations to conclusions about the populations sampled” (D. R. Cox, 1958) 5  
  • 6. Does statistics relate to scientific models? The pessimists … “Essentially, all models are wrong, but some are useful.” (Box & Draper 1987) “There is no need for these hypotheses to be true, or even to be at all like the truth; rather … they should yield calculations which agree with observations” (Osiander’s Preface to Copernicus’ De Revolutionibus, quoted by C. R. Rao) "The object [of statistical inference] is to provide ideas and methods for the critical analysis and, as far as feasible, the interpretation of empirical data ... The extremely challenging issues of scientific inference may be regarded as those of synthesising very different kinds of conclusions if possible into a coherent whole or theory ... The use, if any, in the process of simple quantitative notions of probability and their numerical assessment is unclear." (D. R. Cox, 2006) 6  
  • 7. The positivists … “The goal of science is to unlock nature’s secrets. … Our understanding comes through the development of theoretical models which are capable of explaining the existing observations as well as making testable predictions. … “Fortunately, a variety of sophisticated mathematical and computational approaches have been developed to help us through this interface, these go under the general heading of statistical inference.” (P. C. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, 2005) 7  
  • 8. Recommended steps in the statistical analysis of scientific data The application of statistics can reliably quantify information embedded in scientific data and help adjudicate the relevance of theoretical models. But this is not a straightforward, mechanical enterprise. It requires: Ø exploration of the data Ø careful statement of the scientific problem Ø model formulation in mathematical form Ø choice of statistical method(s) Ø calculation of statistical quantities Ø judicious scientific evaluation of the results Astronomers often do not adequately pursue each step 8  
  • 9. •  Modern statistics is vast in its scope and methodology. It is difficult to find what may be useful (jargon problem!), and there are usually several ways to proceed. Very confusing. •  Some statistical procedures are based on mathematical proofs which determine the applicability of established results. It is perilous to violate mathematical truths! Some issues are debated among statisticians, or have no known solution. •  Scientific inferences should not depend on arbitrary choices in methodology & variable scale. Prefer nonparametric & scale- invariant methods. Try multiple methods. •  It can be difficult to interpret the meaning of a statistical result with respect to the scientific goal. Statistics is only a tool towards understanding nature from incomplete information. We should be knowledgeable in our use of statistics and judicious in its interpretation   9  
  • 10. Astronomy & Statistics: A glorious past For most of western history, the astronomers were the statisticians! Ancient Greeks to 18th century Best estimate of the length of a year from discrepant data? •  Middle of range: Hipparcos (4th century B.C.) •  Observe only once! (medieval) •  Mean: Brahe (16th c), Galileo (17th c), Simpson (18th c) •  Median (20th c) 19th century Discrepant observations of planets/moons/comets used to estimate orbital parameters using Newtonian celestial mechanics •  Legendre, Laplace & Gauss develop least-squares regression and normal error theory (c.1800-1820) •  Prominent astronomers contribute to least-squares theory (c.1850-1900) 10  
  • 11. The lost century of astrostatistics…. In the late-19th and 20th centuries, statistics moved towards human sciences (demography, economics, psychology, medicine, politics) and industrial applications (agriculture, mining, manufacturing). During this time, astronomy recognized the power of modern physics: electromagnetism, thermodynamics, quantum mechanics, relativity. Astronomy & physics were wedded into astrophysics. Thus, astronomers and statisticians substantially broke contact; e.g. the curriculum of astronomers heavily involved physics but little statistics. Statisticians today know little modern astronomy. 11  
  • 12. The state of astrostatistics today (not good!) Many astronomical studies are confined to a narrow suite of familiar statistical methods: –  Fourier transform for temporal analysis (Fourier 1807) –  Least squares regression for model fits (Legendre 1805, Pearson 1901) –  Kolmogorov-Smirnov goodness-of-fit test (Kolmogorov, 1933) –  Principal components analysis for tables (Hotelling 1936) Even traditional methods are often misused: final lecture on Friday
  • 13. Under-utilized methodology: •  modeling (MLE, EM Algorithm, BIC, bootstrap) •  multivariate classification (LDA, SVM, CART, RFs) •  time series (autoregressive models, state space models) •  spatial point processes (Ripley’s K, kriging) •  nondetections (survival analysis) •  image analysis (computer vision methods, False Discovery Rate) •  statistical computing (R) Advertisement … Modern Statistical Methods for Astronomy with R Applications, E. D. Feigelson & G. J. Babu, Cambridge Univ Press, 2012. Winner 2012 PROSE Award for Best Astronomy & Cosmology Book
  • 14. An astrostatistics lexicon …
      Cosmology                        Statistics
      Galaxy clustering                Spatial point processes, clustering
      Galaxy morphology                Regression, mixture models
      Galaxy luminosity fn             Gamma distribution
      Power law relationships          Pareto distribution
      Weak lensing morphology          Geostatistics, density estimation
      Strong lensing morphology        Shape statistics
      Strong lensing timing            Time series with lag
      Faint source detection           False Discovery Rate
      Multiepoch survey lightcurves    Multivariate classification
      CMB spatial analysis             Markov fields, ICA, etc
      ΛCDM parameters                  Bayesian inference & model selection
      Comparing data & simulation      under development
  • 15. Recent resurgence in astrostatistics • Improved access to statistical software. R/CRAN public-domain statistical software environment with thousands of functions. Increasing capability in Python. • Papers in astronomical literature doubled to ~500/yr in past decade (“Methods: statistical” papers in NASA-Smithsonian Astrophysics Data System) • Short training courses (Penn State, India, Brazil, Spain, Greece, China, Italy, France, ESO, ESA, conferences) • Cross-disciplinary research collaborations (Harvard/ICHASC, Carnegie-Mellon, Penn State, NASA-Ames/Stanford, CEA-Saclay/Stanford, Cornell, UC-Berkeley, Michigan, Imperial College London, LSST Statistics & Informatics Science Collaboration, …) • Cross-disciplinary conferences (Statistical Challenges in Modern Astronomy 1991--, Astronomical Data Analysis, PhysStat, SAMSI programs 2012/16, Astroinformatics 2012--, CosmoStat 2014/16, IAU/WSC/JSM, …) • Scholarly society working groups and a new integrated Web portal, asaip.psu.edu, serving: Int’l Astrostatistical Assn (~ Int’l Statistical Institute), Int’l Astro Union Working Group, Amer Astro Soc Working Group, Amer Stat Assn Interest Group, IEEE Task Force, LSST Science Collaboration • Increased review of statistical methodology by journals (Nature, Science, ApJ)
  • 16. Textbooks •  Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with Mathematica Support, Gregory, 2005 •  Practical Statistics for Astronomers, Wall & Jenkins, 2nd ed 2012 •  Modern Statistical Methods for Astronomy with R Applications, Feigelson & Babu, 2012 •  Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data, Ivezić, Connolly, VanderPlas & Gray, 2014
  • 17. A new imperative: Large-scale surveys & megadatasets. Huge imaging, spectroscopic & multivariate datasets are emerging from specialized survey projects & telescopes: –  10^9-object photometric catalogs from USNO, 2MASS, SDSS, … –  Spectroscopic catalogs of size 10^6 to 10^8 from SDSS, LAMOST, … –  10^6 to 10^7-source radio/infrared/X-ray catalogs from WISE, eROSITA, … –  Spectral-image datacubes from VLA, ALMA, IFUs, … –  10^9-object x 10^2-epoch (3D) surveys (PTF, CRTS, SNF, VVV, Pan-STARRS, Stripe 82, DES, …, LSST) The Virtual Observatory is an international effort to federate many distributed on-line astronomical databases. Powerful statistical tools are needed to derive scientific insights from terabyte-, petabyte- and exabyte-scale databases
  • 18. To treat massive data streams and databases … Rapid rise of astroinformatics. Statistics guides the scientist on what to compute. Informatics helps the scientist perform the computation. Methodology: Computationally intensive astronomy, data mining, multivariate regression & classification, machine learning, Monte Carlo methods, N log N algorithms, etc. Software & hardware: Parallel processing on multi-processor machines, cloud computing, CUDA & GPU computing, database management & promulgation, etc. Workshops & training schools emerging. IAU Symposium #325 Astroinformatics in Sorrento, Italy, October 2016. Growing perception that more community training is needed.
  • 19. Join a Working Group and the Astrostatistics and Astroinformatics Portal http://asaip.psu.edu Recent papers, meetings, jobs, blogs, courses, forums, …
  • 20. A vision of astrostatistics in 2025 … •  Astronomy graduate curriculum has 1 year of statistical and computational methodology •  Some astronomers have M.S. in statistics and computer science •  Astrostatistics and astroinformatics is a well-funded, cross-disciplinary research field involving a few percent of astronomers (cf. astrochemists) pushing the frontiers of methodology. •  Astronomers regularly use many methods coded in R. •  Statistical Challenges in Modern Astronomy meetings are held annually with ~400 participants
  • 21. Prelude to R …. A brief history of statistical computing. 1960s – c2000: Statistical analysis developed by academic statisticians, but implementation relegated to commercial companies (SAS, BMDP, Statistica, Stata, Minitab, etc). 1980s: John Chambers (AT&T, USA) develops S system, C-like command line interface. 1990s: Ross Ihaka & Robert Gentleman (Univ Auckland NZ) mimic S in an open source system, R. R Core Development Team expands, GNU GPL release. Early-2000s: Comprehensive R Archive Network (CRAN) for user-provided specialized packages grows exponentially. Important packages incorporated into base-R.
  • 22. Growth of CRAN contributed packages. 4 April 2016: 8206 packages (~6/day), ~150,000 functions. 2-year doubling time. See The Popularity of Data Analysis Software, R. A. Muenchen, http://r4stats.com
  • 23. R’s growing importance in data science. [Figure panels: Rexer Analytics Data Miner Survey 2013; Posts on software forums 2013; Job trends from Indeed.com (legend: R, SPSS)] See R vs. Python debates on ASAIP Software Forum
  • 24. The R statistical computing environment •  R integrates data manipulation, graphics and extensive statistical analysis. Uniform documentation and coding standards. But quality control is limited for community-provided CRAN packages. •  Fully programmable C-like language, similar to IDL & Matlab. Specializes in vector/matrix inputs. •  Easy download from http://www.r-project.org for Windows, Mac or Linux. On-the-fly installation of CRAN packages. Quick communication with C, Fortran, Python. Emulator of Matlab. •  >8000 user-provided add-on CRAN packages, ~150,000 statistical functions
  • 25. •  Many resources: R help files (3500p for base R), CRAN Task Views and vignette files, on-line tutorials, >150 books, >400 blogs, Use R! conferences, galleries, companies, The R Journal & J. Stat. Software, etc. Principal steps for using R in astronomical research: –  Knowing what you want [education, consulting, thought] –  Finding what you want [Google, Rseek, Rdocumentation] –  Writing R scripts [R Help files, StackOverflow, books] –  Understanding what you find [education, consulting, thought]
  • 26. Some functionalities of base R: arithmetic & linear algebra, bootstrap resampling, empirical distribution tests, exploratory data analysis, generalized linear modeling, graphics, robust statistics, linear programming, local and ridge regression, max likelihood estimation, multivariate analysis, multivariate clustering, neural networks, smoothing, spatial point processes, statistical distributions, statistical tests, survival analysis, time series analysis
  • 27. Selected methods in Comprehensive R Archive Network (CRAN): Bayesian computation & MCMC, classification & regression trees, genetic algorithms, geostatistical modeling, hidden Markov models, irregular time series, kernel-based machine learning, least-angle & lasso regression, likelihood ratios, map projections, mixture models & model-based clustering, nonlinear least squares, multidimensional analysis, multimodality test, multivariate time series, multivariate outlier detection, neural networks, non-linear time series analysis, nonparametric multiple comparisons, omnibus tests for normality, orientation data, parallel coordinates plots, partial least squares, periodic autoregression analysis, principal curve fits, projection pursuit, quantile regression, random fields, Random Forest classification, ridge regression, robust regression, Self-Organizing Maps, shape analysis, space-time ecological analysis, spatial analysis & kriging, spline regressions, tessellations, three-dimensional visualization, wavelet toolbox
  • 28. CRAN Task Views (http://cran.r-project.org/web/views) CRAN Task Views provide brief overviews of CRAN packages by topic & functionality. Maintained by expert volunteers. Partial list:
      •  Bayesian            ~110 packages
      •  Chem/Phys           ~75 packages (incl. 20 for astronomy)
      •  Cluster/Mixture     ~100 packages
      •  Graphics            ~40 packages
      •  HighPerfComp        ~75 packages
      •  Machine Learning    ~70 packages
      •  Medical imaging     ~20 packages
      •  Robust              ~50 packages
      •  Spatial             ~135 packages
      •  Survival            ~200 packages
      •  TimeSeries          ~170 packages
  • 29. Since c.2005, R has been the world’s premier public-domain statistical computing package. Data scientists recommend both Python and R (https://asaip.psu.edu/forums/software-forum/195790576)
  • 30. Some common statistical problems in astronomical papers o  Overuse of Kolmogorov-Smirnov test: incorrect significance levels, less sensitive than Anderson-Darling test o  Overuse of histograms for inference o  Overuse of heuristic parametric regression (e.g. linear, power law). Use new local regression methods (splines, LOESS, Gaussian Processes regression) o  Overuse of 'minimum chi-squared' regression, assuming scatter is due to measurement errors
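The first item on this slide can be made concrete with a minimal Python sketch (standard library only). It computes the one-sample Kolmogorov-Smirnov statistic D and the Anderson-Darling statistic A^2 against a fully specified standard normal model. Both samples are constructed for illustration and are hypothetical: one sits exactly on the model quantiles, the other has only its tails stretched. The sketch computes the statistics only, not significance levels, and it assumes the model parameters are known in advance rather than estimated from the same data.

```python
import math
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov D: largest gap between empirical and model CDFs."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))

def anderson_darling_statistic(sample, cdf):
    """One-sample Anderson-Darling A^2 against a fully specified model CDF."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(n):
        # (2i + 1) with 0-based i equals (2i - 1) in the usual 1-based formula.
        total += (2 * i + 1) * (math.log(cdf(xs[i])) + math.log(1.0 - cdf(xs[n - 1 - i])))
    return -n - total / n

model = NormalDist(0.0, 1.0)
n = 100
# Constructed sample lying exactly on the model quantiles (illustrative only).
on_model = [model.inv_cdf((i + 0.5) / n) for i in range(n)]
# Same sample with only the tails altered: values with |x| > 1.5 are scaled by 1.5.
heavy_tails = [1.5 * x if abs(x) > 1.5 else x for x in on_model]

for label, sample in [("on model", on_model), ("heavy tails", heavy_tails)]:
    d = ks_statistic(sample, model.cdf)
    a2 = anderson_darling_statistic(sample, model.cdf)
    print(f"{label:12s} KS D = {d:.4f}   AD A^2 = {a2:.4f}")
```

A^2 weights the squared difference between the empirical and model distribution functions by 1/(F(1-F)), so discrepancies in the tails count more heavily than they do in D, which is a single maximum gap. When model parameters are estimated from the same data being tested, the standard tabulated KS significance levels no longer apply.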
  • 31. o  Overuse of regression when response variable not specified by science o  Underuse of Poisson & logistic regression o  Insufficient examination of regression results: R^2, residual analysis (test for normality, autocorrelation, outliers via Cook’s distance) o  Overuse of Bayesian inference with uninformative priors o  Overuse of 'friends-of-friends' algorithm or subjective evaluation for unsupervised clustering o  Underuse of machine learning methods for supervised classification (CART/Random Forests, Support Vector Machines, neural networks, …)
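For the Poisson regression item above, here is a minimal Python sketch of what such a fit involves when the response is a small integer count. It assumes a log-linear model, log(mu) = a + b*x, and maximizes the Poisson likelihood by Newton-Raphson. The data, the function name `fit_poisson_regression` and the fixed iteration count are all hypothetical choices for illustration, not taken from the talk.

```python
import math

def fit_poisson_regression(x, y, n_iter=25):
    """Fit log(mu) = a + b*x by Newton-Raphson on the Poisson log-likelihood."""
    a, b = math.log(sum(y) / len(y)), 0.0  # start from the constant-rate model
    for _ in range(n_iter):
        mu = [math.exp(a + b * xi) for xi in x]
        # Score vector: gradient of the log-likelihood with respect to (a, b).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system H * delta = g in closed form.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical counts (e.g. sources detected per field) against a covariate x.
x = list(range(10))
y = [0, 1, 0, 2, 1, 3, 2, 5, 6, 9]

a, b = fit_poisson_regression(x, y)
fitted = [math.exp(a + b * xi) for xi in x]
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print("fitted means:", [round(m, 2) for m in fitted])
```

Unlike least squares on the raw or log-transformed counts, this treats zero counts naturally and never predicts a negative rate. In R the same model is usually fitted with `glm(y ~ x, family = poisson)`.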
  • 32. Conclusion. While a vanguard of astronomers use and develop advanced methodologies for specific applications, many studies utilize a narrow suite of familiar methods. Astronomers need to become more informed and more involved in statistical methodology, for both data analysis and for science analysis. Areas of common weakness of statistical analyses in astronomical studies can be identified. Improvement is often not difficult. Highly capable free software, such as R/CRAN, can be effective in bringing new methodology to bear on astronomical problems.