Visualizing High-Dimensional Data with Manifold Learning in R
BY COLLEEN M. FARRELLY, DATA SCIENTIST AT GRAHAM HOLDINGS (KAPLAN HIGHER AND PROFESSIONAL EDUCATION)
My Path to Data Science
Former MD/PhD student who started doing research and attending workshops in geometry, topology, and machine learning
Switched degree programs into biostatistics with a topology-based slant
Have worked in biotechnology, military, education, and the social sciences
Currently on the business side of running a university, with a lot of financial modeling and risk modeling
Mining for Data Relationships
Exploratory analysis
  • Important step in data science projects
  • Trend/covariance visualization
  • Clustering
  • Powerful combination for understanding many types of problems
Types of data problems
  • Time series analyses
  • Predictive analyses
  • Network analyses
[Figure: Intelligence and Achievement Dendrogram, from hclust (*, "complete") on dist(mydata[, 2:4]); 17 observations; vertical axis: Height (0 to 60). Annotation: unique subgroup identified.]
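The dendrogram's labels name the calls behind it: complete-linkage `hclust()` on `dist(mydata[, 2:4])`. The sketch below reproduces that pipeline on synthetic data. The column names, the values, and the choice of cutting the tree into two groups are assumptions for illustration only, since the original `mydata` is not part of the deck.

```r
# Synthetic stand-in for the slide's `mydata` (hypothetical columns and values).
set.seed(42)
mydata <- data.frame(
  id          = 1:17,
  iq          = c(rnorm(14, mean = 100, sd = 8), rnorm(3, mean = 140, sd = 3)),
  reading     = c(rnorm(14, mean = 50,  sd = 5), rnorm(3, mean = 80,  sd = 2)),
  mathematics = c(rnorm(14, mean = 50,  sd = 5), rnorm(3, mean = 82,  sd = 2))
)

# Euclidean distances on columns 2:4, then complete-linkage clustering,
# matching the calls named on the figure.
d  <- dist(mydata[, 2:4])
hc <- hclust(d, method = "complete")

# Cut the tree into two groups (an assumed choice) to look for a small subgroup.
groups <- cutree(hc, k = 2)
print(table(groups))

# Uncomment to draw the tree:
# plot(hc, main = "Intelligence and Achievement Dendrogram")
```

Inspecting `table(groups)` alongside the plotted tree is one way to check whether a small branch really stands apart from the rest.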
Time Series and Financial Data
Key tasks in time series/financial data analyses:
  • Forecasting future time points
  • Identifying drivers of the dynamic process (e.g., why are sales rising?)
  • Identifying tipping points (crashes, spikes…)
  • Identifying covarying behavior (sectors that behave similarly, stocks that influence each other, daily rising/falling patterns…)
[Figure: Dow Jones Industrial Average]
Morse-Smale Clustering
Multivariate technique from topology similar to mode clustering
  • Find peaks and valleys in data by filtering on a defined function:
      ◦ A watershed on mountains
      ◦ Dribbling a soccer ball across a field of hills
  • Separate data based on shared peaks and valleys
  • Many nice developments on convergence and theoretical properties
R package has nice dimensionality reduction plots to highlight cluster differences with respect to the filter function and predictor sets
Dimensionality Reduction and Visualization
Helpful in visualizing multivariate trends and group differences, particularly for multivariate time series data
Assume data lies in a lower-dimensional subspace and map full dataset to that subspace (right)
Types of methods:
  • Linear (principal component analysis, or PCA)
  • Nonlinear (manifold learning)
  • Local (preserving neighborhood metrics like distance between points)
  • Global (preserving global characteristics like connectedness and limits)
Manifold learning methods related to a branch of mathematics called differential geometry
Manifold Learning Methods
Three main methods considered in this analysis:
  • Multidimensional scaling (MDS)
      ◦ Global method based on distance preservation and matrix decomposition
      ◦ Distances can be Euclidean, geodesic, Manhattan...
      ◦ Nice theoretical result relating it to PCA when best subspace is linear
  • Locally linear embedding (LLE)
      ◦ Local method based on nearest neighbor graph, weighting, and matrix decomposition
      ◦ Related to ISOMAP and other methods
  • t-distributed stochastic neighbor embedding (t-SNE)
      ◦ Local and global method based on mapping of probability distributions and random walks
      ◦ Preserves both local and global characteristics of the original data space
      ◦ Very strong performance on a variety of problems lately
[Figure: Breast Cancer Dataset Comparison]
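The link between MDS and PCA can be checked numerically. For classical (metric) MDS on Euclidean distances, the embedding coordinates coincide with the principal component scores up to a sign flip per axis. The following is a minimal sketch on random data (the matrix `X` is made up for illustration), using only base R's `cmdscale()` and `princomp()`.

```r
# Random data with some correlation between columns (illustrative only).
set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
X[, 2] <- X[, 1] + 0.1 * X[, 2]

# PCA scores on the covariance matrix (first two components).
pca_scores <- princomp(X)$scores[, 1:2]

# Classical MDS on Euclidean distances (two dimensions).
mds_coords <- cmdscale(dist(X), k = 2)

# Axis signs are arbitrary in both methods, so compare absolute values.
max_abs_diff <- max(abs(abs(pca_scores) - abs(mds_coords)))
print(max_abs_diff)  # expected to be numerically negligible
```

With a non-Euclidean distance (Manhattan, geodesic) the two embeddings generally differ, which is where MDS adds flexibility over PCA.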
Example Stock Market Dataset
Emerging markets
  • Important for investors
  • Future drivers of global trade
  • Global trends
  • Daily fluctuations
  • Tipping points (crashes and opportunities)
This example:
  • Recent Kaggle dataset of daily National Stock Exchange of India prices from July 2003 to February 2018:
      ◦ https://www.kaggle.com/abhishekyana/nse-listed-1384-companies-data/data
  • Cleaned (nulls removed, <1%) and daily fluctuation ranges added (7 total time series columns)
  • 3616 days included
Clustering Results
R package (msr)
  • 10 nearest neighbors
  • Persistence level = 1
  • 5 level splits
  • Plot of group trajectories (far left)
4 distinct groups
  • 2 represent stable trends (red, blue)
  • 2 represent transition points in market behavior (green, aqua)
PCA Plot
R function princomp() with 2 components
Fits quite well and shows spread within each cluster
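A minimal sketch of a `princomp()` call of this kind is shown below. The price series is a synthetic random walk, not the NSE data, and the column construction is an assumption made only so the example runs on its own.

```r
# Synthetic daily prices (illustrative stand-in for the NSE columns).
set.seed(7)
n <- 500
px_close <- 1000 + cumsum(rnorm(n, mean = 2, sd = 15))
px_open  <- px_close + rnorm(n, sd = 5)
px_high  <- pmax(px_open, px_close) + abs(rnorm(n, sd = 5))
px_low   <- pmin(px_open, px_close) - abs(rnorm(n, sd = 5))
prices <- data.frame(open = px_open, high = px_high,
                     low = px_low, close = px_close)

# Covariance-based PCA; keep the first two components for plotting.
fit    <- princomp(prices)
scores <- fit$scores[, 1:2]

# Share of total variance captured by each component.
var_explained <- fit$sdev^2 / sum(fit$sdev^2)
print(round(var_explained, 4))

# Uncomment to plot; points could be colored by cluster label if available:
# plot(scores, xlab = "Component 1", ylab = "Component 2")
```

When the price columns move together, as they do here, the first component absorbs nearly all of the variance, which is one way to judge how well a two-component fit describes the data.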
MDS Plot
R function cmdscale() with 2 components and a Euclidean distance metric
Relationships very linear and well-separated globally
  • Matches PCA well
  • Separates into:
      1. Daily price
      2. Daily fluctuation
[Figure: MDS Results; x-axis: Dimension 1 (0 to 15000); y-axis: Dimension 2 (-600 to 0)]
LLE Plot
R function lle() with 2 components and 10 nearest neighbors (lle package)
Separation and fit not great
Suggests global behavior more important than local for this time series
[Figure: LLE Results; x-axis: Dimension 1 (0 to 3); y-axis: Dimension 2 (-4 to 1)]
t-SNE Plot
R package dimRed with function getDimRedData(), perplexity (smoothing) at 80, 2 components, and tsne method
Parses out tipping points within growth period and exact moments of transitional events (see green group)
[Figure: tSNE Results; x-axis: Dimension 1 (-30 to 40); y-axis: Dimension 2 (-30 to 30)]
Deep Dive into MDS Components
MDS components separate into prices (component 1) and fluctuation ranges (component 2), summarized in correlation table
Fluctuation ranges increasing as the market gains points (left)

Original Time Series   MDS Component 1   MDS Component 2
open                   1.00E+00           3.25E-03
high                   1.00E+00          -6.71E-03
low                    1.00E+00           9.00E-03
fluctuation.range      6.84E-01          -7.06E-01
close                  1.00E+00          -2.56E-03
day.range              5.14E-01          -7.47E-01
adj_close              1.00E+00          -2.41E-03
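A correlation table of this shape can be built by correlating each original column with each MDS coordinate. The sketch below uses synthetic prices rather than the NSE data. The definitions of `fluctuation.range` (high minus low) and `day.range` (absolute close minus open) are assumptions, as the deck does not state how those columns were computed.

```r
# Synthetic daily prices (illustrative only).
set.seed(7)
n <- 500
px_close <- 1000 + cumsum(rnorm(n, mean = 2, sd = 15))
px_open  <- px_close + rnorm(n, sd = 5)
px_high  <- pmax(px_open, px_close) + abs(rnorm(n, sd = 5))
px_low   <- pmin(px_open, px_close) - abs(rnorm(n, sd = 5))

nse <- data.frame(
  open  = px_open,
  high  = px_high,
  low   = px_low,
  close = px_close,
  fluctuation.range = px_high - px_low,        # assumed definition
  day.range         = abs(px_close - px_open)  # assumed definition
)

# Two-dimensional classical MDS on Euclidean distances.
mds <- cmdscale(dist(nse), k = 2)

# Rows: original series; columns: correlation with each MDS coordinate.
cor_table <- cor(nse, mds)
colnames(cor_table) <- c("MDS Component 1", "MDS Component 2")
print(round(cor_table, 3))
```

The signs of the MDS axes are arbitrary, so only the magnitudes of the correlations are meaningful when comparing against a table like the one above.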
Transition Periods Deep Dive
Transition periods overlap with long-term trends
Shorter time-to-transition periods in recent years
Results Overview
NSE shows exponential growth in a time period of changes
  • New regulations
  • Oil price drops
  • Fall of inflation
Tipping points of growth
  • Includes current period, starting late 2017/early 2018
  • Predicted, in late 2017, the NSE tumble of February 2018
  • Crash predicted by several economists for sometime in 2018:
      ◦ https://www.getmoneyrich.com/indian-stock-market-correction-likely-in-2017-2018/
      ◦ https://www.livemint.com/Money/pXdnLHA2r1FJfwJhFEDqjO/Stock-market-crash-Experts-divided-on-whether-theres-more.html
Fluctuations and volatility
  • Increasing in past few years
  • Prices can vary a lot during the day while opening and closing at similar values
Conclusions
Clustering and dimensionality reduction for multivariate data exploration
  • Helpful for understanding multivariate time series data
  • Helpful for understanding other types of data prior to analysis
Performs very well, showing behavior deviations before major events
Can provide an understanding of covariance structure (relationships between stocks, volatility within a market…)
References
Farrelly, C. M. (2017). Dimensionality reduction ensembles. arXiv preprint arXiv:1710.04484.
Gerber, S., Rübel, O., Bremer, P. T., Pascucci, V., & Whitaker, R. T. (2013). Morse–Smale regression. Journal of Computational and Graphical Statistics, 22(1), 193-214.
Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1), 1-27.
Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323-2326.
Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37-52.
ResearchGate profile with folder for talk (data, R code, PPT): https://www.researchgate.net/profile/Colleen_Farrelly2

High-Dimensional Data Visualization, Geometry, and Stock Market Crashes
