ENGINEERING A DATA
SCIENTIST
Aron Ahmadia, Ph.D.
aron@ahmadia.net https://aron.ahmadia.net
Columbia University SIAM Seminar
2016.09.08
“More than anything, what data scientists do is
make discoveries while swimming in data. It’s
their preferred method of navigating the world
around them. At ease in the digital realm, they
are able to bring structure to large quantities of
formless data and make analysis possible. They
identify rich data sources, join them with other,
potentially incomplete data sources, and clean
the resulting set. In a competitive landscape
where challenges keep changing and data
never stop flowing, data scientists help decision
makers shift from ad hoc analysis to an ongoing
conversation with data.”
“Data Scientist: The Sexiest Job of the 21st Century” - HBR 2012
Engineers design systems while considering the
limitations imposed by practicality, regulation,
safety, and cost.
Their work forms the link between scientific
discoveries and their subsequent applications to
human needs and quality of life.
Engineers
Data Scientists
http://gastonsanchez.com/opinion/2014/03/18/March-BARUG-meetup/
http://chance.amstat.org/2011/09/guatemala/
DataShader J. Bednar, Continuum Analytics
https://pbs.twimg.com/profile_images/535580673201815552/-FujMxzL.jpeg
https://www.fastcompany.com/1806764/generation-flux-dj-patil
http://www.stat.purdue.edu/~sunz/Jeff_2014/index_chn.html
Capital One Financial
Big Data: The Next Frontier for Innovation, Competition, and
Productivity, McKinsey Global Institute, 2011
“Furthermore, this type of talent is difficult to
produce, taking years of training in the case of
someone with intrinsic mathematical abilities.”
http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
https://www.google.com/about/careers/search#!t=jo&jid=21965002&
http://web.worldbank.org/external/default/main?pagePK=8454041&piPK=8454059&theSitePK=8453353&JobNo=161923
https://www.facebook.com/careers/jobs/a0I1200000IAM1fEA
COMPUTER SCIENCE
PREPARATION
• Fundamentals of programming (3 credits) COMS
W1006
• Data Structures (3 credits) COMS 3134
• Also valuable:
• Algorithms - COMS 3137
• CS Core
• CS Intelligent Systems Track
MATHEMATICAL/STATISTICAL
PREPARATION
• Linear Algebra APMA 3101
• Statistics STAT 4109
• Also valuable:
• Bayesian Modeling - STAT 6102/6103
• Numerical Methods/Optimization - APMA
4300/4301
PROGRAMMING
LANGUAGE EXPERIENCE
• In general, any programming language
experience will serve you, but some programming
languages will serve you better than others
• Python and R are unquestionably the most
popular languages for data science; Julia’s star is
still rising
• Java, Scala, JavaScript, C/C++, and Lua are not
to be overlooked
http://www.tiobe.com/tiobe-index/
http://pgbovine.net/tech-privilege.htm
HANDS-ON EXPERIENCE
• Treat programming languages like real
languages: you will get more comfortable and
fluent using them with practice
• You should also get comfortable working in
a GNU/Linux command line environment
• There is no better way to get started than a Software
Carpentry Workshop, a hands-on 2-day workshop that
will get you started in Bash, Python or R, and Git.
DEEPER EXPERIENCE
• Look for opportunities to build models; work
with messy data, web data and APIs, and
geospatial visualizations; take classes; give
presentations; and teach these tools
• Get together with some friends and work on
free data science competitions such as the
Cortana Intelligence Competitions and
Kaggle
DAILY SKILLS ON MY TEAM
• Python, Bash, and Git on the command line
• Python, NumPy, scikit-learn, Matplotlib, and pandas
in PyCharm and the Jupyter Notebook
• Amazon Web Services: EC2, S3, KMS, Lambda
• Load up multiple messy datasets, clean them, make
sense of them, then present business opportunities
to other analysts and executives
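To give a feel for the last bullet, here is a minimal sketch in plain Python of cleaning two messy record sets and joining them. Everything in it is an assumption made for illustration: the field names (`id`, `region`, `acct`, `amount`), the sample rows, and the cleaning rules are hypothetical and do not come from any real dataset. In day-to-day work this kind of task would more likely be done with pandas.

```python
def clean_key(raw):
    """Normalize an identifier: strip whitespace, upper-case; None if empty."""
    if raw is None:
        return None
    key = str(raw).strip().upper()
    return key or None


def clean_amount(raw):
    """Parse text such as ' $1,200.50 ' to a float; None if missing or unparseable."""
    if raw is None:
        return None
    text = str(raw).replace("$", "").replace(",", "").strip()
    try:
        return float(text)
    except ValueError:
        return None


def join_accounts_with_spend(accounts, spend):
    """Left-join accounts to their total spend; unmatched accounts get total=None."""
    totals = {}
    for row in spend:
        key = clean_key(row.get("acct"))
        amount = clean_amount(row.get("amount"))
        if key is None or amount is None:
            continue  # drop rows we cannot attribute or parse
        totals[key] = totals.get(key, 0.0) + amount

    joined = []
    for row in accounts:
        key = clean_key(row.get("id"))
        if key is None:
            continue  # an account without an identifier cannot be joined
        region = (row.get("region") or "unknown").strip().lower()
        joined.append({"id": key, "region": region, "total": totals.get(key)})
    return joined


# Hypothetical sample data with typical problems: stray whitespace,
# inconsistent case, currency formatting, and missing values.
ACCOUNTS = [
    {"id": " a1 ", "region": "East"},
    {"id": "A2", "region": None},
    {"id": "", "region": "West"},
]
SPEND = [
    {"acct": "A1", "amount": "$1,200.50"},
    {"acct": "a1 ", "amount": "99.5"},
    {"acct": "A3", "amount": "n/a"},
    {"acct": None, "amount": "10"},
]

RESULT = join_accounts_with_spend(ACCOUNTS, SPEND)
```

Keeping unmatched accounts with a `None` total, rather than silently dropping them, makes the incompleteness of the second source visible when the result is presented to others.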
MORE RESOURCES
• Free online sites for learning to code:
• Codecademy
• HackerRank
• Software Carpentry
• Fundamental Books
• Code Complete (McConnell)
• Numerical Linear Algebra (Trefethen and Bau)
• Elements of Statistical Learning (Hastie, Tibshirani, Friedman)
• Doing Data Science (O’Neil, Schutt)
LOST SUITCASE PROBLEM
T. OLIPHANT & D. BEAZLEY
https://c1.staticflickr.com/1/164/401543806_1513447ad8_b.jpg
http://www.chicagobus.org/buses/5300/photos
• Retrieve Travis’s suitcase from a Chicago
city bus by alerting him whenever a bus on
route 22 is due near his stop.
• Could you write the code? How long would it
take you?
• A really long time the first time you did
anything like it, but if you didn’t know that
when you started, you’d probably have a
great time along the way :)

Editor's Notes

  • #3: Data Scientist: The Sexiest Job of the 21st Century - 2012 report in Harvard Business Review by Thomas Davenport and DJ Patil
  • #5: Definition lifted from Wikipedia
  • #6: So what do data scientists do? This is Megan Price. She’s the co-founder and director of research of Human Rights Data Analysis Group (HRDAG). Simply put, HRDAG is a non-profit, non-partisan organization that applies rigorous science to the analysis of human rights violations around the world. The organization has published findings on conflicts in Syria, Colombia, Chad, Kosovo, Guatemala, Perú, East Timor, India, Liberia, Bangladesh, and Sierra Leone. The organization provided testimony in the war crimes trials of Slobodan Milošević and Milan Milutinović at the International Criminal Tribunal for the former Yugoslavia, and in Guatemala's Supreme Court in the trial of General José Efraín Ríos Montt, the de facto president of Guatemala in 1982-1983. Gen. Ríos was found guilty of genocide and crimes against humanity. Most recently, the organization has published on police violence in the United States.
  • #7: (From a 2011 article on data science for the Guatemala case) Daniel Guzman, working for HRDAG, performed an analysis of random samples drawn from the millions of documents in the Guatemalan National Police Archive. The archive holds what archivists estimate to be 8 kilometers, or approximately 80 million sheets, of paper. Many of the police documents were created during the country’s internal armed conflict from 1960 to 1996, during which tens of thousands of Guatemalans disappeared.
  • #8: Another example: the DataShader project from Continuum Analytics. The image shows pickups and drop-offs for around ten million taxi cab rides in New York City. Red lines are pickups and green lines are drop-offs; the image has been dynamically ranged to maximize the amount of information your eyes can perceive.
  • #9: Tony Hey - Author of The Fourth Paradigm: Data-Intensive Scientific Discovery, published in 2009 (and one of the original authors of MPI, for those who care about these things :). The fourth paradigm is established as “data-intensive” science, as distinct from the traditional paradigms of theory, experiment, and computation.
  • #10: DJ Patil - Chief Data Scientist at the United States Office of Science and Technology Policy. He was the first to call himself a “data scientist” (while at LinkedIn), along with Jeff Hammerbacher, who did the same at Facebook in 2008.
  • #11: Jeff Wu popularized the term “data science” in his November 1997 inaugural lecture for the H. C. Carver Professorship at the University of Michigan, where he argued that statistics is data science (and should start treating itself that way).
  • #12: Richard Fairbank - Devised an information-based strategy in 1994 and launched Capital One, which completely disrupted the credit card industry and now has a market capitalization of 36 billion dollars.
  • #13: This is a quote from the 2011 MGI report on Big Data regarding a forecast for needing data scientists.
  • #14: In 2018, there will be a shortage of between 140 and 190 thousand graduates with “deep analytical talent.” By the way, 1.5 million jobs? That’s a misinterpretation of a later quote that 1.5 million analysts and managers will need to be retrained to *consume* this data.
  • #15: Google doesn’t want engineers as data scientists.
  • #16: Neither does the World Bank :(
  • #18: There are other forms of this diagram, but the key point is that data science is fundamentally interdisciplinary and requires non-trivial experience in multiple domains. Look in the middle! The unicorn! That’s you!
  • #19: Tesla career posting for data scientist
  • #20: Facebook career posting for data scientist.
  • #21: Capital One posting for a data scientist.
  • #22: An example course schedule from Columbia University.
  • #25: PYPL - Uses Google tutorial search trends to score languages
  • #26: When assessing these indices, remember that established, general languages (Java, Python, C++) enjoy greater popularity due to their versatility and venerability. But there is nothing wrong with smaller, “boutique” languages, and they are often more pleasant to work with, even if it is a little harder to find answers for them. What’s older, Python or Java? Java was born in 1995 and Python in 1991 (Numeric in 1995, NumPy in 2006). R is from 1993 but is based on S, which first appeared in 1976. Julia was born in 2012.
  • #27: “I started programming when I was five, first with Logo and then BASIC. By the time this photo was taken, I had already written several BASIC games that I distributed as shareware on our local BBS. I was fast growing bored, so my parents (both software engineers) gave me the original dragon compiler textbook from their grad school days. That's when I started learning C and writing my own simple interpreters and compilers. My early interpreters were for BASIC, but by the time I entered high school, I had already created a self-hosting compiler for a non-trivial subset of C (no preprocessor, though). Throughout most of high school, I spent weekends coding in x86 assembly, obsessed with hand-tuning code for the newly-released Pentium II chips. When I started my freshman year at MIT as a Computer Science major, I already had over ten years of programming experience. So I felt right at home there.” Is this a data scientist? Nope! It’s Philip Guo, Professor of Computer Science. And he had almost no computer programming experience before he got to MIT, where he majored in Computer Science! It’s okay if you don’t know how to program yet, but if you want to be a data scientist…
  • #28: You need hands-on experience.
  • #29: And deeper experience if you can get it. See the September 2016 Cortana Intelligence Competition for a motivating problem with non-trivial structure.
  • #30: Here’s an example of the skills and experiences people on the Machine Intelligence team use at Capital One.
  • #31: Some more non-course resources.
  • #32: A real-life example of working with messy data.
  • #33: This is Travis Oliphant, cofounder of Continuum Analytics, and his suitcase.
  • #34: Travis left his suitcase on a Chicago city bus while visiting David Beazley.
  • #35: This is a story about a time when Travis came to visit David Beazley. Travis flew into Chicago where David's office is, and took public transportation to David's office. The last leg of his trip was on bus route 22. When Travis arrived, he realized he left his suitcase on the bus. A Chicago city bus. Mere mortals would kiss that bag goodbye, but not David and Travis. They went up to the office and wrote a quick Python program to poll the city data for all of the CTA buses, track the location of all the buses on route 22, and pop open a browser window whenever one of the route 22 buses approached within a half mile of David's office. Then they'd run outside and check for Travis' suitcase. It took about eight hours until the right bus came by again, but there was his bag. The driver didn't even blink when these two guys got on the bus, grabbed a suitcase, and got back off.
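The core of the program described above can be sketched in a few lines of Python. This is a minimal illustration, not the code David and Travis actually wrote: the bus records, their field names (`id`, `route`, `lat`, `lon`), the office coordinates, and the `fetch_positions` callable are all hypothetical stand-ins for whatever the city's data feed provided. Distance is computed with the standard haversine formula.

```python
import math
import time

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def buses_near(buses, route, stop, radius_miles=0.5):
    """Return the buses on `route` within `radius_miles` of `stop` (lat, lon)."""
    return [
        bus for bus in buses
        if bus["route"] == route
        and haversine_miles(bus["lat"], bus["lon"], stop[0], stop[1]) <= radius_miles
    ]


def watch(fetch_positions, route, stop, on_alert, polls,
          interval_s=60, sleep=time.sleep):
    """Poll `fetch_positions()` `polls` times, calling `on_alert` for each nearby bus."""
    for i in range(polls):
        for bus in buses_near(fetch_positions(), route, stop):
            on_alert(bus)
        if i < polls - 1:
            sleep(interval_s)


# Hypothetical data: made-up office coordinates and one snapshot of bus positions.
OFFICE = (41.9800, -87.6680)
SNAPSHOT = [
    {"id": "1001", "route": "22", "lat": 41.9830, "lon": -87.6685},  # about 0.2 miles away
    {"id": "1002", "route": "22", "lat": 41.9000, "lon": -87.6300},  # several miles away
    {"id": "2001", "route": "36", "lat": 41.9801, "lon": -87.6681},  # close, but wrong route
]

NEARBY_IDS = [bus["id"] for bus in buses_near(SNAPSHOT, "22", OFFICE)]
```

In a real version, `fetch_positions` would download and parse the transit feed, and `on_alert` could open a browser window (for example with the standard library's `webbrowser.open` and a URL of your choosing), as in the story. Passing both in as callables keeps the distance logic testable without any network access.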