Just the Basics: Core Data Science Skills

William Cukierski, PhD
will.cukierski@kaggle.com

Ben Hamner
ben.hamner@kaggle.com

Photo by mikebaird, www.flickr.com/photos/mikebaird
JUST the basics

We mean the basics
 –  Ask dumb questions
    (we’ll give dumb answers)
 –  We can’t be comprehensive, but we can omit pretense and jargon
 –  Expect a little Python, R, Matlab, Excel, command line, hand-waving
Pronounced Kah-gull (as in waggle), not Kegel (as in bagel)
Before we get started

  You’ll need a Kaggle account
    www.kaggle.com/account/register
  Create a team for the competition
    www.kaggle.com/c/just-the-basics-strata-2013
    Add (Strata) to the end of your team name
    e.g. William Cukierski (Strata)
Agenda:
Preliminaries
Identifying a Problem

Performing the analysis

Visualizing the Solution

Contest!
Will background
Physics & Biomedical Engineering
    –  Studied machine learning for diagnosis of pathology images
    –  Constantly reinventing sophomore-level CS concepts
Former “successful” machine learning competitor
    –  Successful?
        •  Finished near top?
        •  Got me a job?
        •  Fooled people into believing I understand stats
           (a.k.a. “data scientist”)
Ben Background
Biomedical Engineering & Electrical Engineering
   –  Applied machine learning to improve brain-computer interface
   –  Software development in various languages / domains
Machine learning competitions
   –  Top finishes in many, 2010-2011
   –  Teamed up with Will on several
   –  Switched to the dark side, spent much of the past year designing competitions at Kaggle
Photo: Driving a Brain-Controlled Wheelchair
The unfortunate hype of modern analytics
•    BIG DATA!
•    Every second 6.2 trillion exabytes of data are being collected
•    Need shared vocabulary, shared protocols
•    Need to leverage
      –    weather reports
      –    surveys
      –    text documents
      –    human genomes
      –    regulatory information
      –    cell phone logs
      –    satellite surveillance
      –    etc.
      –    etc.
      –    etc.
What do we do about it?

•  Create committees, consortiums, taxonomies, platforms, frameworks, clouds
•  Create acronyms for our committees, consortiums, taxonomies, platforms, frameworks, clouds
•  Go to conferences to promote and learn about our acronym’d things
•  And if time permits and the mood strikes?
   Work
“I’m ready to leave now”
Big Data Barry
Lives by the Shirky Principle:
     Preserving the problem to which he is the solution

Favorite talking points
     Data provenance, data warehousing, data privacy, data regulations, data silos,
     need for standards, need for standards on standards of standards,
     lack of data correctness, need for communication

Source: http://mojette.deviantart.com/
“Listen, I’ve been in this field for 22 years. The Bayesian guys in the modeling group are never gonna talk to the IT guys because they don’t speak the same language. In my 22 years of experience, what we need are tighter standards around what the processes should be for requesting data, how that data should be stored, and who should have access to the data. Also privacy. Privacy is a thing about which I have no clue, but nonetheless I’m compelled to steamroll even the most benign use of our data for anything beyond occupying a database. Oh, and speaking of databases and my 22 years of experience, we need stricter governance about the schemas and policies that inform the ways the data gets federated, so the model guys will stop trying to implement things that’ll never work.…”
“Seriously, guys, let me out”
The plight of the data scientist

Job description:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Job reality:
Data Scientist (n.): Person who is worse at statistics than any statistician and worse at software engineering than any software engineer.
“This problem can only be solved by an 8th-order kernel projection onto an orthonormal space of homoscedastic eigentensors”

“I’m making an Excel VBA script to access our Oracle database and find the mean of the revenue column!”

“Data science (noun): Statistics done wrong”

“The boss is going to have my neck if I can’t get this Hadoop iPhone app ready in time for Strata”
Data science

The application of scientific experimentation (hypothesis testing, model generation, statistical analysis) in problem-agnostic ways.

Not data science
{infographics, apps, site architecture, sending JSON thingies around, JavaScript frameworks, web analytics, plotting tweets on maps, cloud storage, domains that end in .io, any idea/thing/product that touches data}
Agenda:
Preliminaries
Identifying a Problem

Performing the analysis

Visualizing the Solution

Contest!
Figure: analytics capabilities ordered by sophistication (vertical axis) against gain (horizontal axis), most sophisticated first

Analytics
   Optimization                 What’s the best that can happen?
   Predictive modeling          What will happen next?
   Forecasting/extrapolation    What if these trends continue?
   Statistical analysis         Why is this happening?
Access and reporting
   Alerts                       What actions are needed?
   Query/drill down             What exactly is the problem?
   Ad hoc reports               How many, how often, where?
   Standard reports             What happened?

Source: Competing on Analytics, Davenport/Harris, 2007
When to use data

Asking specific questions is mostly harmless
   –  How many users bought shampoo X at store Y last quarter?
Prediction is not a free lunch
   –  Being data-driven and wrong is easy and bad
   –  Fancy models should serve fancy questions
       •  Don’t forecast something that can be measured
Human knowledge precedes machine knowledge
   –  Sometimes black boxes work
   –  Often, they don’t: earthquakes, finance models, etc.
When to use data

Human experts are good at generalization

Human experts are bad at
   –    Accurate predictions
   –    Estimating the uncertainty of their predictions
   –    Making the same prediction under the same evidence
   –    Updating predictions in the face of new evidence
   –    Ignoring unrelated evidence
http://www.nytimes.com/interactive/science/rock-paper-scissors.html
We need to teach the computer to generalize




                              laptop:~ wcuk$ RUN IT’S A BEAR
                              -bash: BEAR: threat not found
…without overfitting


                       laptop:~ wcuk$ RUN IT’S A BEAR
                       run: Must specify one of –black –grizzly –teddy
                       laptop:~ wcuk$ RUN IT’S A BEAR -grizzly
                       run: Are you sure you want to run? (y/n)
                       y
                       run: Enter the bear’s name:
                       Rupert
                       run: Is it Rupert with the scar on his ear? He’s
                       cool. He’s more of a salmon kind of bear. (y/n):
                       n
                       run:...RUN!!!!!!!
Storing data

      Binary             Text          Database

“If you wish to make an apple pie from scratch, you must first invent the universe.” – Carl Sagan
Reading data into a useful format

 We overcomplicate storage and formats
     –  Databases are quite often a bad choice
     –  Most data science is a batch process on tabular data
     –  Your debugging cycle should be fast

 Why text?
     –    Simple
     –    Universal
     –    Fast (to read/write/debug)
     –    Transparent
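As a rough illustration of how little machinery tabular text needs, here is a minimal Python sketch using only the standard library. The column names and rows are made up for the example (they echo the shampoo question from earlier) and are not data from the talk.

```python
import csv
import io

# Hypothetical tabular text; in practice this would be a file on disk.
SAMPLE = (
    "user_id,store,product,quantity\n"
    "1,Y,shampoo X,2\n"
    "2,Y,soap,1\n"
    "3,Y,shampoo X,1\n"
)

def read_table(handle):
    """Read delimited text into a list of dicts, one dict per row."""
    return list(csv.DictReader(handle))

rows = read_table(io.StringIO(SAMPLE))

# A specific question answered directly from the text file:
# how many units of shampoo X were bought?
shampoo_units = sum(int(r["quantity"]) for r in rows if r["product"] == "shampoo X")
print(len(rows), shampoo_units)
```

Because the format is plain text, the same file can be opened in an editor, diffed, or piped through command-line tools while debugging.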
Most data is not useful for scientific experimentation
Too “macro” (lacking causal detail)
Meant for human consumption
Structured data is not always machine ready
Game 1
    Seat 1: Solracca ($95.30 in chips)
    Seat 2: BrickT63 ($127.10 in chips)
    Seat 3: sven160482 ($184.30 in chips)
    Seat 4: Adelantez ($103 in chips)
    Seat 6: manfred zeal ($155.50 in chips)
    Solracca: posts small blind $0.50
    BrickT63: posts big blind $1
    *** HOLE CARDS ***
    sven160482: raises $1 to $2
    Adelantez: raises $5.50 to $7.50
    manfred zeal: folds
    Solracca: folds
    BrickT63: folds
    sven160482: folds
    Uncalled bet ($5.50) returned to Adelantez
    Adelantez collected $5.50 from pot
    *** SUMMARY ***
    Total pot $5.50 | Rake $0
    Seat 4: Adelantez collected ($5.50)

Game 2
    Seat 1: Kingcovey ($108.65 in chips)
    Seat 3: VoronIN_exe ($119.80 in chips)
    Seat 4: ehle123 ($104 in chips)
    Seat 5: MercuriusAA ($107.60 in chips)
    Seat 6: budapestkin ($133.15 in chips)
    budapestkin: posts small blind $0.50
    Kingcovey: posts big blind $1
    *** HOLE CARDS ***
    VoronIN_exe: raises $2 to $3
    ehle123: folds
    MercuriusAA: folds
    budapestkin: calls $2.50
    Kingcovey: folds
    *** FLOP *** [7c Tc Ks]
    budapestkin: checks
    VoronIN_exe: bets $4.45
    budapestkin: calls $4.45
    *** TURN *** [7c Tc Ks] [8c]
    budapestkin: checks
    VoronIN_exe: checks
    *** RIVER *** [7c Tc Ks 8c] [Kc]
    budapestkin: bets $11
    VoronIN_exe: folds
    Uncalled bet ($11) returned to budapestkin
    budapestkin collected $15.15 from pot
    *** SUMMARY ***
    Total pot $15.90 | Rake $0.75
    Seat 6: budapestkin collected ($15.15)
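To make the "not machine ready" point concrete, here is a hedged Python sketch that turns a few hand-history lines like the ones above into tabular records. The regular expressions and field names (`seat`, `player`, `chips`, `action`, `amount`) are assumptions chosen for this illustration; a real parser would need to handle many more line types.

```python
import re

# A few lines copied from the Game 2 hand history.
HAND = """\
Seat 1: Kingcovey ($108.65 in chips)
Seat 6: budapestkin ($133.15 in chips)
budapestkin: posts small blind $0.50
VoronIN_exe: raises $2 to $3
ehle123: folds
*** FLOP *** [7c Tc Ks]
budapestkin: bets $11
"""

SEAT = re.compile(r"^Seat (\d+): (.+) \(\$([\d.]+) in chips\)$")
ACTION = re.compile(
    r"^(.+?): (posts small blind|posts big blind|raises|calls|bets|checks|folds)(.*)$"
)
AMOUNT = re.compile(r"\$([\d.]+)")

def parse_hand(text):
    """Return (seats, actions) as lists of dicts; unrecognized lines are skipped."""
    seats, actions = [], []
    for line in text.splitlines():
        line = line.strip()
        m = SEAT.match(line)
        if m:
            seats.append(
                {"seat": int(m.group(1)), "player": m.group(2), "chips": float(m.group(3))}
            )
            continue
        m = ACTION.match(line)
        if m:
            # For "raises $2 to $3" keep the final amount (the raise-to value).
            amounts = [float(a) for a in AMOUNT.findall(m.group(3))]
            actions.append(
                {
                    "player": m.group(1),
                    "action": m.group(2),
                    "amount": amounts[-1] if amounts else None,
                }
            )
    return seats, actions

seats, actions = parse_hand(HAND)
```

Even this small sketch silently drops the board cards and the summary lines, which is the kind of brittleness to expect when data was written for human consumption.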
A word of caution on scraping
•  Scraping is time intensive, unleveraged, brittle
•  Before you code, research existing libraries!
    –  Will solve 95% of the problems you don’t even know you will have
    –  E.g. web scraping using Python’s BeautifulSoup
   import re
   from urllib.request import urlopen, urlretrieve
   from bs4 import BeautifulSoup

   BASE = "http://www.kaggle.com"
   page = urlopen(BASE + "/competitions")
   soup = BeautifulSoup(page.read(), "html.parser")

   # Unique hrefs (stands in for the slide's uniqify helper); skip anchors without one
   hrefs = {a.get("href") for a in soup.find_all("a") if a.get("href")}

   for href in sorted(hrefs):
       # Competition pages live under /c/<name>
       if re.search(r"^/c/.*", href):
           # "/c/foo" -> "_c_foo.zip" -> "foo.zip"
           fileName = (href.replace("/", "_") + ".zip")[3:]
           # Download the file (stands in for the slide's getStuff helper)
           urlretrieve(BASE + href + "/publicleaderboarddata.zip", fileName)
Excel has a time and place
   –  Looking at data
   –  Pivot tables
   –  Quick plots to verify things
Never:
   –  Pass spreadsheets around
   –  “Code” in Excel
   –  Create workflows that require copy/pasting data around
Excel
Agenda:
Preliminaries
Identifying a Problem

Performing the analysis

Visualizing the Solution

Contest!
Command line
Glossary

features = attributes = independent variables

targets = gold standard = ground truth = dependent variable(s)

training set = data & targets used to train a model

validation set = data & targets used as feedback in model training

test set = separate data & targets used only to evaluate the model

cross validation = partitioning the training set to estimate how well a model will generalize
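The cross validation entry can be sketched in a few lines of Python. This is one possible way to partition row indices into k folds, assuming 600 training rows (the size of the contest training set) and k = 5; the function name and the seed are choices made for this example only.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, validation_idx) pairs that partition range(n_rows) into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)          # shuffle once so folds are random
    folds = [idx[i::k] for i in range(k)]     # every k-th shuffled index goes to a fold
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

splits = list(k_fold_indices(n_rows=600, k=5))
for train_idx, validation_idx in splits:
    # A model would be fit on train_idx and scored on validation_idx here.
    print(len(train_idx), len(validation_idx))
```

Each row is used for validation exactly once, so averaging the k validation scores gives an estimate of how well the model generalizes without touching the test set.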
Diagram labels: Read, Feature Extraction, Learn; Train, Generalize, Test
Bayes theorem

How to update beliefs in the face of evidence?
For proposition A and evidence B:

    \[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]

   –  P(A) = prior (belief in A)
   –  P(B) = evidence
   –  P(A | B) = posterior (belief in A given B)
   –  P(B | A) = likelihood

    \[ P(\text{female} \mid \text{long hair}) = \frac{P(\text{long hair} \mid \text{female})\, P(\text{female})}{P(\text{long hair})} \]
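A small worked example of the long-hair formula, in Python. The three input probabilities are invented for illustration and are not from the talk; the evidence term P(long hair) is obtained by summing over both cases (female and not female).

```python
def posterior(prior, likelihood, evidence):
    """Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
p_female = 0.5
p_long_given_female = 0.75
p_long_given_male = 0.15

# Evidence: total probability of long hair across both groups.
p_long = p_long_given_female * p_female + p_long_given_male * (1 - p_female)

p_female_given_long = posterior(p_female, p_long_given_female, p_long)
print(round(p_female_given_long, 3))
```

Under these assumed numbers, seeing long hair moves the belief in "female" from the 0.5 prior to about 0.83.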
R
MATLAB
Agenda:
Preliminaries
Identifying a Problem

Performing the analysis

Visualizing the Solution

Contest!
Visualization

Speak the language of your audience
    –  Use simple plots
    –  Use units that matter (dollars, time, widgets)
    –  Include the units!
    –  Don’t use acronyms!

Most visualization should be internal facing (am I doing this right?) and not external facing (hey check this out!)
•  Plotting raw features
•  Looking for outliers, anomalies, correlation

•  Babysitting model performance
•  Looking for optima
•  Watching for sensitivity to initial conditions, perturbations

•  Verifying feature selection or dimensionality reduction
•  Looking at manifold density
•  Looking at class separation

•  Summarizing
•  Checking the result is reasonable
•  Comparisons to the alternative
Your job is to solve a problem
  –  Sell the message, not the graphic
Avoid chartjunk
    “The purpose of decoration varies — to make the graphic appear
    more scientific and precise, to enliven the display, to give the
    designer an opportunity to exercise artistic skills. Regardless of
    its cause, it is all non-data-ink or redundant data-ink, and it is
    often chartjunk.” –Edward Tufte
source: http://i.dailymail.co.uk/i/pix/2012/03/21/article-2118152-124602BE000005DC-0_964x528.jpg
source: http://www.fivethirtyeight.com/2009/10/older-and-wealthier-people-are-more.html
Election fraud: 2D histograms of the number of units for a given voter turnout (x axis) and the percentage of votes (y axis) for the winning party
source: http://www.pnas.org/content/early/2012/09/20/1210722109.abstract
ggplot2
Agenda:
Preliminaries
Identifying a Problem

Performing the analysis

Visualizing the Solution

Contest!
Make a spam detector
The data represents a corpus of emails. Some are spam and some are normal.
•  Due to time constraints, feature extraction is done for you:
   –  train.csv: contains 600 emails x 100 features
   –  train_labels.csv: contains the 600 training labels (1 = spam, 0 = normal)
   –  test.csv: contains 4000 emails x 100 features
•  Submit a file with each of the 4000 predictions on a separate line (in the same order as test.csv).
   –  No header is necessary
   –  Predictions can be continuous numbers or 0/1 labels
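The submission format described above can be produced with a few lines of Python. This is only a sketch: the constant 0.5 prediction is a placeholder baseline, and the output file name and temporary directory are assumptions made so the example runs anywhere.

```python
import os
import tempfile

def write_submission(predictions, path):
    """Write one prediction per line, in order, with no header row."""
    with open(path, "w") as handle:
        for p in predictions:
            handle.write(f"{p}\n")

# Placeholder baseline: the same spam score for every one of the 4000 test emails.
N_TEST = 4000
baseline = [0.5] * N_TEST

out_path = os.path.join(tempfile.gettempdir(), "submission.csv")
write_submission(baseline, out_path)

# Read the file back to confirm the shape before uploading.
with open(out_path) as handle:
    lines = handle.read().splitlines()
print(len(lines), lines[0])
```

Real predictions would replace `baseline`, kept in the same row order as test.csv.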
How the leaderboard works

Return%   ProductID   Dept        Price    MFR
1.94      54323       Household   54.95    USA
0.023     92356       Household   9.95     USA
0.8       78023       Computer    4.5      China
0.01      12340       Audio       109.99   China
0.41      31240       Audio       29.99    Taiwan
0.97      12351       Hardware    54.95    Mexico
0.0115    90141       Hardware    4.99     USA
0.4       81240       Hardware    6.55     Taiwan
0.03      14896       Computer    211.99   Korea
0.205     62132       Computer    1100     USA
1.6878    54323       Audio       34.99    USA
0.0345    92356       Audio       7.99     USA
0.64      78023       Household   229.9    Brazil
0.72      12340       Audio       19.95    Mexico
0.41      31240       Computer    6.99     Taiwan
1.94      54323       Hardware    11.99    Taiwan
0.023     92356       Household   2.05     USA
0.08      78023       Computer    99.99    USA
2.09      12340       Computer    129.99   China
1.1       31240       Audio       18.99    China
How the leaderboard works

Set        Return%   ProductID   Dept        Price    MFR
Training   1.94      54323       Household   54.95    USA
Training   0.023     92356       Household   9.95     USA
Training   0.8       78023       Computer    4.5      China
Training   0.01      12340       Audio       109.99   China
Training   0.41      31240       Audio       29.99    Taiwan
Training   0.97      12351       Hardware    54.95    Mexico
Training   0.0115    90141       Hardware    4.99     USA
Training   0.4       81240       Hardware    6.55     Taiwan
Training   0.03      14896       Computer    211.99   Korea
Training   0.205     62132       Computer    1100     USA
Training   1.6878    54323       Audio       34.99    USA
Training   0.0345    92356       Audio       7.99     USA
Training   0.64      78023       Household   229.9    Brazil
Test       0.72      12340       Audio       19.95    Mexico
Test       0.41      31240       Computer    6.99     Taiwan
Test       1.94      54323       Hardware    11.99    Taiwan
Test       0.023     92356       Household   2.05     USA
Test       0.08      78023       Computer    99.99    USA
Test       2.09      12340       Computer    129.99   China
Test       1.1       31240       Audio       18.99    China

Solution (“Ground Truth”): the Return% values of the Test rows
How the leaderboard works

Set        Return%   ProductID   Dept        Price    MFR
Training   1.94      54323       Household   54.95    USA
Training   0.023     92356       Household   9.95     USA
Training   0.8       78023       Computer    4.5      China
Training   0.01      12340       Audio       109.99   China
Training   0.41      31240       Audio       29.99    Taiwan
Training   0.97      12351       Hardware    54.95    Mexico
Training   0.0115    90141       Hardware    4.99     USA
Training   0.4       81240       Hardware    6.55     Taiwan
Training   0.03      14896       Computer    211.99   Korea
Training   0.205     62132       Computer    1100     USA
Training   1.6878    54323       Audio       34.99    USA
Training   0.0345    92356       Audio       7.99     USA
Training   0.64      78023       Household   229.9    Brazil
Test       ?         12340       Audio       19.95    Mexico
Test       ?         31240       Computer    6.99     Taiwan
Test       ?         54323       Hardware    11.99    Taiwan
Test       ?         92356       Household   2.05     USA
Test       ?         78023       Computer    99.99    USA
Test       ?         12340       Computer    129.99   China
Test       ?         31240       Audio       18.99    China

Solution (“Ground Truth”): the hidden Return% values of the Test rows
How the leaderboard works

Set        Return%   ProductID   Dept        Price    MFR
Training   1.94      54323       Household   54.95    USA
Training   0.023     92356       Household   9.95     USA
Training   0.8       78023       Computer    4.5      China
Training   0.01      12340       Audio       109.99   China
Training   0.41      31240       Audio       29.99    Taiwan
Training   0.97      12351       Hardware    54.95    Mexico
Training   0.0115    90141       Hardware    4.99     USA
Training   0.4       81240       Hardware    6.55     Taiwan
Training   0.03      14896       Computer    211.99   Korea
Training   0.205     62132       Computer    1100     USA
Training   1.6878    54323       Audio       34.99    USA
Training   0.0345    92356       Audio       7.99     USA
Training   0.64      78023       Household   229.9    Brazil
Test       0.03      12340       Audio       19.95    Mexico
Test       1.298     31240       Computer    6.99     Taiwan
Test       0.94      54323       Hardware    11.99    Taiwan
Test       0.04      92356       Household   2.05     USA
Test       0.36      78023       Computer    99.99    USA
Test       1.2       12340       Computer    129.99   China
Test       0.02      31240       Audio       18.99    China

Submission: the predicted Return% values in the Test rows
How the leaderboard works

Set        Return%   ProductID   Dept        Price    MFR
Training   1.94      54323       Household   54.95    USA
Training   0.023     92356       Household   9.95     USA
Training   0.8       78023       Computer    4.5      China
Training   0.01      12340       Audio       109.99   China
Training   0.41      31240       Audio       29.99    Taiwan
Training   0.97      12351       Hardware    54.95    Mexico
Training   0.0115    90141       Hardware    4.99     USA
Training   0.4       81240       Hardware    6.55     Taiwan
Training   0.03      14896       Computer    211.99   Korea
Training   0.205     62132       Computer    1100     USA
Training   1.6878    54323       Audio       34.99    USA
Training   0.0345    92356       Audio       7.99     USA
Training   0.64      78023       Household   229.9    Brazil
Test       0.03      12340       Audio       19.95    Mexico
Test       1.298     31240       Computer    6.99     Taiwan
Test       0.94      54323       Hardware    11.99    Taiwan
Test       0.04      92356       Household   2.05     USA
Test       0.36      78023       Computer    99.99    USA
Test       1.2       12340       Computer    129.99   China
Test       0.02      31240       Audio       18.99    China

Submission: the predicted Return% values in the Test rows, divided between the Public Leaderboard and the Private Leaderboard
Area under the receiver operating characteristic (ROC) curve
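
One way to read this metric: it is the probability that a randomly chosen spam email receives a higher score than a randomly chosen normal one, with ties counted as half. Below is a minimal pure-Python sketch of that definition. The function name `auc` and the toy labels and scores are made up for illustration; the contest's actual scoring code is not shown in the slides.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Fraction of (positive, negative) pairs in which the positive
    example gets the higher score; a tie counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for this sketch: 1 = spam, 0 = normal
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

continuous_auc = auc(labels, scores)
# The same predictions, thresholded at 0.5 into hard 0/1 labels
binary_auc = auc(labels, [1 if s >= 0.5 else 0 for s in scores])

print(continuous_auc, binary_auc)
```

On this toy data the continuous scores get 7 of the 9 pairs right, while the thresholded labels get only 6 of 9. Hard labels throw away ranking information that this metric rewards. The double loop is quadratic in the number of emails, which is fine for a sketch but slow on large sets.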
Example Model
Think about

•    Missing values
•    Noise
•    Combinations of features
•    Transformations of features (e.g. log)
•    Combinations of methods
•    Overfitting
•    Binary vs. continuous predictions
•    How good is a good spam detector?
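
Two of the items above, missing values and log transformations, can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `clean_feature` is hypothetical, missing entries are assumed to arrive as `None`, observed values are assumed non-negative, and median filling is just one of several reasonable choices.

```python
import math
from statistics import median


def clean_feature(values):
    """Fill missing entries (None) with the median of the observed
    values, then apply log(1 + x) to compress a heavy right tail.

    Assumes every observed value is non-negative.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [math.log1p(fill if v is None else v) for v in values]


raw = [1.0, None, 100.0, 9.0]  # made-up feature column
cleaned = clean_feature(raw)
print(cleaned)
```

The gap is filled with 9.0, the median of the three observed values. After the transform, the largest entry is under 7 times the smallest, where the raw values differed by a factor of 100.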
