Python for Business
        Intelligence


Štefan Urbánek ■ @Stiivi ■ stefan.urbanek@continuum.io ■ PyData NYC, October 2012
python business intelligence

Results: Q/A and articles with Java solution references
(not listed here)
Why?
Overview

■ Traditional Data Warehouse
■ Python and Data
■ Is Python Capable?
■ Conclusion
Business
Intelligence

people ■ technology ■ processes

Data Sources → Extraction, Transformation, Loading → Analysis and Presentation
Data Governance ■ Technologies and Utilities
Traditional Data
  Warehouse
■ Extracting data from the original sources

■ Quality assuring and cleaning data

■ Conforming the labels and measures
   in the data to achieve consistency across the original sources



■ Delivering data in a physical format that can be used by
   query tools, report writers, and dashboards.




                         Source: Ralph Kimball – The Data Warehouse ETL Toolkit
Source Systems → Staging Area → Operational Data Store → Datamarts

Source Systems (structured documents, databases, APIs)
    → Temporary Staging Area
    → staging (L0)
    → Operational Data Store: relational (L1)
    → Datamarts: dimensional (L2)
"real time" = daily
Multi-dimensional
    Modeling
aggregation, browsing, slicing and dicing

business / analyst's point of view,
regardless of physical schema implementation
Facts

measurable ■ a fact = one data cell ■ the most detailed information

dimensions: location, type, time
Dimension

■ provide context for facts
■ used to filter queries or reports
■ control scope of aggregation of facts
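As a tiny plain-Python illustration of those three roles (the record and field names here are hypothetical, not from the talk):

```python
# a handful of fact records: "amount" is the measure,
# "location" and "type" are dimension attributes
facts = [
    {"location": "NY", "type": "retail",    "amount": 100},
    {"location": "NY", "type": "wholesale", "amount": 250},
    {"location": "SF", "type": "retail",    "amount": 80},
]

# dimensions provide context and filter queries or reports...
ny_facts = [f for f in facts if f["location"] == "NY"]

# ...and control the scope of aggregation
def aggregate(facts, dimension):
    totals = {}
    for fact in facts:
        key = fact[dimension]
        totals[key] = totals.get(key, 0) + fact["amount"]
    return totals
```

`aggregate(facts, "type")` collapses the cube along every dimension except `type`.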
Pentaho
Python and Data
   community perception*




                           *as of Oct 2012
Scientific & Financial
Python
Scientific Data

        T1 [s]    T2 [s]    T3 [s]    T4 [s]
P1      112.68    941.67    171.01    660.48
P2       96.15    306.51    725.88    877.82
P3      313.39    189.31     41.81    428.68
P4      760.62    983.48    371.21    281.19
P5      838.56     39.27    389.42    231.12

n-dimensional array of numbers
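This is exactly the shape scientific tooling expects; a sketch with NumPy, re-typing the timing table above as a 2-D array:

```python
import numpy as np

# the P×T timing table above as a 2-D array (processes × trials, seconds)
data = np.array([
    [112.68, 941.67, 171.01, 660.48],
    [ 96.15, 306.51, 725.88, 877.82],
    [313.39, 189.31,  41.81, 428.68],
    [760.62, 983.48, 371.21, 281.19],
    [838.56,  39.27, 389.42, 231.12],
])

per_process = data.sum(axis=1)   # one total per process P1..P5
per_trial   = data.mean(axis=0)  # mean per trial column T1..T4
```

Whole-array operations like these are the assumption the next slide spells out.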
Assumptions

■ data is mostly numbers
■ data is neatly organized...
■ … in one multi-dimensional array
Business Data

multiple snapshots of one source
multiple representations of the same data
categories are changing

❄
Is Python Capable?
     very basic examples
Data Pipes with
   SQLAlchemy

■ connection: create_engine
■ schema reflection: MetaData, Table
■ expressions: select(), insert()
src_engine = create_engine("sqlite:///data.sqlite")
src_metadata = MetaData(bind=src_engine)
src_table = Table('data', src_metadata, autoload=True)




target_engine = create_engine("postgres://localhost/sandbox")
target_metadata = MetaData(bind=target_engine)
target_table = Table('data', target_metadata)
clone schema:

for column in src_table.columns:
    target_table.append_column(column.copy())

target_table.create()




copy data:

insert = target_table.insert()

for row in src_table.select().execute():
    insert.execute(row)
magic used:

metadata reflection
text file (CSV) to table:




reader = csv.reader(file_stream)

# the first row holds the column names
columns = next(reader)

for column in columns:
    table.append_column(Column(column, String))

table.create()

insert = table.insert()
for row in reader:
    insert.execute(dict(zip(columns, row)))
Simple T from ETL

transformation = [
    ('fiscal_year',        {"function": int,
                            "field": "fiscal_year"}),
    ('region_code',        {"mapping": region_map,
                            "field": "region"}),
    ('borrower_country',   None),
    ('project_name',       None),
    ('procurement_type',   None),
    ('major_sector_code',  {"mapping": sector_code_map,
                            "field": "major_sector"}),
    ('major_sector',       None),
    ('supplier',           None),
    ('contract_amount',    {"function": currency_to_number,
                            "field": "total_contract_amount"}),
]

target fields ← source transformations
Transformation

for row in source:
    result = transform(row, transformation)
    table.insert(result).execute()
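The transform helper itself never appears on the slides; a minimal sketch that would satisfy the specification format above (target field name, optional "field"/"function"/"mapping" keys) could be:

```python
def transform(row, transformation):
    """Build one target record (dict) from a source row (dict)."""
    result = {}
    for target, spec in transformation:
        if spec is None:
            # no spec: copy the identically named source field
            result[target] = row[target]
        else:
            value = row[spec["field"]]
            if "function" in spec:
                value = spec["function"](value)   # e.g. int, currency_to_number
            elif "mapping" in spec:
                value = spec["mapping"][value]    # e.g. region_map
            result[target] = value
    return result
```

A sketch only: the real talk's helper may have handled missing fields or unknown mapping keys differently.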
OLAP with Cubes

Model

{
    "name": "My Model",
    "description": "...",

    "cubes": [...],
    "dimensions": [...]
}

cubes → measures
dimensions → levels, attributes, hierarchy
logical




              physical

          ❄
Application → Aggregation Browser (cubes) → backend

1   model = load_model("model.json")

2   workspace = create_workspace("sql",
                                 model,
                                 url="sqlite:///data.sqlite")

3   cube = model.cube("sales")

4   browser = workspace.browser(cube)
result = browser.aggregate(cell,
                           drilldown=["sector"])

drill-down

for row in result.table_rows("sector"):
    row.record["amount_sum"]
    row.label
    row.key
whole cube:

cell = Cell(cube)
browser.aggregate(cell)

    → Total

browser.aggregate(cell,
                  drilldown=["date"])

    → 2006 2007 2008 2009 2010

cut = PointCut("date", [2010])
cell = cell.slice(cut)

browser.aggregate(cell,
                  drilldown=["date"])

    → Jan Feb Mar Apr May ...
How can Python
  be Useful
just the Language

■ saves maintenance resources
■ shortens development time
■ saves you from going insane
Source Systems (structured documents, databases, APIs)
    → Temporary Staging Area        ← faster
    → staging (L0)
    → Operational Data Store: relational (L1)
    → Datamarts: dimensional (L2)
Data Sources → Extraction, Transformation, Loading (faster) → Analysis and Presentation (advanced)
Data Governance ■ Technologies and Utilities

understandable, maintainable
Conclusion
BI is about…

people ■ technology ■ processes

don't forget metadata
Future

who is going to fix your COBOL Java tool
 if you have only Python guys around?
Python is capable, let's start
Thank You

Twitter: @Stiivi
DataBrewery blog: blog.databrewery.org
Github: github.com/Stiivi
