Module – 5
The PANDAS
(Chapter-5)
Reading & Writing Data
I/O API Tools
The pandas library provides two main families of I/O functions that support data analysis, collectively known as the I/O API:
1. Readers
2. Writers
Readers Writers
read_csv to_csv
read_excel to_excel
read_hdf to_hdf
read_sql to_sql
read_json to_json
read_html to_html
read_stata to_stata
read_clipboard to_clipboard
read_pickle to_pickle
read_msgpack to_msgpack (experimental)
read_gbq to_gbq (experimental)
CSV and Textual Files
If the values in a row are separated by a comma, you have the CSV
(comma-separated values) format.
Other forms of tabular data, separated by spaces or tabs, are typically
contained in text files of various types (generally with the
extension .txt).
pandas provides a set of functions specific to this type of file:
• read_csv
• read_table
• to_csv
Reading Data in CSV or Text Files
The most common operation for a person approaching data analysis is to read
the data contained in a CSV file, or at least in a text file.
myCSV_01.csv
white,red,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
Since this file is comma-delimited, you can use the read_csv() function to
read its content and convert it into a DataFrame object at the same time.
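A minimal sketch of the call (assuming myCSV_01.csv is in the working directory):
>>> import pandas as pd
>>> csvframe = pd.read_csv('myCSV_01.csv')
>>> csvframe
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse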
CSV files contain tabular data in which the values in each row are
separated by commas. But since CSV files are considered text files, you can
also use the read_table() function, as long as you specify the delimiter.
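For example, this call (a sketch) is equivalent to the read_csv() call above:
>>> pd.read_table('myCSV_01.csv', sep=',')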
Notice that in this CSV file, the headers identifying the columns are in
the first row. This is not the general case, however; it often happens that the
tabulated data begin directly on the first line.
myCSV_02.csv
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
In this case, you can have pandas assign default names to the columns by
setting the header option to None.
In addition, you can specify the names directly by assigning a list of
labels to the names option.
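A sketch of both options, using myCSV_02.csv:
>>> pd.read_csv('myCSV_02.csv', header=None)   # pandas assigns 0,1,2,... as column names
>>> pd.read_csv('myCSV_02.csv', names=['white','red','blue','green','animal'])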
In more complex cases, in which you want to create a DataFrame with a
hierarchical structure by reading a CSV file, you can extend the functionality of
the read_csv() function by adding the index_col option, assigning to it the
columns to be converted into indexes (see the example after the file listing below).
myCSV_03.csv
color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4
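A sketch of the hierarchical read, using the first two columns as a MultiIndex:
>>> pd.read_csv('myCSV_03.csv', index_col=['color','status'])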
Using RegExp for Parsing TXT Files
In other cases, the files from which to parse the data may not have
well-defined separators such as a comma or a semicolon.
In these cases, regular expressions come to our aid: you can specify a regexp
within the read_table() function using the sep option.
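For instance, if the fields of a hypothetical ch05_04.txt (the file name is only an assumption) were separated by runs of spaces or tabs of arbitrary length, you could write:
>>> pd.read_table('ch05_04.txt', sep=r'\s+')   # \s+ matches any run of whitespace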
You usually think of separators as special characters like commas,
spaces, and tabs, but in reality a separator can also be made of
alphanumeric characters or, for example, of digits such as 0.
In the following example, you need to extract the numeric part from a TXT file
in which numerical values and literal characters are completely fused,
with no explicit separator.
Remember to set the header option to None whenever the column
headings are not present in the TXT file.
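A sketch, assuming a hypothetical ch05_05.txt whose lines fuse numbers and letters, such as 000END123AAA122BBB333:
>>> # \D+ matches any run of non-digit characters; a general regexp
>>> # separator requires the python parsing engine
>>> pd.read_table('ch05_05.txt', sep=r'\D+', header=None, engine='python')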
Another fairly common task is to exclude certain lines from parsing; you do not
always want to include the headers or the unnecessary comments contained in a
file.
With the skiprows option you can exclude any lines you want, by assigning
an array containing the line numbers to ignore during parsing.
If you want to exclude the first five lines, you have to write skiprows = 5,
but if you want to rule out only the line in position 5 (counting from zero),
you have to write skiprows = [5].
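A sketch, assuming a hypothetical ch05_06.txt:
>>> pd.read_table('ch05_06.txt', sep=',', skiprows=5)    # skip the first five lines
>>> pd.read_table('ch05_06.txt', sep=',', skiprows=[5])  # skip only the line in position 5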
Reading TXT Files into Parts or Partially
When large files are processed, or when you’re only interested in portions of
these files, you often need to read the file in portions (chunks).
If, for example, you want to read only a portion of the file, you can explicitly
specify the number of lines to parse: using the skiprows and nrows options
together, you can select a starting line n (skiprows = n) and the number of
lines to be read after it (nrows = i).
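A sketch, assuming a hypothetical ch05_02.csv:
>>> pd.read_csv('ch05_02.csv', skiprows=[2], nrows=3, header=None)
If instead you want to split the file into chunks of equal size, you can use the chunksize option, which returns an iterator of DataFrames:
>>> for piece in pd.read_csv('ch05_02.csv', chunksize=3):
...     print(piece)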
To write the data contained in a DataFrame to a CSV file, use the to_csv()
function, which accepts as an argument the name of the file to generate.
As you can see from the previous example, when you write a DataFrame to a
file, both the indexes and the column headers are written to the file by
default. This default behavior can be changed by setting the two options
index and header to False.
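A sketch of both behaviors (the frame and the file names are of our choosing):
>>> import numpy as np
>>> import pandas as pd
>>> frame = pd.DataFrame(np.arange(16).reshape(4,4),
...                      index=['red','blue','yellow','white'],
...                      columns=['ball','pen','pencil','paper'])
>>> frame.to_csv('ch05_07.csv')                              # indexes and headers included
>>> frame.to_csv('ch05_07b.csv', index=False, header=False)  # values only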
One thing to take into account when writing files is that NaN values
present in a data structure appear as empty fields in the file.
You can replace these empty fields with a value of your liking by using the
na_rep option of the to_csv() function.
Common values are NULL, 0, or NaN itself.
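For example (a sketch, with a frame containing NaN values):
>>> frame3 = pd.DataFrame([[1, np.nan], [np.nan, 4]], columns=['ball','mug'])
>>> frame3.to_csv('ch05_08.csv', na_rep='NaN')   # empty fields are written as NaN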
Reading and Writing HTML Files
With regard to the HTML format, pandas provides the corresponding pair of I/O
API functions:
• read_html()
• to_html()
Writing Data in HTML
Let’s see how to convert a DataFrame into an HTML table.
The internal structure of the DataFrame is automatically converted into nested
<TH>, <TR>, and <TD> tags, retaining any internal hierarchies.
Example:
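A minimal sketch, with a small frame of our choosing:
>>> frame = pd.DataFrame(np.arange(4).reshape(2,2))
>>> frame.to_html()   # returns the HTML markup of the table as a string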
Now let’s see how to write an entire HTML page by generating a string (the steps are sketched below).
❖ First of all, create a string that contains the HTML code of the page.
❖ Then write the contents of the html string directly to a file that will be
called myFrame.html.
❖ A new HTML file, myFrame.html, will now be in your working directory.
Double-click it to open it directly in the browser; an HTML table will appear in
the upper left of the page.
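A sketch of these steps (the frame and the page title are of our choosing):
>>> import numpy as np
>>> import pandas as pd
>>> frame = pd.DataFrame(np.random.random((4,4)),
...                      index=['white','black','red','blue'],
...                      columns=['up','down','right','left'])
>>> s = ['<HTML>']
>>> s.append('<HEAD><TITLE>My DataFrame</TITLE></HEAD>')
>>> s.append('<BODY>')
>>> s.append(frame.to_html())
>>> s.append('</BODY></HTML>')
>>> html = ''.join(s)
>>> with open('myFrame.html', 'w') as html_file:
...     html_file.write(html)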
Reading Data from an HTML File
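The read_html() function parses every <table> found in the page and returns a list of DataFrames. A minimal sketch, reading back the file written above (an HTML parser such as lxml must be installed):
>>> web_frames = pd.read_html('myFrame.html')
>>> web_frames[0]   # the first (and, here, only) table in the page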
Reading Data from XML
In the list of I/O API functions, there is no specific tool for the XML
(Extensible Markup Language) format.
Although it is not listed, this format is very important, because a great
deal of structured data is available in XML format.
This presents no problem, since Python has many other libraries (besides
pandas) that manage the reading and writing of data in XML format.
One of these libraries is lxml, which stands out for its excellent
performance in parsing very large files.
Example:
• In this example you will take the data structure described in the XML file and
convert it directly into a DataFrame.
• To do so, the first thing to do is use the objectify submodule of the lxml
library, importing it in the following way.
>>> from lxml import objectify
>>> xml = objectify.parse('books.xml')
>>> xml
<lxml.etree._ElementTree object at 0x0000000009734E08>
You got an object tree, which is an internal data structure of the lxml module.
Let’s look at this type of object in more detail. To navigate this tree
structure, selecting it element by element, you must first define the root.
You can do this with the getroot() function.
>>> root = xml.getroot()
Now that the root of the structure has been defined, you can access the various
nodes of the tree, each corresponding to a tag contained in the original XML
file.
>>> root.Book.Author
'Ross, Mark'
>>> root.Book.PublishDate
'2014-22-01'
❖ In this way you access nodes individually, but you can also access several
elements at the same time using getchildren(). With this function, you get all
the child nodes of the reference element.
>>> root.getchildren()
[<Element Book at 0x9c66688>, <Element Book at 0x9c66e08>]
❖ With the tag attribute you get the name of the tag corresponding to the child
node.
>>> [child.tag for child in root.Book.getchildren()]
['Author', 'Title', 'Genre', 'Price', 'PublishDate']
❖ With the text attribute you get the value contained between the corresponding
tags.
>>> [child.text for child in root.Book.getchildren()]
['Ross, Mark', 'XML Cookbook', 'Computer', '23.56', '2014-22-01']
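At this point you have everything needed to build the DataFrame. A sketch of one way to do the conversion (the helper name etree2df is our own):
>>> import pandas as pd
>>> def etree2df(root):
...     # use the tags of the first Book's children as column names
...     column_names = [child.tag for child in root.getchildren()[0].getchildren()]
...     xmlframe = pd.DataFrame(columns=column_names)
...     # one row per Book element, filled with the text of its children
...     for j, book in enumerate(root.getchildren()):
...         xmlframe.loc[j] = [child.text for child in book.getchildren()]
...     return xmlframe
>>> etree2df(root)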
Reading and Writing Data on Microsoft Excel Files
❖ pandas provides specific functions to read and write data in Excel files;
the I/O API functions for this purpose are:
• to_excel()
• read_excel()
❖ The read_excel() function is able to read both Excel 2003 (.xls) files and Excel
2007 (.xlsx) files.
Example:
First, open an Excel file and enter the data as shown below. Copy the data
into sheet1 and sheet2, then save the file as data.xls.
❖ To read the data contained in the XLS file and convert it into a
DataFrame, you only have to use the read_excel() function.
❖ As you can see, by default the returned DataFrame is composed of the data
tabulated in the first sheet.
❖ If, instead, you need to load the data in the second sheet, specify the
name of the sheet or the number (index) of the sheet as the second
argument.
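A sketch of the three variants (the sheet names are the Excel defaults, an assumption here; reading .xls files also requires the xlrd package):
>>> import pandas as pd
>>> pd.read_excel('data.xls')             # first sheet by default
>>> pd.read_excel('data.xls', 'Sheet2')   # select a sheet by name
>>> pd.read_excel('data.xls', 1)          # or by its index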
❖ To write a DataFrame to a spreadsheet in an Excel file, you write as follows.
>>> frame = pd.DataFrame(np.random.random((4,4)),
index = ['exp1','exp2','exp3','exp4'],
columns = ['Jan2015','Feb2015','Mar2015','Apr2015'])
>>> frame.to_excel('data2.xlsx')
❖ In the working directory you will now find a new Excel file, data2.xlsx,
containing the data.
JSON Data
❖ JSON (JavaScript Object Notation) has become one of the most common
standard formats, especially for the transmission of data through the Web.
❖ So it is normal to have to deal with this data format whenever you want to
use data available on the Web.
Step-1:
❖ A DataFrame can be written to a JSON file with the to_json() function,
passing the name of the output file as an argument.
❖ The converse is also possible, using read_json()
with the name of the file to read passed as an argument.
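A sketch of both directions, with a frame of our choosing:
>>> import numpy as np
>>> import pandas as pd
>>> frame = pd.DataFrame(np.arange(16).reshape(4,4),
...                      index=['white','black','red','blue'],
...                      columns=['up','down','right','left'])
>>> frame.to_json('frame.json')    # write the DataFrame as JSON
>>> pd.read_json('frame.json')     # read it back into a DataFrame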
❖ Generally, however, JSON files do not have a tabular structure. Thus, you will
need to somehow convert the dict structure into tabular form.
❖ The pandas library provides a function, called json_normalize(), that is able to
convert a dict or a list into a table.
❖ First you have to import the function
>>> from pandas.io.json import json_normalize
❖ Then write a JSON file with any text editor and save it in the
working directory as books.json.
❖ As you can see, the file structure is no longer tabular, but more complex. The
approach with the read_json() function is therefore no longer valid.
❖ You can still get the data in tabular form from this structure.
❖ First you have to load the contents of the JSON file and convert them into a string.
❖ Now you are ready to apply the json_normalize() function, as sketched below.
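A sketch, assuming books.json holds a list of records, each with a nested list under a 'books' key (the key names 'books', 'writer', and 'nationality' are only assumptions for illustration):
>>> import json
>>> from pandas.io.json import json_normalize
>>> text = open('books.json', 'r').read()
>>> data = json.loads(text)
>>> json_normalize(data, 'books')                             # one row per book
>>> json_normalize(data, 'books', ['writer', 'nationality'])  # carry the parent keys too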
Pickle—Python Object Serialization
❖ The pickle module implements a powerful algorithm for serialization and
de-serialization of a data structure implemented in Python.
❖ Pickling is the process in which the hierarchy of an object is converted into a
stream of bytes.
❖ In Python, the pickling operation is carried out by the pickle module, but
there is also a module called cPickle, which is the result of an enormous
amount of work optimizing the pickle module (it is written in C).
❖ This module can in fact be, in many cases, even 1,000 times faster than the
pickle module.
Serialize a Python Object with cPickle
❖ The data format used by the pickle module (or cPickle) is specific to Python.
❖ By default, an ASCII representation is used, so that the serialized data are
readable from the human point of view.
❖ Then, opening the file with a text editor, you may be able to understand its contents.
❖ To use this module you must first import it
>>> import cPickle as pickle
❖ Then create an object sufficiently complex to have an internal data structure, for
example a dict object.
>>> data = { 'color': ['white','red'], 'value': [5, 7]}
❖ Now you will perform a serialization of the data object through the dumps()
function of the cPickle module.
>>> pickled_data = pickle.dumps(data)
❖ Now, to see how the dict object was serialized, you need to look at the
contents of the pickled_data variable.
❖ Once the data are serialized, they can easily be written to a file, or sent over a
socket, pipe, etc., as sketched below.
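For example, a sketch of writing the serialized object to a file and reading it back (the file name data.pkl is of our choosing):
>>> with open('data.pkl', 'wb') as f:
...     pickle.dump(data, f)     # serialize straight to the file
>>> with open('data.pkl', 'rb') as f:
...     pickle.load(f)           # deserialize back into a dict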
Pickling with pandas
❖ As regards pickling (and unpickling) with the pandas library, everything is
much easier: there is no need to import the cPickle module into the Python
session, and the whole operation is performed implicitly.
❖ Also, the serialization format used by pandas is not completely ASCII.
>>> frame = pd.DataFrame(np.arange(16).reshape(4,4), index =
['up','down','left','right'])
>>> frame.to_pickle('frame.pkl')
❖ Now in your working directory there is a new file called frame.pkl containing all
the information about the frame DataFrame.
❖ To open a PKL file and read its contents, simply use the following command:
>>> pd.read_pickle('frame.pkl')