2. Reading & Writing Data
I/O API Tools
The pandas library provides two main categories of functions for reading and writing
data, collectively known as the I/O API:
1. Readers
2. Writers
Readers Writers
read_csv to_csv
read_excel to_excel
read_hdf to_hdf
read_sql to_sql
read_json to_json
read_html to_html
read_stata to_stata
read_clipboard to_clipboard
read_pickle to_pickle
read_msgpack to_msgpack (experimental)
read_gbq to_gbq (experimental)
3. CSV and Textual Files
If the values in a row are separated by a comma, you have the CSV
(comma-separated values) format.
Other tabular data, separated by spaces or tabs, is typically contained in
text files of various types (generally with the extension .txt).
pandas provides a set of functions specific to this type of file:
• read_csv
• read_table
• to_csv
4. Reading Data in CSV or Text Files
The most common operation for a person approaching data analysis is to read
the data contained in a CSV file, or at least in a text file.
myCSV_01.csv
white,red,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
Since this file is comma-delimited, you can use the read_csv() function to
read its content and convert it at the same time into a DataFrame object.
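A minimal sketch of this call (the variable name csvframe is illustrative):
>>> import pandas as pd
>>> csvframe = pd.read_csv('myCSV_01.csv')
>>> csvframe
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse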
CSV files are tabulated data in which the values in the same row are
separated by commas. But since CSV files are considered text files, you can
also use the read_table() function, provided you specify the delimiter.
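An equivalent sketch with read_table(), passing the comma through the sep option:
>>> pd.read_table('myCSV_01.csv', sep=',')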
You can notice that in this CSV file, the headers that identify the columns are in
the first row. But this is not the general case; it often happens that the tabulated
data begin directly on the first line. In that case you can set the header option to
None, and pandas will assign default numeric names to the columns.
In addition, you can also specify the column names directly, by
assigning a list of labels to the names option.
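A sketch of both options, assuming a file myCSV_02.csv with the same rows as above but without a header line (the file name is illustrative, since that file is not shown here):
>>> pd.read_csv('myCSV_02.csv', header=None)
>>> pd.read_csv('myCSV_02.csv', names=['white','red','blue','green','animal'])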
In more complex cases, in which you want to create a DataFrame with a
hierarchical index by reading a CSV file, you can extend the functionality of
the read_csv() function by adding the index_col option, assigning to it all the
columns to be converted into indexes (see the sketch after the file listing below).
myCSV_03.csv
color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4
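A sketch of reading this file with a two-level hierarchical index:
>>> pd.read_csv('myCSV_03.csv', index_col=['color','status'])
              item1  item2  item3
color status
black up          3      4      6
      down        2      6      7
white up          5      5      5
      down        3      3      2
      left        1      2      1
red   up          2      2      2
      down        1      1      4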
9. Using RegExp for Parsing TXT Files
In other cases, it is possible that the files on which to parse the data do not
have separators as well defined as a comma or a semicolon.
In these cases, regular expressions come to our aid.
In fact, you can specify a regexp within the read_table() function using the sep
option.
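For example, a file whose values are separated by a variable number of spaces or tabs can be parsed using the wildcard \s+ (one or more whitespace characters) as the separator. A sketch (the file name myTXT_01.txt is illustrative):
>>> pd.read_table('myTXT_01.txt', sep=r'\s+', engine='python')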
Usually, you think of separators as special characters like commas,
spaces, tabs, etc., but in reality the separator characters could also be
alphanumeric characters or, for example, digits such as 0.
In the next example, you need to extract the numeric part from a TXT file in
which the numerical values and the literal characters are completely fused
together.
Remember to set the header option to None whenever the column
headings are not present in the TXT file.
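A sketch under these assumptions: suppose a file myTXT_02.txt (both name and contents are illustrative) holds the lines 000END123AAA122, 001END124BBB321 and 002END125CCC333; the regexp \D+ (one or more non-digit characters) then acts as the separator:
>>> pd.read_table('myTXT_02.txt', sep=r'\D+', header=None, engine='python')
   0    1    2
0  0  123  122
1  1  124  321
2  2  125  333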
Another fairly common need is to exclude lines from parsing. In fact, you do not
always want to include the headers or the unnecessary comments contained within a
file.
With the skiprows option you can exclude all the lines you want, simply by assigning
to it an array containing the line numbers not to be considered in parsing.
If you want to exclude the first five lines, you write skiprows = 5,
but if you want to rule out only the fifth line, you write skiprows = [5].
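A sketch of skipping specific lines, for example an initial title line and two comment lines inside the file (the file name and the line numbers are illustrative):
>>> pd.read_table('myTXT_03.txt', sep=',', skiprows=[0, 1, 3, 6])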
12. Reading TXT Files into Parts or Partially
When large files are processed, or when you are only interested in portions of
these files, you often need to read the file in portions (chunks).
So if, for example, you want to read only a portion of the file, you can explicitly
specify the number of lines on which to parse.
Combining the nrows and skiprows options, you can select the starting line n
(skiprows = n) and the number of lines to be read after it (nrows = i).
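For example, on myCSV_01.csv you can skip the header plus the first two data rows and read the next three rows (a sketch):
>>> pd.read_csv('myCSV_01.csv', skiprows=3, nrows=3, header=None)
   0  1  2  3      4
0  3  3  6  7  horse
1  2  2  8  3   duck
2  4  4  2  1  mouse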
If you want to write the data contained in a DataFrame to a CSV file, you
use the to_csv() function, which accepts as an argument the name of the file to
generate.
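A minimal sketch (the variable name, the values, and the file name are illustrative):
>>> import numpy as np
>>> frame = pd.DataFrame(np.arange(16).reshape(4,4),
...                      index=['red','blue','yellow','white'],
...                      columns=['ball','pen','pencil','paper'])
>>> frame.to_csv('myCSV_04.csv')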
As you can see from the previous example, when you write a DataFrame to a
file, by default both the index and the column headers are written to the
file. This default behavior can be changed by setting the two options index and
header to False.
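A sketch of writing only the data values:
>>> frame.to_csv('myCSV_05.csv', index=False, header=False)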
One thing to take into account when writing files is that the NaN values
present in a data structure appear as empty fields in the file.
You can replace these empty fields with a value of your liking using the na_rep
option of the to_csv() function.
Common values may be NULL, 0, or NaN itself.
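A sketch, using a DataFrame that contains only NaN values (the names are illustrative):
>>> frame3 = pd.DataFrame(index=['blue','green','red'],
...                       columns=['ball','mug','pen'])
>>> frame3.to_csv('myCSV_06.csv', na_rep='NULL')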
17. Reading and Writing HTML Files
With regard to the HTML format, pandas provides the corresponding pair of I/O
API functions:
• read_html()
• to_html()
Writing Data in HTML
Let us see how to convert a DataFrame into an HTML table.
The internal structure of the DataFrame is automatically converted into nested
<TH>, <TR>, and <TD> tags, retaining any internal hierarchies.
Example:
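A minimal sketch: to_html() returns the <table> markup as a string.
>>> frame = pd.DataFrame(np.arange(4).reshape(2,2))
>>> print(frame.to_html())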
To write an entire HTML page, you can generate it as a string.
❖ First of all, create a string that contains the code of the HTML page.
❖ Then write the contents of the string html directly to a file that will
be called myFrame.html.
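A sketch of these two steps (the example DataFrame and the surrounding markup are illustrative):
>>> frame = pd.DataFrame(np.random.random((4,4)),
...                      index=['white','black','red','blue'],
...                      columns=['up','down','right','left'])
>>> s = ['<HTML>']
>>> s.append('<HEAD><TITLE>My DataFrame</TITLE></HEAD>')
>>> s.append('<BODY>')
>>> s.append(frame.to_html())
>>> s.append('</BODY></HTML>')
>>> html = ''.join(s)
>>> with open('myFrame.html', 'w') as html_file:
...     html_file.write(html)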
❖ Now there will be a new HTML file, myFrame.html, in your working directory.
Double-click it to open it directly in the browser; the HTML table will appear in
the upper left corner of the page.
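Reading works in the opposite direction: read_html() parses the tables found in an HTML source and returns them as a list of DataFrames. A sketch using the file just written:
>>> web_frames = pd.read_html('myFrame.html')
>>> web_frames[0]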
22. Reading Data from XML
In the list of I/O API functions, there is no specific tool regarding the XML
(Extensible Markup Language) format.
Although it is not listed, this format is very important, because a great deal of
structured data is available in XML format.
This presents no problem, since Python has many other libraries (besides
pandas) that manage the reading and writing of data in XML format.
One of these libraries is the lxml library, which stands out for its excellent
performance during the parsing of very large files.
Example:
• In this example you will take the data structure described in the XML file and
convert it directly into a DataFrame.
• To do so, the first thing to do is use the objectify sub-module of the lxml library,
importing it in the following way.
>>> from lxml import objectify
>>> xml = objectify.parse('books.xml')
>>> xml
<lxml.etree._ElementTree object at 0x0000000009734E08>
You got an object tree, which is an internal data structure of the lxml module.
Let us look at this type of object in more detail. To navigate this tree structure,
so as to select its elements one by one, you must first define the root. You can
do this with the getroot() function.
>>> root = xml.getroot()
Now that the root of the structure has been defined, you can access the various
nodes of the tree, each corresponding to the tag contained within the original XML
file.
>>> root.Book.Author
'Ross, Mark'
>>> root.Book.PublishDate
'2014-22-01'
❖ In this way you access nodes individually, but you can access several elements
at the same time using getchildren(). With this function, you get all the child
nodes of the reference element.
>>> root.getchildren()
[<Element Book at 0x9c66688>, <Element Book at 0x9c66e08>]
❖ With the tag attribute you get the name of the tag corresponding to the child
node.
>>> [child.tag for child in root.Book.getchildren()]
['Author', 'Title', 'Genre', 'Price', 'PublishDate']
❖ With the text attribute you get the value contained between the corresponding
tags.
>>> [child.text for child in root.Book.getchildren()]
['Ross, Mark', 'XML Cookbook', 'Computer', '23.56', '2014-22-01']
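Finally, to build the DataFrame announced above, one possible approach (a sketch, not the only way) is to loop over the Book elements and turn each one into a row:
>>> columns = [child.tag for child in root.Book.getchildren()]
>>> rows = [[child.text for child in book.getchildren()]
...         for book in root.getchildren()]
>>> books_frame = pd.DataFrame(rows, columns=columns)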
25. Reading and Writing Data on Microsoft Excel Files
❖ pandas provides specific functions to read and write data in Excel files, and the
I/O API functions provided for this purpose are:
• to_excel()
• read_excel()
❖ The read_excel() function is able to read both Excel 2003 (.xls) files and Excel
2007 (.xlsx) files.
Example:
First, open an Excel file and enter some data, copying the same data into Sheet1
and Sheet2. Then save it as data.xls.
❖ To read the data contained within the XLS file and obtain the conversion into a
DataFrame, you only have to use the read_excel() function.
❖ As you can see, by default, the returned DataFrame is composed of the data
tabulated in the first sheet.
❖ If, however, you need to load the data from the second sheet, specify
the name of the sheet or the number of the sheet (its index) as the second
argument.
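A sketch of the three calls:
>>> pd.read_excel('data.xls')
>>> pd.read_excel('data.xls', 'Sheet2')
>>> pd.read_excel('data.xls', 1)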
❖ To convert a DataFrame into an Excel spreadsheet, you write, for example, as follows.
>>> frame = pd.DataFrame(np.random.random((4,4)),
index = ['exp1','exp2','exp3','exp4'],
columns = ['Jan2015','Feb2015','Mar2015','Apr2015'])
>>> frame.to_excel('data2.xlsx')    # the output file name is illustrative
❖ In the working directory you will then find a new Excel file containing the data
of the DataFrame.
29. JSON Data
❖ JSON (JavaScript Object Notation) has become one of the most common
standard formats, especially for the transmission of data through the Web.
❖ So it is normal to have to deal with this data format if you want to use the
data available on the Web.
Step-1: write a DataFrame to a JSON file with the to_json() function, passing the name of the output file as an argument.
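A minimal sketch (the DataFrame and the file name are illustrative):
>>> frame = pd.DataFrame(np.arange(16).reshape(4,4),
...                      index=['white','black','red','blue'],
...                      columns=['up','down','right','left'])
>>> frame.to_json('frame.json')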
❖ The converse is also possible, using the read_json() function
with the name of the file passed as an argument.
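A sketch of reading the file back into a DataFrame:
>>> pd.read_json('frame.json')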
❖ Generally, however, JSON files do not have a tabular structure. Thus, you will
need to somehow convert the dict structure of the file into tabular form.
❖ The pandas library provides a function, called json_normalize(), that is able to
convert a dict or a list into a table.
❖ First you have to import the function:
>>> from pandas.io.json import json_normalize
❖ Then write a JSON file with a nested structure using any text editor and save it in
the working directory as books.json.
❖ As you can see, the file structure is no longer tabular, but more complex, so the
approach with the read_json() function is no longer valid.
❖ You can still get the data in tabular form from this structure.
❖ First you have to load the contents of the JSON file and convert them into a string.
❖ Now you are ready to apply the json_normalize() function.
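A minimal sketch, assuming books.json contains a list of writers, each with a nested list of books (this structure and its keys are illustrative, since the file contents are not reproduced here):
>>> import json
>>> with open('books.json', 'r') as f:
...     text = f.read()         # the file contents as a string
>>> data = json.loads(text)     # convert the string into Python lists/dicts
>>> json_normalize(data, 'books', ['writer', 'nationality'])
# 'books' is the record path; 'writer' and 'nationality' are metadata keys
# of this illustrative structure, kept as extra columns in the table.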
33. Pickle—Python Object Serialization
❖ The pickle module implements a powerful algorithm for serialization and
de-serialization of a data structure implemented in Python.
❖ Pickling is the process in which the hierarchy of an object is converted into a
stream of bytes.
❖ In Python, the pickling operation is carried out by the pickle module, but there
is also a module called cPickle, which is the result of an enormous amount of
work spent optimizing the pickle module (it is written in C).
❖ In many cases this module can even be 1,000 times faster than the pure-Python
pickle module. (In Python 3, this optimized implementation is used automatically
by the standard pickle module.)
34. Serialize a Python Object with cPickle
❖ The data format used by the pickle module (or cPickle) is specific to Python.
❖ By default, an ASCII representation is used, so that it is readable from the
human point of view.
❖ By opening the file with a text editor, you may thus be able to understand its contents.
❖ To use this module, you must first import it:
>>> import cPickle as pickle
❖ Then create an object sufficiently complex to have an internal data structure, for
example a dict object.
>>> data = { 'color': ['white','red'], 'value': [5, 7]}
❖ Now you will perform the serialization of the data object through the dumps()
function of the cPickle module.
>>> pickled_data = pickle.dumps(data)
❖ Now, to see how the dict object was serialized, you need to look at the
contents of the pickled_data variable.
❖ Once the data have been serialized, they can easily be written to a file, sent over a
socket, a pipe, etc.
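A sketch of inspecting the serialized string and reconstructing the object with the inverse operation, loads() (the variable name nframe is illustrative):
>>> print(pickled_data)
>>> nframe = pickle.loads(pickled_data)   # rebuilds the original dict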
36. Pickling with pandas
❖ As regards the operations of pickling (and unpickling) with the pandas library,
everything is much easier: there is no need to import the cPickle module into the
Python session, and the whole operation is performed implicitly.
❖ Also, the serialization format used by pandas is not completely ASCII.
>>> frame = pd.DataFrame(np.arange(16).reshape(4,4), index =
['up','down','left','right'])
>>> frame.to_pickle('frame.pkl')
❖ Now in your working directory there is a new file called frame.pkl containing all
the information about the frame DataFrame.
❖ To open the PKL file and read its contents, simply use the following command:
>>> pd.read_pickle('frame.pkl')