3. 3
INTRODUCTION
Tests
• Files are stored on auxiliary or secondary storage devices. The two most
common forms of secondary storage are disk and tape. Files in secondary
storage can be both read from and written to.
• For our purposes, a file is a collection of data records in which each record
consists of one or more fields.
• When we design a file, the important issue is how we will retrieve
information (a specific record) from the file. Sometimes we need to process
records one after another, whereas sometimes we need to access a specific
record quickly without retrieving the preceding records. The access method
determines how records can be retrieved: sequentially or randomly.
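As a minimal sketch of the two access methods, the following uses an in-memory byte buffer in place of a disk file. The record layout (account number, name, balance) and all function names are assumptions made for illustration only.

```python
import io
import struct

# Hypothetical fixed-length record: 4-byte account number, 10-byte name, 8-byte balance
RECORD = struct.Struct("<i10sd")

def make_file(rows):
    """Pack each (account, name, balance) row into one fixed-length record."""
    buf = io.BytesIO()
    for account, name, balance in rows:
        buf.write(RECORD.pack(account, name.encode("ascii"), balance))
    buf.seek(0)
    return buf

def read_sequential(f):
    """Yield every record in order, from the first to the last."""
    f.seek(0)
    while chunk := f.read(RECORD.size):
        yield RECORD.unpack(chunk)

def read_random(f, position):
    """Jump straight to record number `position` (0-based), skipping earlier records."""
    f.seek(position * RECORD.size)
    return RECORD.unpack(f.read(RECORD.size))

f = make_file([(101, "Ada", 50.0), (102, "Alan", 75.5), (103, "Grace", 20.0)])
all_accounts = [rec[0] for rec in read_sequential(f)]
third = read_random(f, 2)
```

Random access works here only because every record has the same length, so a record's position can be converted into a byte offset.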
4. 4
A taxonomy of file structures
Sequential access
• If we need to access a file sequentially (that is, one record after another,
from beginning to end), we use a sequential file structure.
Random access
• If we need to access a specific record without having to retrieve all records
before it, we use a file structure that allows random access. Two file
structures allow this: indexed files and hashed files.
5. 5
SEQUENTIAL FILES
• A sequential file is one in which records can only be accessed one after
another from beginning to end. Records are stored one after another in
auxiliary storage, such as tape or disk, and there is an EOF (end-of-file)
marker after the last record.
• The operating system has no information about the record addresses; it only
knows where the whole file is stored. The only thing known to the operating
system is that the records are sequential.
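The read-until-EOF loop can be sketched as follows. Here `io.StringIO` stands in for a file on disk or tape, and the comma-separated record format is an assumption of this example.

```python
import io

def process_sequentially(f):
    """Read records one after another until end-of-file is reached."""
    processed = []
    while True:
        line = f.readline()
        if line == "":              # readline() returns an empty string at EOF
            break
        processed.append(line.rstrip("\n").split(","))
    return processed

# Three records, each with two fields (hypothetical sample data)
f = io.StringIO("101,Ada\n102,Alan\n103,Grace\n")
records = process_sequentially(f)
```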
7. 7
Updating sequential files
• Sequential files must be updated periodically to reflect changes in
information. The updating process is very involved because all the records
need to be checked and updated (if necessary) sequentially.
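One common way to organize such an update is to merge the existing file with a sorted batch of changes into a new file. The sketch below assumes this arrangement; the names `old_master` and `transactions`, the operation codes, and the rule of at most one transaction per key are all assumptions, not details given above.

```python
def update_sequential(old_master, transactions):
    """Merge a key-sorted master list with key-sorted transactions.

    old_master:   list of (key, data) tuples, sorted by key
    transactions: list of (op, key, data) tuples, sorted by key, where op is
                  "A" (add), "C" (change) or "D" (delete)
    Returns (new_master, errors).
    """
    new_master, errors = [], []
    i = 0
    for op, key, data in transactions:
        # Copy untouched master records that come before this transaction key
        while i < len(old_master) and old_master[i][0] < key:
            new_master.append(old_master[i])
            i += 1
        found = i < len(old_master) and old_master[i][0] == key
        if op == "A" and not found:
            new_master.append((key, data))
        elif op == "C" and found:
            new_master.append((key, data))   # write the changed record
            i += 1
        elif op == "D" and found:
            i += 1                           # skip the old record
        else:
            errors.append((op, key))         # report the rejected transaction
    new_master.extend(old_master[i:])        # copy the remaining records unchanged
    return new_master, errors

old = [(1, "a"), (3, "c"), (5, "e")]
tx = [("C", 1, "A1"), ("A", 2, "b"), ("D", 3, None), ("D", 4, None), ("A", 5, "dup")]
new_master, errors = update_sequential(old, tx)
```

Every record of the old file is visited once, even though only a few are changed, which is why the process is costly.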
9. 9
INDEXED FILES
• To access a record in a file randomly, we need to know the address of the
record.
• For example, suppose a customer wants to check their bank account. Neither
the customer nor the teller knows the address of the customer's record. The
customer can only give the teller their account number (key). Here, an
indexed file can relate the account number (key) to the record address.
10. 10
INDEXED FILES
• An indexed file is made of a data file, which is a sequential file, and an index.
• The index itself is a very small file with only two fields: the key of the
sequential file and the address of the corresponding record on the disk.
• The index is sorted based on the key values of the data file.
11. 11
Accessing a record in INDEXED FILES
• The entire index file is loaded into main memory (the file is small and uses
little memory).
• The index entries are searched, using an efficient search algorithm such as a
binary search, to find the desired key.
• The address of the record is retrieved.
• Using the address, the data record is retrieved and passed to the user.
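The four steps above can be sketched as follows. A Python list stands in for the data file, with the list position playing the role of the disk address; the sample keys and names are made up.

```python
from bisect import bisect_left

# Data file: records stored sequentially in arrival order (hypothetical data)
data_file = [
    (379452, "Record A"),
    (123456, "Record B"),
    (160256, "Record C"),
]

# Index: (key, address) pairs sorted by key, small enough to hold in memory
index = sorted((record[0], address) for address, record in enumerate(data_file))
index_keys = [key for key, _ in index]

def fetch(key):
    """Binary-search the index for `key`, then read the record at its address."""
    pos = bisect_left(index_keys, key)
    if pos == len(index_keys) or index_keys[pos] != key:
        return None                  # key is not in the index
    _, address = index[pos]
    return data_file[address]
```

For example, `fetch(123456)` finds address 1 in the index and returns the second record without reading the first.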
12. 12
Inverted files
• One of the advantages of indexed files is that we can have more than one
index, each with a different key.
• For example, an employee file can be retrieved based on either social
security number or last name.
• This type of indexed file is usually called an inverted file.
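A minimal sketch of one data file with two indexes follows. Dictionaries are used in place of sorted index files for brevity, and the field names and sample values are hypothetical. Because a last name may be shared, that index maps each name to a list of addresses.

```python
employees = [
    {"ssn": "000-00-0001", "last_name": "Lee", "dept": "Sales"},
    {"ssn": "000-00-0002", "last_name": "Khan", "dept": "IT"},
    {"ssn": "000-00-0003", "last_name": "Lee", "dept": "HR"},
]

# First index: social security number -> record address
ssn_index = {rec["ssn"]: addr for addr, rec in enumerate(employees)}

# Second index: last name -> list of record addresses
name_index = {}
for addr, rec in enumerate(employees):
    name_index.setdefault(rec["last_name"], []).append(addr)

def by_ssn(ssn):
    """Retrieve the single record with this social security number."""
    return employees[ssn_index[ssn]]

def by_last_name(name):
    """Retrieve every record with this last name."""
    return [employees[addr] for addr in name_index.get(name, [])]
```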
13. 13
HASHED FILES
• In an indexed file, the index maps the key to the address.
• A hashed file uses a mathematical function to accomplish this mapping.
• The user gives the key, the function maps the key to the address and passes
it to the operating system, and the record is retrieved.
14. 14
Direct hashing
• In direct hashing, the key is the data file address without any algorithmic
manipulation.
• The file must therefore contain a record for every possible key. Although
situations suitable for direct hashing are limited, it can be very powerful
because it guarantees that there are no synonyms or collisions.
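As a sketch of direct hashing, assume the keys are the integers 1 to 100 (a made-up range). The file then needs exactly one slot per possible key, and the key is used unchanged as the address.

```python
MAX_KEY = 100                            # hypothetical: keys are the integers 1..100
direct_file = [None] * (MAX_KEY + 1)     # slot 0 is unused so that address == key

def store(key, data):
    """Place the record at the address equal to its key."""
    if not 1 <= key <= MAX_KEY:
        raise KeyError(key)
    direct_file[key] = data

def retrieve(key):
    """Return the record stored at address `key`, or None if the slot is empty."""
    return direct_file[key]

store(5, "record for key 5")
store(100, "record for key 100")
```

Two different keys can never share a slot, which is why no collisions occur, but the file must be as large as the key range even if few keys are in use.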
15. 15
Modulo division hashing
• Also known as division remainder hashing, the modulo division method
divides the key by the file size and uses the remainder plus 1 for the address.
• This gives the simple hashing algorithm that follows, where list_size is the
number of elements in the file. The reason for adding a 1 to the mod
operation result is that our list starts with 1 instead of 0:
address = key mod list_size + 1
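The modulo division rule described above can be sketched as a one-line function. The key 121267 and the list size 307 in the usage note are sample values chosen for illustration.

```python
def modulo_division_hash(key, list_size):
    """Return an address in the range 1..list_size."""
    return key % list_size + 1
```

With a list size of 307, the key 121267 leaves a remainder of 2, so `modulo_division_hash(121267, 307)` gives address 3.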
16. 16
Modulo division hashing
• Although this algorithm works with any list size, a list size that is a prime
number produces fewer collisions than other list sizes. Therefore, whenever
possible, try to make the file size a prime number.
17. 17
Digit extraction hashing
• Using digit extraction hashing, selected digits are extracted from the key and
used as the address.
• For example, using our six-digit employee number to hash to a three-digit
address (000 to 999), we could select the first, third, and fourth digits (from
the left) and use them as the address.
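A short sketch of this selection follows. The function name and its parameters are assumptions; the digit positions (first, third, fourth) are the ones named above.

```python
def digit_extraction_hash(key, positions=(1, 3, 4), width=6):
    """Build an address from the digits of `key` at the given 1-based positions."""
    digits = str(key).zfill(width)       # pad so that every key has `width` digits
    return int("".join(digits[p - 1] for p in positions))
```

For the sample key 379452, the first, third, and fourth digits are 3, 9, and 4, giving address 394.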
18. 18
Collision
• Generally, the population of keys for a hashed list is greater than the number
of records in the data file.
• For example, if we have a file of 50 students for a class in which the students
are identified by the last four digits of their social security number, then
there are 200 possible keys for each element in the file (10000/50).
• Because there are many keys for each address in the file, there is a possibility
that more than one key will hash to the same address in the file. We call the
set of keys that hash to the same address in our list synonyms.
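The count of 200 keys per address can be checked with a few lines. This sketch assumes modulo division hashing as the hash function, which is a choice made here only for illustration.

```python
from collections import Counter

FILE_SIZE = 50                       # 50 student records

def home_address(key):
    """Modulo division hash, assumed here as the hashing method."""
    return key % FILE_SIZE + 1

# Count how many of the 10,000 four-digit keys map to each address
counts = Counter(home_address(key) for key in range(10_000))

# Two sample keys that are synonyms: both hash to the same address
synonyms = (1234, 5684)
```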
19. 19
Collision
• If the actual data that we insert into our list contains two or more synonyms,
we will have collisions.
• A collision is the event that occurs when a hashing algorithm produces an
address for an insertion key, but that address is already occupied.
• The address produced by the hashing algorithm is known as the home
address.
• The part of the file that contains all the home addresses is known as the
prime area.
• When two keys collide at a home address, we must resolve the collision by
placing one of the keys and its data in another location.
20. 20
Collision resolution
Open addressing
• The first collision resolution method, open addressing resolution, resolves
collisions in the prime area.
• When a collision occurs, the prime area addresses are searched for an open
or unoccupied record where the new data can be placed.
• One simple strategy for data that cannot be stored in the home address is to
store it in the next address (home address + 1).
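A minimal sketch of the "next address" strategy follows. The 0-based addresses, the wrap-around at the end of the prime area, and the use of a simple modulo hash are all assumptions of this example.

```python
def insert_open_addressing(table, key, data):
    """Store (key, data) at the home address or at the next free address.

    `table` is a list of slots in the prime area; None marks an unoccupied slot.
    Returns the address that was used.
    """
    size = len(table)
    home = key % size                    # assumed hash: 0-based modulo division
    for step in range(size):
        address = (home + step) % size   # home, home + 1, home + 2, ... with wrap-around
        if table[address] is None:
            table[address] = (key, data)
            return address
    raise OverflowError("prime area is full")

table = [None] * 7
placed = [insert_open_addressing(table, k, "data") for k in (10, 17, 24, 4)]
```

Keys 10, 17, and 24 all have home address 3 and end up in addresses 3, 4, and 5. Key 4 then finds its own home address already taken by a record that did not belong there and is pushed to address 6.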
21. 21
Collision resolution
Linked list resolution
• A major disadvantage of open addressing is that each collision resolution
increases the probability of future collisions.
• This disadvantage is eliminated in another approach to collision resolution,
linked list resolution. In this method, the first record is stored in the home
address, but contains a pointer to the second record.
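The pointer chain can be sketched as follows. The `Record` class, the function names, and the modulo hash are assumptions made for illustration; records after the first one live outside the prime area and are reached only through the pointers.

```python
class Record:
    """A record with a pointer to the next synonym, if any."""
    def __init__(self, key, data):
        self.key, self.data, self.next = key, data, None

def insert_linked(prime_area, key, data):
    """Store the record at its home address, or append it to that address's chain."""
    home = key % len(prime_area)         # assumed hash: 0-based modulo division
    new = Record(key, data)
    if prime_area[home] is None:
        prime_area[home] = new
        return
    node = prime_area[home]
    while node.next is not None:         # walk to the end of the chain
        node = node.next
    node.next = new

def find_linked(prime_area, key):
    """Follow the chain from the home address until the key is found."""
    node = prime_area[key % len(prime_area)]
    while node is not None and node.key != key:
        node = node.next
    return None if node is None else node.data

prime_area = [None] * 7
for k, d in ((10, "a"), (17, "b"), (4, "c")):
    insert_linked(prime_area, k, d)
```

Keys 10 and 17 collide at address 3, but key 4 still lands in its own home address because the synonym was chained rather than placed in a neighbouring slot.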
22. 22
Collision resolution
Bucket hashing
• Another approach to handling the problem of collisions is to hash to buckets.
• A bucket is a node that can accommodate more than one record.
• The disadvantage of this method is that there may be a lot of wasted
(unoccupied) locations.
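A small sketch of bucket hashing follows, with a hypothetical bucket capacity of three records and a modulo hash. It also counts the reserved but unused locations to make the space cost visible.

```python
BUCKET_CAPACITY = 3                      # hypothetical number of records per bucket

def insert_bucket(buckets, key, data):
    """Add the record to its bucket; return False if the bucket is already full."""
    bucket = buckets[key % len(buckets)]
    if len(bucket) >= BUCKET_CAPACITY:
        return False                     # a further resolution method would be needed
    bucket.append((key, data))
    return True

def wasted_locations(buckets):
    """Count reserved locations that hold no record."""
    return sum(BUCKET_CAPACITY - len(b) for b in buckets)

buckets = [[] for _ in range(5)]
results = [insert_bucket(buckets, k, "data") for k in (5, 10, 15, 20)]
```

The first three synonyms fit in bucket 0 without any collision handling, the fourth is rejected, and the other four buckets remain empty.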
23. 23
DIRECTORIES
• Directories are provided by most operating systems for organizing files.
• A directory performs the same function as a folder in a filing cabinet.
However, a directory in most operating systems is represented as a special
type of file that holds information about other files.
• A directory not only serves as a kind of index that tells the operating system
where files are located on an auxiliary storage device, but can also contain
other information about the files it contains, such as who has access to each
file, or the date when each file was created, accessed, or modified.
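As an illustration of the information a directory can supply, the sketch below lists a temporary directory using Python's standard `os.scandir`. The helper name `list_directory` and the sample file names are made up, and the exact attributes available vary by operating system.

```python
import os
import tempfile
import time

def list_directory(path):
    """Return the name and some stored attributes of each entry in a directory."""
    entries = {}
    with os.scandir(path) as it:
        for entry in it:
            info = entry.stat()
            entries[entry.name] = {
                "is_directory": entry.is_dir(),
                "size_bytes": info.st_size,
                "permissions": oct(info.st_mode & 0o777),   # who has access
                "modified": time.ctime(info.st_mtime),
                "accessed": time.ctime(info.st_atime),
            }
    return entries

# Build a small throwaway directory to inspect
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "notes.txt"), "w") as f:
        f.write("hello")
    os.mkdir(os.path.join(root, "sub"))
    listing = list_directory(root)
```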
25. 25
Special directories
Root Directory
• The root directory is the highest level in the file system hierarchy.
• It is the root of the whole file structure, and therefore does not have a
parent directory.
• In a UNIX environment, the root directory always has several levels of
subdirectories. The root directory belongs to the system administrator and
can be changed only by the system administrator.
Home Directory
• We use our home directory when we first log into the system. This contains
any files we create while in it and may contain personal system files.
• Our home directory is also the beginning of our personal directory structure.
Each user has a home directory.
26. 26
Special directories
Working directory
• The working directory (or current directory) is the directory we are "in" at any
point in a user session.
• When we first log in, the working directory is our home directory. If we have
subdirectories, we will most likely move from our home directory to one or
more subdirectories as needed during a session.
• When we change directory, our working directory changes automatically.
Parent directory
• The parent directory is the directory immediately above the working
directory. When we are in our home directory, its parent is one of the system
directories.
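These special directories can be inspected from a program. The sketch below uses Python's standard `os` and `pathlib` modules and a temporary directory; the variable names are illustrative, and the actual paths depend on the system it runs on.

```python
import os
import tempfile
from pathlib import Path

home = os.path.expanduser("~")           # the user's home directory
original = Path.cwd()                    # working directory at the start

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)                        # change directory, like the shell's cd
    working = Path.cwd()                 # the working directory has changed
    moved = os.path.samefile(working, tmp)
    parent = working.parent              # directory immediately above
    os.chdir(original)                   # return before the temporary directory is removed

root = Path(original.anchor)             # top of the hierarchy, "/" on UNIX
```

In `pathlib`, asking for the parent of the root simply returns the root again, which reflects the fact that the root directory has no parent.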
27. 27
Paths and pathnames
• Every directory and file in a file system must have a name. However, a file
in one directory can have the same name as a file in another directory.
• We therefore need more than just the filename to identify a file uniquely: we
need to specify the file's path from the root directory to the file.
• The file's path is specified by its absolute pathname, a list of all directories
separated by a slash character (/).
• This full or absolute pathname can get quite long. For that reason, UNIX also
provides a shorter pathname under certain circumstances, known as a
relative pathname, which is the path relative to the working directory.
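The relationship between the two kinds of pathname can be shown with Python's `pathlib.PurePosixPath`, which manipulates UNIX-style paths without touching the disk. The directory and file names below are hypothetical.

```python
from pathlib import PurePosixPath

working_dir = PurePosixPath("/usr/staff/joan")                  # hypothetical working directory
absolute = PurePosixPath("/usr/staff/joan/reports/q1.txt")      # full path from the root

# The relative pathname is what remains after removing the working directory
relative = absolute.relative_to(working_dir)

# Joining the working directory and the relative pathname gives the absolute one back
rebuilt = working_dir / relative
```

Here `relative` is `reports/q1.txt`, which is shorter but only meaningful while the working directory is `/usr/staff/joan`.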
28. 28
TEXT VERSUS BINARY
• Two terms are used to categorize files: text files and binary files.
• A file stored on a storage device is a sequence of bits that can be interpreted
by an application program as a text file or a binary file.
29. 29
Text files
• A text file is a file of characters. It cannot contain integers, floating-point
numbers, or any other data structures in their internal memory format.
• To store such data in a text file, it must be converted to its character
equivalent format.
• Let's look at an example. When data (a file stream) is sent to the printer, the
printer takes eight bits, interprets them as a byte, and decodes it according to
the encoding system of the printer (ASCII or EBCDIC).
• If the character belongs to the printable category, it will be printed;
otherwise, some other activity takes place, such as printing a space.
• The printer takes the next eight bits and repeats the process. This is done
until the file stream is exhausted.
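The conversion to character format can be sketched in a few lines. The value 1234 is an arbitrary example, and ASCII is assumed as the encoding.

```python
value = 1234

# Convert the integer to its character equivalent: one byte per digit
as_text = str(value).encode("ascii")

# Each byte holds the ASCII code of one digit character
codes = list(as_text)

# Reading the text back requires converting the characters to an integer again
restored = int(as_text.decode("ascii"))
```

The integer is stored as the four characters "1", "2", "3", "4" (codes 49 to 52), not as the number's internal representation.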
30. 30
Binary files
• A binary file is a collection of data stored in the internal format of the
computer.
• In this definition, data can be an integer (including other data types
represented as unsigned integers, such as image, audio, or video), a
floating-point number, or any other structured data (except a file).
• Unlike text files, binary files contain data that is meaningful only if it is
properly interpreted by a program.
• If the data is textual, one byte is used to represent one character. But if the
data is numeric, two or more bytes are considered a data item.
• For example, assume we are using a personal computer that uses two bytes
to store an integer. In this case, when we read or write an integer, two bytes
are interpreted as one integer.
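The two-byte case can be sketched with Python's standard `struct` module. The value 12345 and the little-endian byte order are assumptions chosen for the example.

```python
import struct

value = 12345

# Binary form: a two-byte signed integer, low-order byte first (assumed byte order)
as_binary = struct.pack("<h", value)

# Read as a number, the two bytes together form one integer
as_integer = struct.unpack("<h", as_binary)[0]

# Read as text, the same two bytes are two separate ASCII characters
as_characters = as_binary.decode("ascii")

# Text form of the same value needs one byte per digit
as_text = str(value).encode("ascii")
```

The same two bytes (0x39 and 0x30) mean 12345 when read as an integer but "90" when read as characters, which is why a binary file is meaningful only when the program interprets it correctly.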