Files (file streams, file processing, diagnosing stream problems).txt



If we use the notion of a canonical file name (a name which uniquely defines the location of the file regardless of its level in the directory tree) we can realize that these names look different in Windows and in Unix/Linux:

In addition, Unix/Linux system file names are case-sensitive. Windows systems store the case of letters used in the file name, but don't distinguish between their cases at all.

The main and most striking difference is that you have to use two different separators for the directory names: \ in Windows, and / in Unix/Linux.

Fortunately, there is also one more solution. Python is smart enough to be able to convert slashes into backslashes each time it discovers that it's required by the OS.

Any program written in Python (and not only in Python, because that convention applies to virtually all programming languages) does not communicate with the files directly, but through some abstract entities that are named differently in different languages or environments – the most-used terms are handles or streams (we'll use them as synonyms here).

To connect (bind) the stream with the file, it's necessary to perform an explicit operation.

The operation of connecting the stream with a file is called opening the file, while disconnecting this link is named closing the file.

The very first operation performed on the stream is always open and the last one is close. The program, in effect, is free to manipulate the stream between these two events and to handle the associated file.

There are two basic operations performed on the stream:

1-read from the stream: the portions of the data are retrieved from the file and placed in a memory area managed by the program (e.g., a variable);
2-write to the stream: the portions of the data from the memory (e.g., a variable) are transferred to the file.


There are three basic modes used to open the stream:

1-read mode: a stream opened in this mode allows read operations only; trying to write to the stream will cause an exception (the exception is named UnsupportedOperation, which inherits OSError and ValueError, and comes from the io module);
2-write mode: a stream opened in this mode allows write operations only; attempting to read the stream will cause the exception mentioned above;
3-update mode: a stream opened in this mode allows both writes and reads.


Python assumes that every file is hidden behind an object of an adequate class.

An object of an adequate class is created when you open the file and annihilate it at the time of closing.

Between these two events, you can use the object to specify what operations should be performed on a particular stream. 

The operations you're allowed to use are imposed by the way in which you've opened the file.


In general, the object comes from one of the classes shown here:
-IOBase
-RawIOBase
-BufferedIOBase
-TextIOBase

The only way you obtain them is to invoke the function named open().
If you want to get rid of the object, you invoke the method named close().

Due to the type of the stream's contents, all the streams are divided into text and binary streams.

The text streams are structured in lines; that is, they contain typographical characters (letters, digits, punctuation, etc). arranged in rows (lines), as seen with the naked eye when you look at the contents of the file in the editor.

This file is written (or read) mostly character by character, or line by line.

The binary streams don't contain text but a sequence of bytes of any value. This sequence can be, for example, an executable program, an image, an audio or a video clip, a database file, etc.

Because these files don't contain lines, the reads and writes relate to portions of data of any size. Hence the data is read/written byte by byte, or block by block, where the size of the block usually ranges from one to an arbitrarily chosen value.

In Unix/Linux systems, the line ends are marked by a single character named LF (ASCII code 10) designated in Python programs as \n.

Other operating systems, especially these derived from the prehistoric CP/M system (which applies to Windows family systems, too) use a different convention: the end of line is marked by a pair of characters, CR and LF (ASCII codes 13 and 10) which can be encoded as \r\n.

Since portability issues were (and still are) very serious, a decision was made to definitely resolve the issue in a way that doesn't engage the developer's attention.

the mechanism is completely transparent to the program, which can be written as if it was intended for processing Unix/Linux text files only; the source code run in a Windows environment will work properly, too;

The opening of the stream is performed by a function which can be invoked in the following way:
stream = open(file, mode = 'r', encoding = None)


Opening the streams: modes
--------------------------
r open mode: read
the stream will be opened in read mode;
the file associated with the stream must exist and has to be readable, otherwise the open() function raises an exception.

w open mode: write
the stream will be opened in write mode;
the file associated with the stream doesn't need to exist; if it doesn't exist it will be created; if it exists, it will be truncated to the length of zero (erased); if the creation isn't possible (e.g., due to system permissions) the open() function raises an exception.

a open mode: append
the stream will be opened in append mode;
the file associated with the stream doesn't need to exist; if it doesn't exist, it will be created; if it exists the virtual recording head will be set at the end of the file (the previous content of the file remains untouched).

r+ open mode: read and update
the stream will be opened in read and update mode;
the file associated with the stream must exist and has to be writeable, otherwise the open() function raises an exception;
both read and write operations are allowed for the stream.

w+ open mode: write and update
the stream will be opened in write and update mode;
the file associated with the stream doesn't need to exist; if it doesn't exist, it will be created; the previous content of the file remains untouched;
both read and write operations are allowed for the stream.


If there is a letter b at the end of the mode string, it means that the stream is to be opened in binary mode.
If the mode string ends with a letter t, the stream is opened in text mode.


The successful opening of a file will set the current file position (the virtual reading/writing head) before the first byte of the file if the mode is not a and after the last byte of the file if the mode is set to a.


The IOError object is equipped with a property named errno (the name comes from the phrase error number) and you can access it as follows:

try:
    # Some stream operations.
except IOError as exc:
    print(exc.errno)


Let's take a look at some selected constants useful for detecting stream errors:

errno.EACCES → Permission denied
The error occurs when you try, for example, to open a file with the read only attribute for writing.

errno.EBADF → Bad file number
The error occurs when you try, for example, to operate with an unopened stream.

errno.EEXIST → File exists
The error occurs when you try, for example, to rename a file with its previous name.

errno.EFBIG → File too large
The error occurs when you try to create a file that is larger than the maximum allowed by the operating system.

errno.EISDIR → Is a directory
The error occurs when you try to treat a directory name as the name of an ordinary file.

errno.EMFILE → Too many open files
The error occurs when you try to simultaneously open more streams than acceptable for your operating system.

errno.ENOENT → No such file or directory
The error occurs when you try to access a non-existent file/directory.

errno.ENOSPC → No space left on device
The error occurs when there is no free space on the media.


Fortunately, there is a function that can dramatically simplify the error handling code.
Its name is strerror(), and it comes from the os module and expects just one argument – an error number.
Its role is simple: you give an error number and get a string describing the meaning of the error.


Reading a text file's contents can be performed using several different methods – none of them is any better or worse than any other. It's up to you which of them you prefer.

The most basic of these methods is the one offered by the read() function, which you were able to see in action in the previous lesson.

If applied to a text file, the function is able to:

read a desired number of characters (including just one) from the file, and return them as a string;
read all the file contents, and return them as a string;
if there is nothing more to read (the virtual reading head reaches the end of the file), the function returns an empty string.


If you want to treat the file's contents as a set of lines, not a bunch of characters, the readline() method will help you with that.
The method tries to read a complete line of text from the file, and returns it as a string in the case of success. Otherwise, it returns an empty string.


Another method, which treats text file as a set of lines, not characters, is readlines().
The readlines() method, when invoked without arguments, tries to read all the file contents, and returns a list of strings, one element per file line.


The method is named write() and it expects just one argument – a string that will be transferred to an open file (don't forget – open mode should reflect the way in which the data is transferred – writing a file opened in read mode won't succeed).

No newline character is added to the write()'s argument, so you have to add it yourself if you want the file to be filled with a number of lines.


ByteArray

Amorphous data cannot be stored using any of the previously presented means – they are neither strings nor lists.

There should be a special container able to handle such data.

Python has more than one such container – one of them is a specialized class name bytearray – as the name suggests, it's an array containing (amorphous) bytes.

data = bytearray(10)
Such an invocation creates a bytearray object able to store ten bytes.

Bytearrays resemble lists in many respects. For example, they are mutable, they're a subject of the len() function, and you can access any of their elements using conventional indexing.

There is one important limitation – you mustn't set any byte array elements with a value which is not an integer (violating this rule will cause a TypeError exception) and you're not allowed to assign a value that doesn't come from the range 0 to 255 inclusive (unless you want to provoke a ValueError exception).


Reading from a binary file requires the use of a specialized method name readinto(), as the method doesn't create a new byte array object, but fills a previously created one with the values taken from the binary file

An alternative way of reading the contents of a binary file is offered by the method named read().
Invoked without arguments, it tries to read all the contents of the file into the memory, making them a part of a newly created object of the bytes class.


1. To read a file’s contents, the following stream methods can be used:

read(number) – reads the number characters/bytes from the file and returns them as a string; is able to read the whole file at once;
readline() – reads a single line from the text file;
readlines(number) – reads the number lines from the text file; is able to read all lines at once;
readinto(bytearray) – reads the bytes from the file and fills the bytearray with them;

2. To write new content into a file, the following stream methods can be used:

write(string) – writes a string to a text file;
write(bytearray) – writes all the bytes of bytearray to a file;

3. The open() method returns an iterable object which can be used to iterate through all the file's lines inside a for loop. For example:

for line in open("file", "rt"):
    print(line, end='')
 
The code copies the file's contents to the console, line by line. Note: the stream closes itself automatically when it reaches the end of the file.