diff --git a/_config.yml b/_config.yml index 3d622aa..0715ae9 100644 --- a/_config.yml +++ b/_config.yml @@ -18,11 +18,8 @@ map: - title: Introducing IPython Notebook path: /core/notebook.html caption: A whole new way to work with Python! - - title: Working With Text Files - path: /core/text-files.html - caption: What is a text file? How do we get them in and out of Python? - - title: Working With Strings - path: /core/strings.html + - title: A typical problem -- Analyzing a survey + path: /core/survey.html caption: Once we have our text in Python, what can we do with it? - title: Creating Charts path: /core/charts.html diff --git a/core/strings.md b/core/survey.md similarity index 82% rename from core/strings.md rename to core/survey.md index 0c3d123..6ec608d 100644 --- a/core/strings.md +++ b/core/survey.md @@ -1,13 +1,11 @@ --- layout: ots -title: Working with Strings +title: A typical problem -- Analyzing a survey --- -# A problem - -Now we know how to work with text files, we'll use that knowledge to solve a problem: +# Our very, very important problem Suppose you're a greengrocer, and you run a survey to see what radish varieties your customers prefer the most. You have your assistant type up the survey results into a text file on your computer, so you have 300 lines of survey data in the file [radishsurvey.txt](../files/radishsurvey.txt). Each line consists of a name, a hyphen, then a radish variety: @@ -26,6 +24,8 @@ Suppose you're a greengrocer, and you run a survey to see what radish varieties Radishes radishes radishes +(You may have noticed that this is a very simple file: Unlike a document or web page, there is no formatting whatsoever. It doesn't look pretty, but it has one big advantage: This is the simplest type of text format to work with on a computer, so it is also the easiest to process and analyze.) + You want to know: * What's the most popular radish variety? 
@@ -41,11 +41,12 @@ You want to know: Save the file [radishsurvey.txt](../files/radishsurvey.txt) to your computer. How do we write a program to find out which person voted for each radish preference? -From the previous chapter, we know that we can easily go through the file line by line, and each line will have a value like `"Jin Li - White Icicle\n"`. We also know that we can strip off the trailing newline with the `strip()` method: +We can easily open the file with Python and go through the file line by line. Each line will have a value like `"Jin Li - White Icicle\n"`. Then we can strip off the trailing newline with the `strip()` method. (If you are curious, you can look at the documentation for [open](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) and [strip](https://docs.python.org/3/library/stdtypes.html?highlight=strip#str.strip).) - for line in open("radishsurvey.txt"): - line = line.strip() - # Do something with each line + with open("radishsurvey.txt") as file: + for line in file: + line = line.strip() + # Do something with each line We need a way to split each line into the name and the vote. Thankfully, Python comes with dozens of string methods including one called `split()`. [Have a look at the documentation for split()](http://docs.python.org/3.3/library/stdtypes.html#str.split) and see if you can figure out how to split each line into the name and the vote. @@ -53,12 +54,13 @@ We need a way to split each line into the name and the vote. Thankfully, Python ### Solution - for line in open("radishsurvey.txt"): - line = line.strip() - parts = line.split(" - ") - name = parts[0] - vote = parts[1] - print(name + " voted for " + vote) + with open("radishsurvey.txt") as file: + for line in file: + line = line.strip() + parts = line.split(" - ") + name = parts[0] + vote = parts[1] + print(name + " voted for " + vote) There's a few things going on here, so let's go through it line by line. 
*Walking through a program in your head and thinking about what each line does by itself is a good way to start to understand it* @@ -105,12 +107,13 @@ Use the previous example as a base. You'll need to compare the vote with the str ### Solution - for line in open("radishsurvey.txt"): - line = line.strip() - parts = line.split(" - ") - name, vote = parts - if vote == "White Icicle": - print(name + " likes White Icicle!") + with open("radishsurvey.txt") as file: + for line in file: + line = line.strip() + parts = line.split(" - ") + name, vote = parts + if vote == "White Icicle": + print(name + " likes White Icicle!") You might notice that the code splitting the line has become even shorter here. Instead of assigning each element of parts separately, we can assign them together using a technique called "multiple assignment". The line `name, vote = parts` means to assign each variable to the corresponding item in the list. @@ -159,11 +162,12 @@ Use your previous solution as a base. You'll need a variable to hold the number print("Counting votes for White Icicle...") count = 0 - for line in open("radishsurvey.txt"): - line = line.strip() - name, vote = line.split(" - ") - if vote == "White Icicle": - count = count + 1 + with open("radishsurvey.txt") as file: + for line in file: + line = line.strip() + name, vote = line.split(" - ") + if vote == "White Icicle": + count = count + 1 print(count) @@ -178,11 +182,12 @@ Using your function, can you write a program which counts votes for White Icicle def count_votes(radish): print("Counting votes for " + radish + "...") count = 0 - for line in open("radishsurvey.txt"): - line = line.strip() - name, vote = line.split(" - ") - if vote == radish: - count = count + 1 + with open("radishsurvey.txt") as file: + for line in file: + line = line.strip() + name, vote = line.split(" - ") + if vote == radish: + count = count + 1 return count print(count_votes("White Icicle")) @@ -241,15 +246,16 @@ Remember that for dictionaries 
`counts[vote]` means "the value in `counts` which # with vote counts counts = {} - for line in open("radishsurvey.txt"): - line = line.strip() - name, vote = line.split(" - ") - if vote not in counts: - # First vote for this variety - counts[vote] = 1 - else: - # Increment the vote count - counts[vote] = counts[vote] + 1 + with open("radishsurvey.txt") as file: + for line in file: + line = line.strip() + name, vote = line.split(" - ") + if vote not in counts: + # First vote for this variety + counts[vote] = 1 + else: + # Increment the vote count + counts[vote] = counts[vote] + 1 print(counts) ### Pretty printing @@ -319,17 +325,18 @@ There are lots of functions which could remove the case distinction. `str.lower( # with vote counts counts = {} - for line in open("radishsurvey.txt"): - line = line.strip() - name, vote = line.split(" - ") - # munge the vote string to clean it up - vote = vote.strip().capitalize() - if not vote in counts: - # First vote for this variety - counts[vote] = 1 - else: - # Increment the vote count - counts[vote] = counts[vote] + 1 + with open("radishsurvey.txt") as file: + for line in file: + line = line.strip() + name, vote = line.split(" - ") + # munge the vote string to clean it up + vote = vote.strip().capitalize() + if not vote in counts: + # First vote for this variety + counts[vote] = 1 + else: + # Increment the vote count + counts[vote] = counts[vote] + 1 print(counts) If you're having trouble spotting the difference here, it's @@ -386,24 +393,25 @@ This is just one of many ways to do this: # Create an empty list with the names of everyone who voted voted = [] - for line in open("radishsurvey.txt"): - line = line.strip() - name, vote = line.split(" - ") - # clean up the person's name - name = name.strip().capitalize().replace(" "," ") - # check if this person already voted - if name in voted: - print(name + " has already voted! 
Fraud!") - continue - voted.append(name) - # munge the vote string to clean it up - vote = vote.strip().capitalize().replace(" "," ") - if not vote in counts: - # First vote for this variety - counts[vote] = 1 - else: - # Increment the vote count - counts[vote] += 1 + with open("radishsurvey.txt") as file: + for line in file: + line = line.strip() + name, vote = line.split(" - ") + # clean up the person's name + name = name.strip().capitalize().replace(" "," ") + # check if this person already voted + if name in voted: + print(name + " has already voted! Fraud!") + continue + voted.append(name) + # munge the vote string to clean it up + vote = vote.strip().capitalize().replace(" "," ") + if not vote in counts: + # First vote for this variety + counts[vote] = 1 + else: + # Increment the vote count + counts[vote] += 1 print("Results:") print() @@ -473,15 +481,16 @@ This is just one possible way to break it down: counts[radish] = counts[radish] + 1 - for line in open("radishsurvey.txt"): - line = line.strip() - name, vote = line.split(" - ") - name = clean_string(name) - vote = clean_string(vote) - - if not has_already_voted(name): - count_vote(vote) - voted.append(name) + with open("radishsurvey.txt") as file: + for line in file: + line = line.strip() + name, vote = line.split(" - ") + name = clean_string(name) + vote = clean_string(vote) + + if not has_already_voted(name): + count_vote(vote) + voted.append(name) print("Results:") print() @@ -524,7 +533,7 @@ The loop shown above keeps track of one name, `winner_name`, and the number of v ## Challenge -Can you refactor the part of the program that finds the winner into a function? +Can you extract the part of the program that finds the winner into a function? ## Bigger Challenge @@ -534,4 +543,4 @@ Can you write a winner function that could deal with a tie? ## Next Chapter -When you're done counting radish votes, the next chapter is [Creating Charts](charts.html) +That became complicated pretty quickly, didn't it? 
In the next chapter, we will try an easier way to [analyze the survey using pandas](pandas.html), a Python library designed for data analysis. diff --git a/core/text-files.md b/core/text-files.md deleted file mode 100644 index ad1e5ea..0000000 --- a/core/text-files.md +++ /dev/null @@ -1,280 +0,0 @@ ---- - -layout: ots -title: Working With Text Files - ---- - -# What's a text file? - -A text file is any file containing only readable characters. - -
-Book of Hours, Latin with additions in Middle English; second page of the second Middle English prayer to Christ. Southern Netherlands (probably Bruges), ca. 1440, f.45r -
- -
-Untitled -
- -
-IBM 1403 printout (from the power-of-two program) -
- -A character can be a number like 3 or 6, or a letter of the alphabet like M or p. Taken together, programmers call numbers and letters the set of *alphanumeric* characters. - -Characters also include non-alphanumeric symbols like # or $, or even more exotic symbols like 汉 or Й. Each of these is a single character - -(The last characters in the paragraph above will only appear correctly if your browser is using a font that supports Simplified Chinese and Cyrillic characters, respectively.) - -Symbols in text files can have special meanings, for example Python source code files are a type of plain text file. - -HTML files are another kind of plain text file. Even though HTML tags like <i> or <div> mean special things to a web browser they are still stored in a plain text format that can be viewed in any text editor. - - -# What isn't a text file? - -The opposite of text files, "binary" files are any files where the format isn't made up of readable characters. Binary files can range from image files like JPEGs or GIFs, audio files like MP3s or binary document formats like Word or PDF. - -This section has been a bit dry, so here's a link to a binary GIF file of a kitten. - -The main difference between a text file and a binary file is that binary files need special programs (or knowledge of the special format) to make sense. Text files can be edited by any program that edits plain text, and are easy to process in programming languages like Python. - -# Reading files into Python - -Download this text file, [months.txt](../files/months.txt) containing names of months of the year - - January - February - March - -... etc - -What do you think the following code does, if you run it in the same directory as *months.txt*? - - f = open("months.txt") - print(f.read()) - -Try it out in an IPython Notebook cell. - -### Solution: - -It prints out the contents of the text file. - -## What's really happening here? 
- -* The `open` function creates a *file object* (a way of getting at the contents of the file), which is then stored in the variable `f`. - -* `f.read()` tells the file object to read the full contents of the file, and return it as a string. - -## Reading by smaller pieces - -`read()` can also take an argument, which is the maximum number of characters to read from a file. - -Once `read()` has reached the end of the file, it returns an empty string (zero characters, the string "") - -Can you work out what this code would do? - - f = open("months.txt") - next = f.read(1) - while next != "": - print(next) - next = f.read(1) - -### Solution: -It prints the contents of the file, one character at a time, until the end of the file is reached: - - J - a - n - u - a - r - y - - F - e - b - r - -... and so on - -### Bonus Question #1 - -What is the `while` statement in the above code doing? When does the program exit the while loop? - -Think about the value that the variable `next` has each time the while loop is evaluated. What happens when the end of the file is reached? - -### Bonus Question #2 - -What would happen if you replaced the `read(1)`s in the code above with `read(2)`s? Think about it first, then try it and see what happens! - - -# Reading files line by line - -So far we can read a whole file, or we can read a certain number of characters from a file. How about if you just want to read a single line from a file? - -## How lines are represented - -In text files, lines are broken up by special invisible characters that mark *end of line*. Invisible characters like these are sometimes called *control characters*. - -In Python, the control character for the end of a line is always represented as `\n`. You can use `\n` in a string anywhere that you want to break a line: - - print("I want two lines!\nThe newline character gives me two lines.") - -Produces: - - I want two lines! - The newline character gives me two lines. 
- -Control characters like `\n` date from the days when computers had typewriter style interfaces (see "Teletype machines".) Characters had literal meanings like *Press the Carriage Return Lever* or *Press the Line Feed button*. - -A Teletype Machine - -### Enough about typewriters! - -Yes, back to Python files! To read a file line by line you could just keep reading one character at a time with `.read(1)`, until you run into a newline character `\n`. - -There's an easier way though, which is to use the `.readline()` method in place of `.read()`. - -Have another look at the one-character-per-line code example from earlier in this chapter. Can you modify it to read from the file line by line instead of character by character? - -Hint: In IPython Notebook, you can copy a whole cell by choosing Edit -> Copy Cell. That way you can keep the character-per-line example unmodified and create a new cell for the line by line code. - -### Solution: - - f = open("months.txt") - next = f.readline() - while next != "": - print(next) - next = f.readline() - -How's that look? Something's not quite right, is it? - - January - - February - - March - -... `readline()` also returns the newline `\n` at the end of the line, and `print()` automatically appends a newline as well - having two newlines in a row means there is a the blank line between each line with content. - -You can strip newlines (and other "whitespace" characters) from each end of a string by using the [.strip() method](http://docs.python.org/3.3/library/stdtypes.html#str.strip). We used this briefly in the Introduction to Python course. - -Can you remember how to use it? (take a look at the documentation link above to refresh your memory.) - -See if you can modify your program to strip off the extra newlines. 
- -### Solution: - -You can either replace - - print(next) - -With - - print(next.strip()) - -Or you can make it into two lines, like this: - - next = next.strip() - print(next) - -In the two line alternative you update the `next` variable to hold the "stripped" version. This might be useful if you intend to use the value of `next` again, later on. - -Strip by default removes whitespace characters from the ends of strings (including `\n` but also spaces or tabs.) Another way you can use it is to tell it specifically to *only* remove `\n`, nothing else: - - next = next.strip("\n") - print(next) - -## Reading Every Line - -`readline()` will let you read through a file line by line. However, there are two even easier ways to read an entire file this way: - - f = open("months.txt") - print(f.readlines()) - -The `readlines()` method reads all the lines in a file and returns them as a Python list. - -You can then iterate over the list of lines like this: - - f = open("months.txt") - for month in f.readlines(): - print("Month " + month.strip()) - -In fact, you don't even have to call `readlines()` - Python assumes that if you try to iterate through a text file with a for loop, you probably want to iterate through it line by line: - - f = open("months.txt") - for month in f: - print("Month " + month.strip()) - -# Writing to files - -When you `open()` a file, you can optionally specify a *file mode*, which tells Python what you want to do with the file. The default mode is `r` for read, but another mode is `w` to write to a file. - - f = open("awesomenewfile.txt", "w") - -Tip: the write (`w`) mode will write completely new contents to a file, wiping out what it had previously! - -There are actually a whole lot of file modes, `r` and `w` are just the most common. [There is a full list in the Python documentation for the open function](http://docs.python.org/3/library/functions.html#open) or you can type `open?` in IPython Notebook and run it to see the help displayed there. 
- -Can you guess how to write a string to a file in Python? - -### Hint - -File objects have a method for writing. You can find out about it by viewing the built-in help for the file object. In IPython Notebook you can type: - - f = open("awesomenewfile.txt", "w") - f? - -Or just type `f.` into IPython Notebook and then press "Tab" to view an automatic list of possible completions (NB: this only works if you've already run the cell once before to assign a value to "f".) - -### Solution - - f = open("awesomenewfile.txt", "w") - f.write("Awesome message!") - f.close() - -### Why do you use 'print' to write things on the console, but 'write' for files? - -I don't know. I think it's just been that way for as long as Python has been around. - -**There is one important difference between print and write**, `print()` automatically ends the line. `write()` doesn't, if you want to end the line you'll need to add the newline character `\n` yourself. - -Can you write a program which creates a two line text file? - -### Closing Files - -The last part of the solution is to `close()` the file when you're done. This is good practice to "clean up" after yourself. Changes may not show up in the file until you've closed it. - -You can `close()` files that you've opened for reading as well. - - -### Hint - -To look at the contents of a text file, you can open it in your text editor. - -Alternatively, if you're using OS X or Linux you can type `cat ` in a terminal (not a Python interpreter, the plain terminal shell you run Python from), to print the contents out. In the Windows terminal, you can use `type ` to do the same thing. - -### Solution: - - f = open("mylongfile.txt", "w") - f.write("First line\n") - f.write("Second line") - f.close() - -### Exercise! - -If you want to try all this out, here's a quick exercise to make sure you've got everything down pat. First, use a text editor to create a plain text file with a few lines of random text. 
Then write a Python program that: - -1. Reads all the lines from your text file into a list. -2. Appends something crazy to each line in the list. " Ya mum!" is nicely innapropriate, if you're struggling for ideas. -3. Writes all lines in that list into a new file. Check out your handy work by looking in the new file! - -When writing to the files, remember that `print()` adds a newline but with `write()` you have to add the newline yourself. - -If you're not sure where to start, one approach is to modify one of the solutions to the previous exercises. If you get stuck, remember you can always call over a coach for some advice. - -## Next Chapter - -Now we're ready to process some of the text data from the files, in [Working With Strings](strings.html). diff --git a/files/months.txt b/files/months.txt deleted file mode 100644 index 09c4565..0000000 --- a/files/months.txt +++ /dev/null @@ -1,12 +0,0 @@ -January -February -March -April -May -June -July -August -September -October -November -December \ No newline at end of file