难度: Medium
原题连接
内容描述
Write a bash script to calculate the frequency of each word in a text file words.txt.
For simplicity sake, you may assume:
words.txt contains only lowercase characters and space ' ' characters.
Each word must consist of lowercase characters only.
Words are separated by one or more whitespace characters.
Example:
Assume that words.txt has the following content:
the day is sunny the the
the sunny is is
Your script should output the following, sorted by descending frequency:
the 4
is 3
sunny 2
day 1
Note:
Don't worry about handling ties, it is guaranteed that each word's frequency count is unique.
Could you write it in one-line using Unix pipes?
思路 1
cat word.txt :
output the text in the file
tr '\n' ' ' :
substitute endlines with single space
sed "s/\s\s/ /g"* :
substitute multiple spaces with single space
awk -v RS=' ' '{print $0}' :
output one word per line by changing Record Separator in AWK to single space. $0 is the entire record (1 word).
sort :
sort alphabetically the list of words (with repetitions) to prepare it for uniq command.
uniq -c :
print the list of unique words with their count. Before uniq you need to sort the list of words.
sort -nr -k1 :
sort the list of unique words by their count (-nr numerical reverse sorting) (-k1 sort by the first field that is the count of repetitions for the current word)
awk '{print $2" "$1}' :
for each line print before the second field $2, that is the word, and then the first field that is the count of repetitions for the word.
cat words.txt | tr '\n' ' ' | sed "s/\s\s*/ /g" | awk -v RS=' ' '{print $0}' | sort | uniq -c | sort -nr -k1 | awk '{print $2" "$1}'