|
1 | 1 | ## Unix Data Tools |
2 | 2 |
|
| 3 | +### `head` and `tail` for inspecting text files |
| 4 | + |
| 5 | +Use `head` to view the first few lines (10 by default) of the file `Mus_musculus.GRCm38.75_chr1.bed`: |
| 6 | + |
| 7 | +``` |
| 8 | +head Mus_musculus.GRCm38.75_chr1.bed |
| 9 | +``` |
| 10 | + |
| 11 | +``` |
| 12 | +1 3054233 3054733 |
| 13 | +1 3054233 3054733 |
| 14 | +1 3054233 3054733 |
| 15 | +1 3102016 3102125 |
| 16 | +1 3102016 3102125 |
| 17 | +1 3102016 3102125 |
| 18 | +1 3205901 3671498 |
| 19 | +1 3205901 3216344 |
| 20 | +1 3213609 3216344 |
| 21 | +1 3205901 3207317 |
| 22 | +``` |
| 23 | + |
| 24 | +We can choose the number of lines to display using the `-n` switch: |
| 25 | + |
| 26 | +``` |
| 27 | +head -n 3 Mus_musculus.GRCm38.75_chr1.bed |
| 28 | +``` |
| 29 | + |
| 30 | +``` |
| 31 | +1 3054233 3054733 |
| 32 | +1 3054233 3054733 |
| 33 | +1 3054233 3054733 |
| 34 | +``` |
| 35 | + |
| 36 | +Use `tail` to view the last three lines of the file `Mus_musculus.GRCm38.75_chr1.bed`: |
| 37 | + |
| 38 | +``` |
| 39 | +tail -n 3 Mus_musculus.GRCm38.75_chr1.bed |
| 40 | +``` |
| 41 | + |
| 42 | +``` |
| 43 | +1 195240910 195241007 |
| 44 | +1 195240910 195241007 |
| 45 | +1 195240910 195241007 |
| 46 | +``` |
| 47 | + |
| 48 | +View the start *and* the end of a file with a single command: |
| 49 | + |
| 50 | +``` |
| 51 | +(head -n 2; tail -n 2) < Mus_musculus.GRCm38.75_chr1.bed |
| 52 | +``` |
| 53 | + |
| 54 | +``` |
| 55 | +1 3054233 3054733 |
| 56 | +1 3054233 3054733 |
| 57 | +1 195240910 195241007 |
| 58 | +1 195240910 195241007 |
| 59 | +``` |
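The subshell form above depends on `head` and `tail` sharing a single seekable input, so it may not behave the same way when the data arrives through a pipe. As a minimal sketch, a small helper can read the file twice instead; the name `headtail` and its arguments are hypothetical and not part of the original material:

```shell
# headtail: print the first and last N lines of a file (hypothetical helper).
# Usage: headtail FILE [N]   (N defaults to 2)
headtail() {
    file=$1
    n=${2:-2}
    head -n "$n" "$file"   # first N lines
    tail -n "$n" "$file"   # last N lines
}
```

Calling `headtail Mus_musculus.GRCm38.75_chr1.bed 2` should print the same four lines as the subshell version, at the cost of opening the file twice.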
| 60 | + |
| 61 | +Use `head` at the end of a pipe to limit the output of another command: |
| 62 | + |
| 63 | +``` |
| 64 | +grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf | head -n 1 |
| 65 | +``` |
| 66 | + |
| 67 | +``` |
| 68 | +1 protein_coding gene 6206197 6276648 . + . gene_id "ENSMUSG00000025907"; gene_name "Rb1cc1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; |
| 69 | +``` |
| 70 | + |
| 71 | +Note that using `head` also speeds things up: in the above example, `head` exits after printing one line, which closes the pipe, and `grep` is terminated the next time it tries to write to it rather than scanning the whole file. |
| 72 | + |
| 73 | +It takes longer if you run the command without `head` (use `time` to check): |
| 74 | + |
| 75 | +``` |
| 76 | +time grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf |
| 77 | +``` |
| 78 | + |
| 79 | +``` |
| 80 | +grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf 0.56s user 0.02s system 94% cpu 0.611 total |
| 81 | +``` |
| 82 | + |
| 83 | +With `head`: |
| 84 | + |
| 85 | +``` |
| 86 | +time (grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf | head -n 1) |
| 87 | +``` |
| 88 | + |
| 89 | +``` |
| 90 | +( grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf | head ) 0.01s user 0.01s system 77% cpu 0.021 total |
| 91 | +``` |
| 92 | + |
| 93 | +This is worth remembering when you are testing commands while building a workflow. |
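The early exit is easiest to see with a producer that never finishes on its own. As a small illustration (not from the original; the function name `first_three` is hypothetical), `yes` prints `y` forever, yet the pipeline below returns almost immediately because `head` exits after three lines and `yes` is stopped on its next write to the closed pipe:

```shell
# `yes` would run forever on its own; `head` exits after three lines,
# closing the pipe, which stops `yes` as well.
first_three() {
    yes | head -n 3
}
```

Running `first_three` prints three lines and returns to the prompt. Note that in shells run with `set -o pipefail` the pipeline reports a non-zero status, because the producer was terminated by a signal rather than finishing normally.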
| 94 | + |
| 95 | +### `less` (and `more`) |
| 96 | + |
| 97 | +Both `less` and `more` are pagers; stick with `less`, as it has *more* features. |
| 98 | + |
| 99 | +`less` is an interactive program for inspecting files: |
| 100 | + |
| 101 | +``` |
| 102 | +less Mus_musculus.GRCm38.75_chr1.gtf |
| 103 | +``` |
| 104 | + |
| 105 | +Type `q` to exit. |
| 106 | + |
| 107 | +Other shortcuts: |
| 108 | + |
| 109 | +| Shortcut | Action| |
| 110 | +|----------|-------| |
| 111 | +| space | Next page | |
| 112 | +| b | Previous page | |
| 113 | +| g | First line | |
| 114 | +| G | Last line | |
| 115 | +| j | Down (one line) | |
| 116 | +| k | Up (one line) | |
| 117 | +| `/<pattern>` | Search down (forward) for string `<pattern>` | |
| 118 | +| `?<pattern>` | Search up (backward) for string `<pattern>` | |
| 119 | +| n | Repeat last search downward (forward) | |
| 120 | +| N | Repeat last search upward (backward) | |
| 121 | + |
| 122 | + |
| 123 | +Try searching for the gene `Sgk3` in the file `Mus_musculus.GRCm38.75_chr1.gtf` using `less`. |
| 124 | + |
| 125 | + |
| 126 | +### Summary information for text files |
| 127 | + |
| 128 | +By default, the `wc` command prints the line, word and byte counts for a file, in that order: |
| 129 | + |
| 130 | +``` |
| 131 | +wc Mus_musculus.GRCm38.75_chr1.bed |
| 132 | +``` |
| 133 | + |
| 134 | +``` |
| 135 | +81226 243678 1698545 Mus_musculus.GRCm38.75_chr1.bed |
| 136 | +``` |
| 137 | + |
| 138 | +You can select which counts to print using: |
| 139 | + |
| 140 | + - `-c`: bytes |
| 141 | + - `-l`: lines |
| 142 | + - `-m`: characters |
| 143 | + - `-w`: words |
| 144 | + |
| 145 | +For example, to check how many lines there are in the file `Mus_musculus.GRCm38.75_chr1.bed`: |
| 146 | + |
| 147 | +``` |
| 148 | +wc -l Mus_musculus.GRCm38.75_chr1.bed |
| 149 | +``` |
| 150 | + |
| 151 | +``` |
| 152 | +81226 Mus_musculus.GRCm38.75_chr1.bed |
| 153 | +``` |
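When the count feeds into a script, the file name in the output gets in the way. A common idiom, sketched here with a hypothetical helper named `count_lines`, is to redirect the file into `wc -l` so that only the number is printed. Some `wc` implementations pad the number with leading spaces, which the arithmetic expansion removes:

```shell
# count_lines: print only the number of lines in a file (hypothetical helper).
count_lines() {
    n=$(wc -l < "$1")   # reading from stdin means wc prints no file name
    echo "$((n))"       # arithmetic expansion strips any padding
}
```

For the file above, `count_lines Mus_musculus.GRCm38.75_chr1.bed` should print just `81226`, matching the `wc -l` output already shown.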
| 154 | + |
| 155 | +You can also check multiple files at once: |
| 156 | + |
| 157 | +``` |
| 158 | +wc -l Mus_musculus.GRCm38.75_chr1.bed Mus_musculus.GRCm38.75_chr1.gtf |
| 159 | + 81226 Mus_musculus.GRCm38.75_chr1.bed |
| 160 | + 81231 Mus_musculus.GRCm38.75_chr1.gtf |
| 161 | + 162457 total |
| 162 | +``` |
| 163 | + |
| 164 | +Why is there a five-line difference between the files? Use `head` to check: |
| 165 | + |
| 166 | +``` |
| 167 | +head -n 6 Mus_musculus.GRCm38.75_chr1.gtf |
| 168 | +``` |
| 169 | + |
| 170 | +``` |
| 171 | +#!genome-build GRCm38.p2 |
| 172 | +#!genome-version GRCm38 |
| 173 | +#!genome-date 2012-01 |
| 174 | +#!genome-build-accession NCBI:GCA_000001635.4 |
| 175 | +#!genebuild-last-updated 2013-09 |
| 176 | +1 pseudogene gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; |
| 177 | +``` |
| 178 | + |
| 179 | +This is not a definitive answer, but `Mus_musculus.GRCm38.75_chr1.gtf` has a five-line header, which would account for the difference. |
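One way to firm this up is to count header and annotation lines separately with `grep -c`. This is a sketch under the assumption that every header line starts with `#` and no annotation line does; the helper names are hypothetical:

```shell
# Count lines that start with '#' (assumed here to be header lines).
count_header_lines() {
    grep -c '^#' "$1"
}

# Count all remaining lines (-v inverts the match).
count_data_lines() {
    grep -vc '^#' "$1"
}
```

If `count_header_lines Mus_musculus.GRCm38.75_chr1.gtf` reports 5 and `count_data_lines` reports 81226, the header fully explains the difference from the BED file.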
| 180 | + |
| 181 | +### Column data |
| 182 | + |
| 183 | +Coming soon... |
| 184 | + |
| 185 | +And then: `grep`, `hexdump`, `sort`, `uniq`, `join`, `awk`, `sed`, regex... |