Commit e1a5122

Progress on Unix Data Tools
mikblack committed
1 parent 5d9e2f7 commit e1a5122

3 files changed

Lines changed: 162640 additions & 0 deletions

4.UnixDataTools/README.md

Lines changed: 183 additions & 0 deletions
@@ -1,2 +1,185 @@
## Unix Data Tools

### `head` and `tail` for inspecting text files

Use `head` to view the first few lines (10 by default) of the file `Mus_musculus.GRCm38.75_chr1.bed`:

```
head Mus_musculus.GRCm38.75_chr1.bed
```

```
1 3054233 3054733
1 3054233 3054733
1 3054233 3054733
1 3102016 3102125
1 3102016 3102125
1 3102016 3102125
1 3205901 3671498
1 3205901 3216344
1 3213609 3216344
1 3205901 3207317
```

We can choose the number of lines to display using the `-n` switch:

```
head -n 3 Mus_musculus.GRCm38.75_chr1.bed
```

```
1 3054233 3054733
1 3054233 3054733
1 3054233 3054733
```

Use `tail` to view the last three lines of the file `Mus_musculus.GRCm38.75_chr1.bed`:

```
tail -n 3 Mus_musculus.GRCm38.75_chr1.bed
```

```
1 195240910 195241007
1 195240910 195241007
1 195240910 195241007
```

View the start *and* the end of a file with a single command:

```
(head -n 2; tail -n 2) < Mus_musculus.GRCm38.75_chr1.bed
```

```
1 3054233 3054733
1 3054233 3054733
1 195240910 195241007
1 195240910 195241007
```
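
The same idea can be wrapped in a small shell function. The sketch below is an illustration only: the function name `peek` and the demo file `peek_demo.txt` are hypothetical, and it assumes the input is a large regular file redirected with `<`. With a pipe, or a very small file, `head` may read past the lines it prints, so `tail` may see less than expected.

```shell
# peek: print the first and last N lines of a file (N defaults to 2).
# Hypothetical helper for illustration; assumes a regular file as input.
peek() {
    file=$1
    n=${2:-2}
    (head -n "$n"; tail -n "$n") < "$file"
}

# Demo on a throwaway 100000-line file (hypothetical name).
seq 1 100000 > peek_demo.txt
peek peek_demo.txt 3
```

Calling `peek Mus_musculus.GRCm38.75_chr1.bed` with no second argument would be expected to behave like the two-line command above.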

Controlling output with `head` (and a pipe):

```
grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf | head -n 1
```

```
1 protein_coding gene 6206197 6276648 . + . gene_id "ENSMUSG00000025907"; gene_name "Rb1cc1"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
```

Note that using `head` speeds things up: in the example above, `head` exits after printing one line, and `grep` is then stopped the next time it tries to write to the closed pipe, instead of scanning the rest of the file.

It takes longer if you run the command without `head` (use `time` to check):

```
time grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf
```

```
grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf 0.56s user 0.02s system 94% cpu 0.611 total
```

With `head`:

```
time (grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf | head -n 1)
```

```
( grep 'gene_id "ENSMUSG00000025907"' Mus_musculus.GRCm38.75_chr1.gtf | head ) 0.01s user 0.01s system 77% cpu 0.021 total
```

This is worth remembering when you are testing commands while building a workflow.
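
The early exit can be seen without the GTF file. The sketch below is an illustration under stated assumptions: it uses `seq` to generate a million numbers as a stand-in for a large file and searches for lines containing `7`. Neither the data nor the pattern comes from the example above.

```shell
# Stand-in for a large input: the numbers 1 to 1000000, one per line.
# grep would match many lines, but head exits after the first one,
# which stops the rest of the pipeline early.
first_match=$(seq 1 1000000 | grep '7' | head -n 1)
echo "$first_match"
```

Running the pipeline under `time`, with and without `head -n 1`, is one way to compare the two on your own machine.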

### `less` (and `more`)

`less` and `more` are interactive programs for inspecting files. Stick with `less`: it has *more* features.

```
less Mus_musculus.GRCm38.75_chr1.gtf
```

Type `q` to exit.

Other shortcuts:

| Shortcut | Action |
|----------|--------|
| `space` | Next page |
| `b` | Previous page |
| `g` | First line |
| `G` | Last line |
| `j` | Down (one line) |
| `k` | Up (one line) |
| `/<pattern>` | Search down (forward) for the string `<pattern>` |
| `?<pattern>` | Search up (backward) for the string `<pattern>` |
| `n` | Repeat last search downward (forward) |
| `N` | Repeat last search upward (backward) |

Try searching for the gene `Sgk3` in the file `Mus_musculus.GRCm38.75_chr1.gtf` using `less`.

### Summary information for text files

By default, the `wc` command prints the line, word, and byte counts for a file, in that order:

```
wc Mus_musculus.GRCm38.75_chr1.bed
```

```
81226 243678 1698545 Mus_musculus.GRCm38.75_chr1.bed
```

You can select which counts to print using:

- `-c`: bytes
- `-l`: lines
- `-m`: characters
- `-w`: words

For example, to check how many lines there are in the file `Mus_musculus.GRCm38.75_chr1.bed`:

```
wc -l Mus_musculus.GRCm38.75_chr1.bed
```

```
81226 Mus_musculus.GRCm38.75_chr1.bed
```
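
When the count is needed inside a script, the file name in the output gets in the way. A common idiom, sketched here with a hypothetical throwaway file `wc_demo.txt` rather than the BED file, is to feed the file to `wc -l` on standard input so that only the number is printed.

```shell
# Create a five-line demo file (hypothetical name).
seq 1 5 > wc_demo.txt

# Reading from standard input means wc has no file name to print.
n_lines=$(wc -l < wc_demo.txt)

# Arithmetic expansion also tolerates the leading spaces that some
# wc implementations add.
echo "wc_demo.txt has $((n_lines)) lines"
```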

You can also check multiple files at once:

```
wc -l Mus_musculus.GRCm38.75_chr1.bed Mus_musculus.GRCm38.75_chr1.gtf
81226 Mus_musculus.GRCm38.75_chr1.bed
81231 Mus_musculus.GRCm38.75_chr1.gtf
162457 total
```

Why is there a five-line difference between the files? Use `head` to check:

```
head -n 6 Mus_musculus.GRCm38.75_chr1.gtf
```

```
#!genome-build GRCm38.p2
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.4
#!genebuild-last-updated 2013-09
1 pseudogene gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";
```

This is not a definitive answer, but `Mus_musculus.GRCm38.75_chr1.gtf` has a five-line header, which would account for the difference.
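
One way to firm this up is to count the comment lines directly. The sketch below assumes that header lines are exactly those starting with `#`, and uses a small made-up file (`header_demo.gtf`, with invented contents) in place of the real GTF.

```shell
# Made-up GTF-like file: two header lines followed by three data lines.
printf '%s\n' '#!genome-build demo' '#!genome-date demo' \
    '1 gene 100 200' '1 exon 100 150' '1 exon 160 200' > header_demo.gtf

# grep -c prints the number of matching lines; '^#' matches lines
# that begin with a hash character.
n_header=$(grep -c '^#' header_demo.gtf)
n_total=$(wc -l < header_demo.gtf)

echo "header lines: $((n_header))"
echo "data lines:   $((n_total - n_header))"
```

Running the same `grep -c '^#'` on `Mus_musculus.GRCm38.75_chr1.gtf` would show whether the header fully explains the five-line difference.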

### Column data

Coming soon...

And then: `grep`, `hexdump`, `sort`, `uniq`, `join`, `awk`, `sed`, regex...
