| title | Useful Programs and Unix Basics | ||||
|---|---|---|---|---|---|
| layout | single | ||||
| author | Arun Seetharam | ||||
| author_profile | true | ||||
Bioawk is an extension of the core UNIX utility awk. It adds features for manipulating biological data while keeping the familiar awk syntax. This tutorial gives a brief introduction and examples of some common tasks that can be done with this command.
Bioawk is developed by Heng Li. You can download and install it from the Git repository. On Lightning3/Condo it is already installed; just load the bioawk module to start using it.
- It automatically recognizes several popular formats and parses the fields associated with them. The format is passed to bioawk with the -c arg flag, where arg can be bed, sam, vcf, gff or fastx (for both FASTQ and FASTA). It can also handle other tabular formats through the -c header option: when header is specified, the field names are used as variable names, which greatly expands its utility.
- It provides several built-in functions (beyond the standard awk built-ins) that are specific to biological file formats. When a file is read with bioawk, its fields are parsed automatically, and you can apply functions to these variables to get the desired output. For example, after reading a FASTA file, $name and $seq hold the sequence name and the sequence respectively. You can print them with the awk built-in print, or combine print with a function, as in '{print length($seq)}', to get the length, the reverse complement and so on. Other functions include reverse, revcomp, trimq, and, or, xor etc.
- It can read gzipped (compressed) files directly.

Commonly used options:
- -t sets the input and output field separator to tab
- -c fmt reads and parses the file in the given format
- -v var=value initializes a variable with a value (standard in awk as well)
- -H retains the header in the output (for files like SAM)
- All standard awk flags also work with bioawk
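To make the revcomp function mentioned above concrete, here is a minimal sketch of what a reverse complement is, written with the standard rev and tr tools rather than bioawk. The variable names and the test sequence are made up for illustration, and ambiguity codes such as N are not handled.

```shell
# Hypothetical test sequence.
dna="AACG"

# Reverse the string, then swap each base for its complement.
rc=$(printf '%s\n' "$dna" | rev | tr 'ACGTacgt' 'TGCAtgca')

echo "$rc"   # prints CGTT
```

With bioawk, the same result should come from revcomp($seq) without any extra piping.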
For -c, you can specify bed, sam, vcf, gff, fastx or header. Bioawk parses the following variables for each format:
| bed | sam | vcf | gff | fastx |
|---|---|---|---|---|
| chrom | qname | chrom | seqname | name |
| start | flag | pos | source | seq |
| end | rname | id | feature | qual |
| name | pos | ref | start | comment |
| score | mapq | alt | end | |
| strand | cigar | qual | score | |
| thickstart | rnext | filter | filter | |
| thickend | pnext | info | strand | |
| rgb | tlen | | group | |
| blockcount | seq | | attribute | |
| blocksizes | qual | | | |
| blockstarts | | | | |
If -c header is specified, the field names (first line) are used as variables (spaces and special characters are replaced with underscores).
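The idea behind -c header can be sketched in plain awk: read the first line, remember which column each name belongs to, and then address fields by name. This is only an illustration of the concept, not how bioawk implements it; the table contents and the col array are made up for the example.

```shell
# Small made-up tab-separated table with a header line.
tsv=$(mktemp)
printf 'name\tphone\temail\tage\nJoe\t6407\ta@g.com\t24\nDoe\t4506\tb@g.com\t26\n' > "$tsv"

picked=$(awk -F'\t' -v OFS='\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map each column name to its index
    { print $col["name"], $col["email"] }                     # address fields by name
' "$tsv")

printf '%s\n' "$picked"
# Joe	a@g.com
# Doe	b@g.com
rm -f "$tsv"
```

With -c header, bioawk takes care of this bookkeeping, so the fields can be written as $name and $email directly.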
Once a FASTA file is read, the sequence name from the defline is available as the $name variable and the sequence as the $seq variable. You can use any of the standard awk functions on these, as well as the bioawk functions. Some examples:
bioawk -c fastx '{ print $name, length($seq) }' input.fasta
bioawk -c fastx '{ print $name, gc($seq) }' input.fasta
bioawk -c fastx '{ print ">"$name;print revcomp($seq) }' input.fasta
bioawk -c fastx 'length($seq) > 100{ print ">"$name; print $seq }' input.fasta
bioawk -c fastx '{ print ">PREFIX"$name; print $seq }' input.fasta
bioawk -c fastx '{ print ">"$name"|SUFFIX"; print $seq }' input.fasta
bioawk -t -c fastx '{ print $name, $seq }' input.fasta
# extract the sequences whose names are listed in IDs.txt (for large-scale extraction, use cdbyank instead)
bioawk -cfastx 'BEGIN{while((getline k <"IDs.txt")>0)i[k]=1}{if(i[$name])print ">"$name"\n"$seq}' input.fasta
These are just some examples; many more tasks can be done by combining these with other standard awk functions.
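For comparison, here is a rough plain awk sketch of the name and length example. Without the fastx parser, the script has to join wrapped sequence lines itself. The file contents and variable names are invented for illustration.

```shell
# Made-up FASTA file; seq1 is wrapped over two lines.
fa=$(mktemp)
printf '>seq1 first\nACGT\nAC\n>seq2\nGGG\n' > "$fa"

lengths=$(awk '
    /^>/ { if (name != "") print name "\t" len    # flush the previous record
           name = substr($1, 2); len = 0; next }  # drop the leading ">" and reset
    { len += length($0) }                         # add up wrapped sequence lines
    END  { if (name != "") print name "\t" len }
' "$fa")

printf '%s\n' "$lengths"
# seq1	6
# seq2	3
rm -f "$fa"
```

The bioawk version above expresses the same thing as a single print statement, because $name and $seq are already assembled per record.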
For FASTQ input, the -c fastx option stays the same: bioawk automatically recognizes the FASTQ format and builds the required variables, namely $name, $seq, $qual and $comment.
bioawk -t -c fastx 'END {print NR}' input.fastq
# note: NR counts records here, and each FASTQ record spans 4 lines
bioawk -c fastx '{print ">"$name; print $seq}' input.fastq
bioawk -c fastx '{print ">"$name; print meanqual($qual)}' input.fastq
bioawk -cfastx 'length($seq) > 10 {print "@"$name"\n"$seq"\n+\n"$qual}' input.fastq
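bioawk -c fastx 'trimq(30, 0, 5){print $0}' input.fastq
# intended to trim bases with quality scores below 30 between positions 0 and 5 (beginning to end); the bioawk README describes the signature as trimq(qual, beg, end, param), so check this call against your installed version.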
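As a sanity check on the record count, the number of reads can also be estimated from the line count, assuming every record occupies exactly 4 lines (no wrapped sequences). The file below is a made-up two-read example.

```shell
# Made-up FASTQ file with two reads.
fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n' > "$fq"

# Each unwrapped FASTQ record is 4 lines: header, sequence, "+", qualities.
lines=$(wc -l < "$fq")
reads=$((lines / 4))

echo "$reads"   # prints 2
rm -f "$fq"
```

If this number disagrees with the NR value reported by bioawk, the file probably contains wrapped or truncated records.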
bioawk -c bed '{ print $end - $start }' test.bed
bioawk -c sam 'and($flag,4)' input.sam
bioawk -c sam -H '!and($flag,4)' input.sam
bioawk -c sam '{ s=$seq; if(and($flag, 16)) {s=revcomp($seq) } print ">"$qname"\n"s}' input.sam > output.fasta
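The and($flag, N) calls in the SAM examples are bitwise tests on the FLAG field: bit 4 marks an unmapped read and bit 16 marks a read aligned to the reverse strand. The same tests can be sketched with shell arithmetic; the describe_flag helper is hypothetical and exists only for this illustration.

```shell
# Decode two SAM flag bits with shell arithmetic (hypothetical helper).
describe_flag() {
    flag=$1
    if [ $(( flag & 4 )) -ne 0 ]; then state="unmapped"; else state="mapped"; fi
    if [ $(( flag & 16 )) -ne 0 ]; then strand="reverse"; else strand="forward"; fi
    echo "$state $strand"
}

describe_flag 0    # prints: mapped forward
describe_flag 16   # prints: mapped reverse
describe_flag 4    # prints: unmapped forward
```

This is why the third SAM example reverse-complements the sequence when and($flag, 16) is non-zero: SAM stores reverse-strand reads as they appear on the reference.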
grep -v ^## in.vcf | bioawk -tc hdr '{print $foo,$bar}'
Suppose your input file is as follows:
| name | phone | email | age |
|---|---|---|---|
| Joe | 6407 | a@g.com | 24 |
| Doe | 4506 | b@g.com | 26 |
bioawk -t -c header '$age < 25 {print $0}' input.txt
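When filtering on a numeric column, it matters whether the threshold is written as a number or as a quoted string. In standard awk, comparing a field against a string constant is done character by character, so a value like 9 would not count as less than "25". The sketch below shows the difference with plain awk; bioawk is assumed to follow the same comparison rules, and the ages are made up.

```shell
# Ages 9, 24 and 26: only 26 should fail an "under 25" filter.
ages=$(printf '9\n24\n26\n')

as_string=$(printf '%s\n' "$ages" | awk '$1 < "25"')   # string comparison
as_number=$(printf '%s\n' "$ages" | awk '$1 < 25')     # numeric comparison

echo "string comparison keeps: $(echo $as_string)"     # 24
echo "numeric comparison keeps: $(echo $as_number)"    # 9 24
```

The quoted form happens to work for two-digit ages but silently drops values with a different number of digits, so an unquoted number is the safer choice.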