PE data: overlap-analysis-based adapter detection and --detect_adapter_for_pe #643

@Pai-Shenglei

Hi Dr,
Thank you for this wonderful tool.
I was using fastp for adapter trimming only on PE data (T7, PE150; detected read1 adapter: AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA; detected read2 adapter: AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG). The insert size peaks at about 140 bp because of a poor library preparation process. Compared with Cutadapt, fastp gave me a better result, which is good for me. However, I have some questions about the statistics.

1. About the option --detect_adapter_for_pe
# first run; "--length_required 0" is only for testing
fastp -w 20 --disable_quality_filtering  --length_required 0 -i s1_raw_1.fq -I s1_raw_2.fq -o s1_clean_1.fq -O s1_clean_2.fq --html s1.html --json s1.json
# second run
fastp -w 20 --disable_quality_filtering --detect_adapter_for_pe --length_required 0 -i s1_raw_1.fq -I s1_raw_2.fq -o s1_clean_1.fq -O s1_clean_2.fq --html s1.html --json s1.json

In the first run, I did not use the option --detect_adapter_for_pe, and I found that adapter sequences located in the middle of reads (neither at the read tail nor at the read head) were not detected or trimmed. For example:

>seq2_r1
--------------------------------------------------------------------------------------TGAAAGACTGTTTTTCATTGGGGAAGCGTTAAGACGAGGAGTTACTCCACAGGAAATACACGATAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAAGAAGGTGTGACTGATAAGGTCGCCATGCCTCTCAGTACGTCAGCAGTTGCTGAA
                                                                                      ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>seq2_r2 (reverse complement)
TAAGGTCGCCATGCCTCTCAGTACGTCAGCAGTTGCTGAAGACGCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTTGAAAGACTGTTTTTCATTGGGGAAGCGTTAAGACGAGGAGTTACTCCACAGGAAATACACGAT--------------------------------------------------------------------------------------

In the second run, I added --detect_adapter_for_pe to the command line, and the adapter in the read pair above was detected successfully. That is exactly what I want, but I am confused: according to the overlap-analysis-based adapter detection theory, the adapter should already have been detected in the first run. Did I misunderstand something? Could you tell me the reason?
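
To make the expectation concrete, here is a minimal Python sketch of the overlap idea as I understand it. It is not fastp's actual implementation: the function names, the `min_overlap` and `max_mismatches` thresholds, and the toy reads are all hypothetical. It assumes that when the insert is shorter than the read length, the insert is a prefix of read1 and a suffix of the reverse complement of read2, so everything after the insert can be trimmed from both reads.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def find_insert_length(r1: str, r2: str, min_overlap: int = 10, max_mismatches: int = 1):
    """Find the longest insert length L such that r1[:L] matches the last L
    bases of reverse_complement(r2) with at most max_mismatches mismatches.
    Returns None when no such overlap of at least min_overlap bases exists."""
    r2_rc = reverse_complement(r2)
    for length in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        head = r1[:length]
        tail = r2_rc[len(r2_rc) - length:]
        mismatches = sum(a != b for a, b in zip(head, tail))
        if mismatches <= max_mismatches:
            return length
    return None


def trim_pair(r1: str, r2: str, **kwargs):
    """Trim both reads to the detected insert length; leave them unchanged
    when no overlap is found."""
    length = find_insert_length(r1, r2, **kwargs)
    if length is None:
        return r1, r2
    return r1[:length], r2[:length]


# Toy pair (hypothetical): a 14 bp insert followed by 6 bp of adapter read-through.
insert = "ACGTTGCAAGGCTA"
r1 = insert + "AAGTCG"
r2 = reverse_complement(insert) + "AAGTCG"

print(find_insert_length(r1, r2))  # 14
print(trim_pair(r1, r2))
```

With this logic no adapter sequence is needed at all, which is why I expected the first run to trim the pair shown above.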
2. About the "adapter or bad ligation of read"
In my data, read1 has a total of 8,648,589 reads with adapter trimmed, of which 2,640,590 fall under "other adapter sequences" (because of the length of the library insert fragments). I checked all sequences that were trimmed in the second run and found several that show low similarity to the adapter sequence (these "adapter or bad ligation of read1" sequences from my data are in seq.txt). For example:

>seq1_r1                                                                                                  
----------------------------------------------------------------------------------------------------------------------TTGGGGATTTCGCTGGAAGCGGGAATACATATAAAAAGCACACAGCAGCGTTCTGAGAAACTGCTTTCTGATGTTTGCATTCAAGTCAAAAGTTGAACACTCCCTTTCATAGAGCAGTCCTGAAACACCCCTTTTGTAGTATCTGGAACT
                                                                                                                      ||| |||||||| |||||||||||||||||| |
>seq1_r2 (reverse complement)
ATTCTCAGAAACTTGTTTATGCTGTATCTACTCAACTAACAAAGTTGAACCTTTCTTTTGATAGAGCAGTTTTGAAATGGTCTTTTTGTGGAATCTGCAAGTGGATATTTGGCTAGTTTTGAGGATTTCGTTGGAAGCGGGAATTCATACA---------------------------------------------------------------------------------------------------------------------

>seq3_r1
-----------------------------------------------------------------------------------------------------------------------CAAAGAAGTTTCTGAGAATGCTTCTTTCTGGTTTTTATGAGAAGATATATCCTTTTTCACCATAGGACTCAAAGCGCTCGAAATGTCCACTTCCTGGTAGTGAAGAAAGAATGATTCAAACCTGCTCTATGAAAGGAAGTGTTCAACTCC
                                                                                                                       |||| |||||||||||||||||||| ||| 
>seq3_r2 (reverse complement)
CTTTCCAACGAATCCCTGTAAGCTATCCAAATATCCACCTGCAGATTCTACAAAAGGAGTGTTTCCAAAATGCTGTATCAAAACCAAGGTTCAACTCTGTTAGTTGAGGACACACATCACAAATAAGTTTCTGAGAATGCTTCTGTCTAC-----------------------------------------------------------------------------------------------------------------------

Although there are only about 33,000 such sequences (~0.38% = 33,000/8,648,589 of all read1 reads with adapter trimmed), some of them are >100 bp long. Is this normal? I apologize if I have misunderstood something, but please let me know. Thank you very much.
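
For reference, this is roughly how the similarity between a trimmed sequence and the adapter can be checked. It is a minimal sketch: `adapter_identity` is a hypothetical helper, not part of fastp, and it only does a position-by-position comparison against the start of the adapter, with no indels or shifted alignments.

```python
def adapter_identity(trimmed: str, adapter: str) -> float:
    """Fraction of matching bases when the trimmed sequence is compared
    position by position with the start of the adapter."""
    n = min(len(trimmed), len(adapter))
    if n == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(trimmed[:n], adapter[:n]))
    return matches / n


# Read1 adapter as reported by fastp for this data set.
READ1_ADAPTER = "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"

# seq2_r1: bases following the aligned region (a genuine adapter).
print(adapter_identity("AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAAGAAGG", READ1_ADAPTER))

# seq1_r1: bases following the aligned region (low similarity to the adapter).
print(adapter_identity("AAAAGCACACAGCAGCGTTCTGAGAAACTGCTTTC", READ1_ADAPTER))
```

A value near 1.0 indicates a real adapter; a value near 0.25 is what unrelated sequence would give by chance.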
