PE data: overlap-analysis-based adapter detection and --detect_adapter_for_pe #643

Hi Dr,
Thank you for this wonderful tool.
I am using fastp only for adapter trimming on PE data (T7, PE150; detected read1 adapter: AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA; detected read2 adapter: AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG). The insert size distribution peaks at about 140 bp because of a poor library preparation process. Compared with Cutadapt, fastp gave me a better result, which is good for me. However, I have some questions about the statistics.
1. About the option --detect_adapter_for_pe
```
# first run; "--length_required 0" is only for testing
fastp -w 20 --disable_quality_filtering  --length_required 0 -i s1_raw_1.fq -I s1_raw_2.fq -o s1_clean_1.fq -O s1_clean_2.fq --html s1.html --json s1.json
# second run
fastp -w 20 --disable_quality_filtering --detect_adapter_for_pe --length_required 0 -i s1_raw_1.fq -I s1_raw_2.fq -o s1_clean_1.fq -O s1_clean_2.fq --html s1.html --json s1.json
```
In the first run, I did not use the option --detect_adapter_for_pe, and I found that some adapter sequences located in the middle of reads (neither at the read tail nor at the read head) could not be detected or trimmed. For example:
```
>seq2_r1
--------------------------------------------------------------------------------------TGAAAGACTGTTTTTCATTGGGGAAGCGTTAAGACGAGGAGTTACTCCACAGGAAATACACGATAAGTCGGAGGCCAAGCGGTCTTAGGAAGACAAGAAGGTGTGACTGATAAGGTCGCCATGCCTCTCAGTACGTCAGCAGTTGCTGAA
                                                                                      ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>seq2_r2 (reverse complement)
TAAGGTCGCCATGCCTCTCAGTACGTCAGCAGTTGCTGAAGACGCAACTCCTTGGCTCACAGAACGACATGGCTACGATCCGACTTTGAAAGACTGTTTTTCATTGGGGAAGCGTTAAGACGAGGAGTTACTCCACAGGAAATACACGAT--------------------------------------------------------------------------------------
```
In the second run, I added --detect_adapter_for_pe to the command line, and the adapter in the read pair above was detected successfully. That is exactly what I want, but I am confused: according to the theory of overlap-analysis-based adapter detection, the adapter should already have been detected in the first run. Did I misunderstand something? Could you tell me the reason?
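To make the expectation concrete, here is a minimal Python sketch of the overlap idea for a read-through pair: when the insert is shorter than the read length, the start of read1 should match the end of the reverse complement of read2, and everything past the insert is adapter. This is a simplified illustration only, not fastp's actual implementation. The function names, the toy insert, and the thresholds (`min_overlap`, `max_mismatches`) are assumptions made for this example. The two 10-base adapter stubs are the first 10 bases of the detected read1 and read2 adapters quoted above.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def find_insert_length(r1: str, r2: str, min_overlap: int = 10, max_mismatches: int = 1):
    """Infer the insert length of a read-through pair, or return None.

    Hypothetical helper: compares the first `length` bases of read1 with the
    last `length` bases of reverse-complemented read2, trying the longest
    candidate first. The thresholds are illustrative, not fastp's defaults.
    """
    rc2 = reverse_complement(r2)
    for length in range(min(len(r1), len(rc2)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(r1[:length], rc2[-length:]))
        if mismatches <= max_mismatches:
            return length
    return None


# Toy pair: a 16 bp insert followed by the first 10 bases of each adapter.
insert = "ACGTTGCAAGGCTTAC"
r1 = insert + "AAGTCGGAGG"                      # read1 = insert + read1 adapter stub
r2 = reverse_complement(insert) + "AAGTCGGATC"  # read2 = rc(insert) + read2 adapter stub

length = find_insert_length(r1, r2)
if length is not None:
    # Everything after the inferred insert is treated as adapter and removed.
    print(length, r1[:length], r2[:length])
```

With this toy pair the overlap is found at the insert length, so both reads are cut back to the insert without needing to know the adapter sequence. This is why the seq2 pair above, which has a 64 bp overlap, looked like it should be handled by overlap analysis alone.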
2. About the "adapter or bad ligation of read"
In my data, a total of 8,648,589 read1 reads had adapters trimmed, of which 2,640,590 fall under "other adapter sequences" (because of the length of the library insert fragments). I checked all sequences that were trimmed in the second run and found that several of them show low similarity to the adapter sequence (the "adapter or bad ligation of read1" sequences from my data are in [seq.txt](https://github.com/user-attachments/files/23878005/seq.txt)). For example:
```
>seq1_r1                                                                                                  
----------------------------------------------------------------------------------------------------------------------TTGGGGATTTCGCTGGAAGCGGGAATACATATAAAAAGCACACAGCAGCGTTCTGAGAAACTGCTTTCTGATGTTTGCATTCAAGTCAAAAGTTGAACACTCCCTTTCATAGAGCAGTCCTGAAACACCCCTTTTGTAGTATCTGGAACT
                                                                                                                      ||| |||||||| |||||||||||||||||| |
>seq1_r2 (reverse complement)
ATTCTCAGAAACTTGTTTATGCTGTATCTACTCAACTAACAAAGTTGAACCTTTCTTTTGATAGAGCAGTTTTGAAATGGTCTTTTTGTGGAATCTGCAAGTGGATATTTGGCTAGTTTTGAGGATTTCGTTGGAAGCGGGAATTCATACA---------------------------------------------------------------------------------------------------------------------

>seq3_r1
-----------------------------------------------------------------------------------------------------------------------CAAAGAAGTTTCTGAGAATGCTTCTTTCTGGTTTTTATGAGAAGATATATCCTTTTTCACCATAGGACTCAAAGCGCTCGAAATGTCCACTTCCTGGTAGTGAAGAAAGAATGATTCAAACCTGCTCTATGAAAGGAAGTGTTCAACTCC
                                                                                                                       |||| |||||||||||||||||||| ||| 
>seq3_r2 (reverse complement)
CTTTCCAACGAATCCCTGTAAGCTATCCAAATATCCACCTGCAGATTCTACAAAAGGAGTGTTTCCAAAATGCTGTATCAAAACCAAGGTTCAACTCTGTTAGTTGAGGACACACATCACAAATAAGTTTCTGAGAATGCTTCTGTCTAC-----------------------------------------------------------------------------------------------------------------------
```
Although there are only about 33,000 such sequences (~0.38% = 33000/8648589 of all read1 reads with adapters trimmed), some of them are >100 bp long. Is this normal? I apologize if I have misunderstood something; please let me know. Thank you very much.

<img width="603" height="908" alt="Image" src="https://github.com/user-attachments/assets/891fbc84-fc25-4518-9f66-31bca800652a" />

