Question: Should I use reads with good quality but a failed-vendor flag? (Biostars thread on vendor quality)
Published: 2019-06-20


https://www.biostars.org/p/198405/

Quick question: I have some mapped reads in a BAM file that have good read quality, but they carry the SAM flag 0x200, which means they did not pass the vendor check. Should I include them in downstream analysis or not?

Long question: what's the relationship between the read quality score and the Chastity score?

First, the read quality score, which everybody probably knows:

The read quality score (Phred score) is calculated as Q = -10 * log10(P(error_base)), where P(error_base) is the probability that the base call is incorrect.
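To make the Phred definition concrete, here is a minimal Python sketch of the conversion in both directions. The function names are made up for illustration, and the Phred+33 ASCII offset used to decode a FASTQ quality string is an assumption (it is the common modern encoding, but older Illumina data may use a different offset).

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P into a Phred score, Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_fastq_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The default offset of 33 (Phred+33) is an assumption about the file encoding.
    """
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    # Q20 corresponds to a 1-in-100 chance of a wrong base call.
    print(phred_to_error_prob(20))
    print(decode_fastq_qualities("I5!"))
```

Note that this score is assigned per base, whereas the vendor filter discussed next is a per-read (per-cluster) decision.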

Second, I want to talk about the Chastity score used during the vendor check:

For reads in FASTQ format, there is a header field 'Y/N' that indicates whether the read was filtered out (Y means the read failed the filter, N means it passed). The corresponding SAM flag is 0x200, indicating "not passing filters, such as platform/vendor quality controls". How does Illumina set the filtering criteria?

As far as I know, read filtering by Illumina Real Time Analysis (RTA) happens during the run, and filtering is determined by the Chastity score. The Chastity score is calculated as “the ratio of the highest of the four (base type) intensities to the sum of highest two”. Illumina describes the vendor check as follows:
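As a small illustration of the two markers mentioned here, the sketch below tests the 0x200 bit of a SAM FLAG value and reads the Y/N field from a FASTQ header. It assumes the Illumina header layout in which the comment part after the space is `read:is_filtered:control_number:index`; the function names and the example header are hypothetical.

```python
FAILED_QC_FLAG = 0x200  # SAM FLAG bit: not passing platform/vendor quality controls


def is_vendor_failed(sam_flag):
    """Return True if the 0x200 bit is set in a SAM FLAG value."""
    return bool(sam_flag & FAILED_QC_FLAG)


def fastq_header_is_filtered(header):
    """Return True if the FASTQ header marks the read as filtered (Y).

    Assumes the header has a comment part 'read:is_filtered:control:index'
    after the first space; other header layouts raise ValueError.
    """
    parts = header.split(" ", 1)
    if len(parts) != 2:
        raise ValueError("header has no comment field")
    fields = parts[1].split(":")
    if len(fields) < 2 or fields[1] not in ("Y", "N"):
        raise ValueError("no Y/N filter field found")
    return fields[1] == "Y"


if __name__ == "__main__":
    # Made-up header for illustration only.
    print(fastq_header_is_filtered("@INSTR:1:FLOWCELL:1:15:6329:1045 1:Y:0:ACGT"))
    print(is_vendor_failed(512 + 4))  # unmapped and vendor failed
```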

"To remove the least reliable data from the analysis, the raw data can be filtered to remove any clusters that have “too much” intensity corresponding to bases other than the called base. By default, the purity of the signal from each cluster is examined over the first 25 cycles and calculated as Chastity = Highest_Intensity / (Highest_Intensity + Next_Highest_Intensity) for each cycle. The new default filtering implemented at the base calling stage allows at most one cycle that is less than the Chastity threshold. The higher the value, the better. This value is very dependent on cluster density, since the major cause of an impure signal in the early cycles is the presence of another cluster within a few micrometers."
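The rule in the quote can be sketched in a few lines of Python. This is only an illustration of the quoted description, not Illumina's actual implementation: the threshold of 0.6 is an assumption (the quote does not state a value), intensities are assumed to be non-negative, and the function names are hypothetical.

```python
def chastity(intensities):
    """Chastity for one cycle: highest / (highest + next highest) intensity."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    # Treating an all-zero signal as chastity 0.0 is a choice made for this sketch.
    return top / total if total > 0 else 0.0


def passes_chastity_filter(cycles, threshold=0.6, n_cycles=25, max_failures=1):
    """Apply the quoted rule: at most one of the first 25 cycles may fall below the threshold.

    `cycles` is a list of per-cycle intensity lists (four channels each).
    The threshold value is an assumption, not taken from the quote.
    """
    failures = sum(1 for c in cycles[:n_cycles] if chastity(c) < threshold)
    return failures <= max_failures


if __name__ == "__main__":
    clean = [100, 10, 5, 5]   # one dominant channel
    mixed = [50, 50, 0, 0]    # two channels equally strong, e.g. overlapping clusters
    print(passes_chastity_filter([clean] * 24 + [mixed]))
    print(passes_chastity_filter([mixed, mixed] + [clean] * 23))
```

The sketch also shows why a read can fail this filter and still carry high Phred scores later on: only the first 25 cycles are examined, and the decision is about signal purity of the cluster, not about individual base qualities.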

So, to my understanding, in every cycle in which the sequencer scans a cluster, there are four signals, one per base (am I right?), and the base with the strongest signal is the one called. The larger the gap between the strongest and the second-strongest signal, the better for base calling. For the first 25 cycles, Illumina allows at most one cycle whose Chastity falls below the threshold; otherwise, the read is marked as vendor failed. Is my understanding right so far?

But what is the relationship between the Phred score and the Chastity score, if there is one? Can I still use vendor-failed reads if they have high Phred scores?

Thanks! Tao


Curious. Why are the vendor failed reads in your dataset?

I downloaded the BAM file from GTEx (dbGaP). The BAM file contains all the reads, including mapped, unmapped, and vendor-failed reads. For a sample with ~100M reads, ~12M are labeled as vendor failed, including both mapped and unmapped reads. Some of the vendor-failed reads have good read quality, so I'm not sure whether I should include them.


I second this comment. You should contact your vendor. I have never seen reads failing the filtering step indicated in the header field of a FASTQ file being given to a client. Why include these reads? They just take up storage space, and are likely to induce errors in the downstream analysis. There was either an error in the setting of the flag, or a mistake in giving you the reads.

I checked the GTEx Project FAQ. The alignment was probably done in 2012, since TopHat v1.4.1 was used. This was the very dawn of RNA-Seq. The analyses dating back to this period are often suspicious since bioinformaticians were not yet familiar with RNA-Seq, and the software programs contained bugs more often than not. My recommendation is always to treat with suspicion any analysis results dating back to this period. Most likely, those preparing the data were not aware yet that these reads should be filtered out.

I would filter out all the "vendor failed reads", and redo the alignment using a more recent aligner, genome, and annotation. At least, that would be my recommendation based on my knowledge. To get a definitive answer, you could contact the staff at the GTEx project.
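As a rough sketch of the filtering step suggested here, the Python generator below drops every alignment line whose FLAG has the 0x200 bit set from a stream of text SAM lines. The function name and the toy records are hypothetical. On a real BAM file a dedicated tool would normally be used instead; for example, `samtools view -F 0x200` excludes reads with that bit set.

```python
def drop_vendor_failed(sam_lines):
    """Yield SAM header lines and alignment lines that do not carry the 0x200 flag bit."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        # FLAG is the second tab-separated column of an alignment line.
        flag = int(line.split("\t", 2)[1])
        if not flag & 0x200:
            yield line


if __name__ == "__main__":
    # Toy records for illustration only.
    records = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t512\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t516\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    for kept in drop_vendor_failed(records):
        print(kept)
```

Because the reads in question are already aligned, filtering alone keeps the old alignments; redoing the alignment as recommended would mean extracting the remaining reads and realigning them.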

Thanks, your comments are very helpful!

Thanks for your comments. The sample was downloaded from a public project, GTEx. I'm also confused about why they deposited so many vendor-failed reads (10M for a 100M-read sample) on dbGaP. I didn't notice this problem at first in my study, and it is causing a big problem now. In your opinion, should such reads be removed regardless of read quality?

Short answer: yes.

They were "failed" by the Illumina pre-processing software for a reason (e.g. mixed sequence from one cluster, phasing issues, etc.).


Reposted from: http://wfhgx.baihongyu.com/
