Fact or Fiction – Noise Filtering

Distinguishing between fact and fiction is hard not only after a bottle of wine but also when staring at short read pileups day and night while analysing NGS data. When deciding whether a block of short reads represents a valid sequence, variant or whatsoever, or is just a pure artefact, the situation is similar to watching a tricky optical illusion where you have to figure out what is misleading your eyes.

[Image: elephant1 – optical illusion]

In practice noise is responsible for most of the necessary manual investigation: if sequencing data contained only valuable signal, algorithms would probably replace many human eyes. Noise can be practically anything that appears in the data to be analysed but is not present in the original patient DNA: random sequencing errors, PCR crossover artefacts, systematic enzyme errors, untrimmed or improperly trimmed adaptors and primers, and so on.

If we ask our friend Google what noise is, the following definition probably fits the picture best: ‘irrelevant or meaningless data or output occurring along with desired information’. This describes the nature of our enemy pretty well: ‘irrelevant’ translates to ‘sequences not part of the original sample DNA’ and ‘meaningless’ refers to ‘sequences that are inconsistent in some form’. These attributes usually help with data investigation in both automatic and manual analysis. The configuration of experimentally refined noise properties is a core element of filtering algorithms, and seasoned analysts often rely on similar checklists during sample analysis.

Inconsistency can appear in many different forms. The simplest and easiest to handle is random noise, because there the inconsistency can be identified relatively cheaply: isolating the problematic base or bases is straightforward, since few if any short reads present the same error.
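A minimal sketch of such a check might look like the snippet below. The pileup data, threshold values and function name are purely illustrative assumptions; real pipelines derive their cutoffs from the error profile of the technology and the expected allele fractions.

```python
from collections import Counter

# Hypothetical pileup: observed bases from all reads covering one position.
pileup_bases = ["A"] * 48 + ["G"] * 2  # 50x coverage, only 2 reads disagree

def looks_like_random_noise(bases, min_support=4, min_fraction=0.05):
    """Flag alternative bases supported by too few reads to be trusted.

    Thresholds here are made-up illustrations, not recommendations.
    """
    counts = Counter(bases)
    total = len(bases)
    consensus, _ = counts.most_common(1)[0]
    noisy = {}
    for base, n in counts.items():
        if base == consensus:
            continue
        if n < min_support or n / total < min_fraction:
            noisy[base] = n  # too little support -> probably a random error
    return consensus, noisy

consensus, noisy = looks_like_random_noise(pileup_bases)
print(consensus, noisy)  # A {'G': 2} -> the two G reads are likely random errors
```

The point is simply that when almost no reads agree on the deviation, the decision is cheap and can be fully automated.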

When inconsistency occurs on a somewhat larger scale – say we are looking at a block of 100 short reads – more powerful and also more expensive tools are needed for the analysis. These methods depend on external information to make decisions. One group of additional knowledge consists of well-known, technology-related properties such as strand bias, homopolymer error profiles and various types of imbalance. Another group of external input is formed by rules that enable inference mechanisms for detecting noise: in diploid samples, for example, when a signal appears to be a mixture of two other signals, the suspicion arises that a PCR crossover artefact is present.
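As an illustration of the first group, a strand-bias check can be sketched with a simple Fisher's exact test. The read counts and the p-value cutoff below are hypothetical; they only show the shape of the reasoning, not any particular pipeline's implementation.

```python
from scipy.stats import fisher_exact

# Hypothetical read counts at one candidate variant position.
# Rows: reference allele, alternative allele; columns: forward, reverse strand.
ref_fwd, ref_rev = 52, 48
alt_fwd, alt_rev = 19, 1   # the alternative allele is seen almost only on one strand

# A real heterozygous variant is expected on both strands; a strong skew like
# this is one of the technology-related red flags mentioned above.
odds_ratio, p_value = fisher_exact([[ref_fwd, ref_rev], [alt_fwd, alt_rev]])
if p_value < 0.01:  # illustrative cutoff, not a recommendation
    print(f"suspicious strand bias (p = {p_value:.3g}) – possible artefact")
else:
    print("no obvious strand bias")
```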

A third group of inconsistency cases should be mentioned as well, which we could call ‘hidden noise’. In some situations no rules or attributes can help with identifying the noise. A typical example is when the investigated region is homozygous except for a single weak variant, which might be either the result of a high allelic imbalance within a heterozygous sample or simply an artefact that is not present in the patient DNA. In homozygous regions the lack of the additional information provided by two separate phased signals makes the puzzle more complicated. For such scenarios a repeated experiment or the use of an alternative technology is often the only reliable solution.

Regardless of which noise filtering methods have been applied, in the end we expect something that is clear, consistent and matches our best knowledge about both the applied technology and the sample. If we see that there is no contradiction in the output and we have consumed most of the input and organized it into a consistent picture, then we probably have the right answer in our hands.

[Image: elephant2]

Testing the quality of the output is less complex than achieving good results in the first place. To build robust pipelines it is really important to make use of this general rule and add automated quality control mechanisms to the system. They should simplify the job and reduce the hands-on time necessary for data analysis – letting people easily judge whether the overall picture is okay or not.
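Such an automated QC gate can be as small as a handful of metric checks that roll up into a single verdict. The metric names and thresholds in this sketch are hypothetical assumptions, not taken from any particular pipeline.

```python
# Illustrative QC thresholds; real values depend on the assay and technology.
qc_thresholds = {
    "mean_coverage": 30.0,      # require at least 30x average depth
    "q30_fraction": 0.80,       # at least 80% of bases with quality >= 30
    "adapter_fraction": 0.05,   # at most 5% of reads with adapter content
}

def qc_verdict(metrics, thresholds=qc_thresholds):
    """Return per-metric pass/fail plus an overall verdict for quick review."""
    report = {
        "mean_coverage": metrics["mean_coverage"] >= thresholds["mean_coverage"],
        "q30_fraction": metrics["q30_fraction"] >= thresholds["q30_fraction"],
        "adapter_fraction": metrics["adapter_fraction"] <= thresholds["adapter_fraction"],
    }
    return report, all(report.values())

# Example run with made-up numbers for one sample.
sample_metrics = {"mean_coverage": 87.4, "q30_fraction": 0.91, "adapter_fraction": 0.02}
report, passed = qc_verdict(sample_metrics)
print(report, "PASS" if passed else "NEEDS MANUAL REVIEW")
```

A one-glance summary like this is exactly what lets an analyst decide quickly whether deeper manual investigation is needed at all.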

It is worth taking a note about the general NGS promise, which goes something like ‘highly redundant measurement combined with high base quality implies that noise is not an issue and human investigation is not necessary’. In fact this applies only to some noise types – mostly those related to sequencing itself – and a wide range of issues still has to be handled in the pipelines. On the other hand, without this level of redundancy algorithms would be stuck, and such efficient noise filtering and level of automation wouldn't be possible.

By György Horváth