Project description: As newborn screening programs transition from paper-based data exchange toward automated, electronic methods, significant data-exchange challenges must be overcome. This article outlines a data model that maps newborn screening data elements associated with patient demographics, birthing facilities, laboratories, result reporting, and follow-up care to the LOINC, SNOMED CT, ICD-10-CM, and HL7 healthcare standards. The framework lays the foundation for implementing standardized electronic data exchange across newborn screening programs, leading to greater data interoperability. Adopting this model can accelerate the implementation of electronic data exchange between healthcare providers and newborn screening programs, standardizing exchange across programs and ultimately improving health outcomes for all newborns.
Project description: Background: There are three main problems associated with the virtual screening of bioassay data: access to freely available curated data, the number of false positives that arise in the physical primary screening process, and the fact that the data are highly imbalanced, with a low ratio of Active to Inactive compounds. This paper first discusses these three problems, then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) to a variety of bioassay datasets. Results: Pharmaceutical bioassay data are not readily available to the academic community. The data held at PubChem are not curated, and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. Because of that missing cross-referencing, only a shallow analysis of false positives in the primary screening process was possible; in the six cases found, the average false-positive rate from the High-Throughput Primary screen was quite high, at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and the C4.5 decision tree learner performed relatively well. It was also found that the appropriate setting of the Weka cost matrix depends on the base classifier used, not solely on the ratio of class imbalance. Conclusions: Understandably, pharmaceutical data are hard to obtain. However, it would benefit both the pharmaceutical industry and academia if curated primary screening data and the corresponding confirmatory data were made available. Two benefits could be gained by applying virtual screening techniques to bioassay data: the search space of compounds to be physically screened is reduced, and analysis of the false positives that occur in the primary screening process may improve the technology.
The number of false positives arising from primary screening raises the question of whether this type of data should be used for virtual screening at all. Care is needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based purely on class ratios should not be used when comparing different classifiers on the same dataset.
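The cost-matrix point above can be illustrated with a small sketch. The paper uses Weka; here, as an assumption for illustration only, scikit-learn's `class_weight` parameter plays the role of Weka's cost matrix, and the same imbalance-ratio-derived cost is applied to two different base learners so their behaviour can be compared. The data are synthetic.

```python
# Illustrative sketch (not the paper's Weka setup): cost-sensitive learning
# on imbalanced data via per-class misclassification weights.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Synthetic imbalanced screen: ~5% "Active" (class 1), mimicking bioassay data.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# The same ratio-based cost ({0: 1, 1: 19}, i.e. the class imbalance) is tried
# on two base learners; per the paper, the best cost is classifier-dependent,
# so identical costs need not give comparable behaviour.
for name, clf in [("SVM", LinearSVC(class_weight={0: 1, 1: 19}, max_iter=5000)),
                  ("Tree", DecisionTreeClassifier(class_weight={0: 1, 1: 19},
                                                  random_state=0))]:
    clf.fit(X_tr, y_tr)
    print(name, "active-class recall:", recall_score(y_te, clf.predict(X_te)))
```

Weighting the minority class raises its recall at the cost of more false alarms on the majority class; tuning that trade-off per classifier is exactly the caution the conclusion urges.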
Project description: Scope: Jurisdiction-based Early Hearing Detection and Intervention Information Systems (EHDI-IS) collect data on the hearing screening and follow-up status of infants across the United States. These systems serve as tools that assist EHDI programs' staff and partners in their tracking activities and provide a variety of data reports to help ensure that all children who are deaf/hard of hearing (DHH) are identified early and receive recommended intervention services. The quality and timeliness of the data collected with these systems are crucial to meeting these goals effectively. Methodology: Forty-eight EHDI programs, funded by the Centers for Disease Control and Prevention (CDC), evaluated the accuracy, completeness, uniqueness, and timeliness of their hearing screening data, as well as the acceptability (i.e., willingness to report) of the EHDI-IS among data reporters (2013-2016). This article describes the evaluations conducted and presents the findings from these evaluation activities. Conclusions: Most state EHDI programs receive newborn hearing screening results from hospitals and birthing facilities in a consistent way, and data reporters are willing to report according to established protocols. However, additional efforts are needed to improve the accuracy and completeness of reported demographic data, of results for infants transferred from other hospitals, and of results for infants admitted to the Neonatal Intensive Care Unit.
Project description: In ultra-high-dimensional data analysis, it is extremely challenging to identify important interaction effects, and a top concern in practice is computational feasibility. For a data set with n observations and p predictors, the augmented design matrix including all linear and order-2 terms is of size n × (p² + 3p)/2. When p is large, say in the thousands, the number of interactions is enormous and beyond the capacity of standard machines and software tools for storage and analysis. In theory, interaction selection consistency is hard to achieve in high-dimensional settings: interaction effects have heavier tails and more complex covariance structures than main effects in a random design, making theoretical analysis difficult. In this article, we propose to tackle these issues with forward-selection-based procedures called iFOR, which identify interaction effects in a greedy forward fashion while maintaining the natural hierarchical model structure. Two algorithms, iFORT and iFORM, are studied. Computationally, the iFOR procedures are simple and fast to implement: no complex optimization tools are needed, since only OLS-type calculations are involved; the algorithms avoid storing and manipulating the whole augmented matrix, so the memory and CPU requirements are minimal; and the computational complexity is linear in p for sparse models, hence feasible for p ≫ n. Theoretically, we prove that they possess the sure screening property in ultra-high-dimensional settings. Numerical examples demonstrate their finite-sample performance.
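The key computational idea — generating interaction candidates on the fly from already-selected main effects so the full n × (p² + 3p)/2 matrix is never formed — can be sketched as follows. This is an illustrative simplification in the spirit of iFOR (with the strong-heredity candidate set of iFORM), not the authors' algorithm: the data, stopping rule (a fixed four steps), and RSS-based selection criterion are assumptions for the example.

```python
# Illustrative forward selection of hierarchical interactions: candidates
# are all unselected main effects plus pairwise products of already-selected
# mains, so only p + O(k^2) columns exist at step k, never the full
# n x (p^2 + 3p)/2 augmented matrix.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
# Sparse hierarchical truth: y = x0 + x1 + x0*x1 + noise.
y = X[:, 0] + X[:, 1] + X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(n)

def rss(design, y):
    """Residual sum of squares of an OLS fit (the only computation needed)."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return resid @ resid

selected = []                  # labels of chosen terms
design = np.ones((n, 1))       # start from the intercept-only model
for _ in range(4):             # fixed number of steps, for illustration
    sel_main = [j for kind, j in selected if kind == "main"]
    candidates = [(("main", j), X[:, j]) for j in range(p)
                  if ("main", j) not in selected]
    # Interactions only among selected mains: strong heredity.
    candidates += [(("inter", (a, b)), X[:, a] * X[:, b])
                   for a in sel_main for b in sel_main
                   if a < b and ("inter", (a, b)) not in selected]
    label, col = min(candidates,
                     key=lambda c: rss(np.column_stack([design, c[1]]), y))
    selected.append(label)
    design = np.column_stack([design, col])

print(selected)  # with this seed, the true mains and their interaction should appear
```

Each step costs only OLS fits on small designs, so the work grows linearly in p for sparse models, matching the feasibility claim for p ≫ n.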
Project description: Colorectal neoplasia causes bleeding, enabling detection using Faecal Occult Blood tests (FOBt). The National Health Service (NHS) Bowel Cancer Screening Programme (BCSP) guaiac-based FOBt (gFOBt) kits contain six sample windows (or 'spots'), and each kit returns a positive, unclear or negative result. Test kits with five or six positive windows are termed 'abnormal' and the subject is referred for further investigation, usually colonoscopy. If one to four windows are positive, the result is initially 'unclear' and up to two further kits are submitted; further positivity leads to colonoscopy ('weak positive'). If no further blood is detected, the test is deemed 'normal' and the subject is tested again in two years' time. We studied the association between spot positivity percentage (SP%) and neoplasia. Subjects in the Southern Hub completing the first of two consecutive episodes between April 2009 and March 2011 were studied. Each episode included up to three kits and therefore a maximum of 18 windows (spots). For each positivity combination, the percentage of positive spots out of the total number of spots completed by an individual in a single screening episode was derived and named 'SP%'. Fifty-five combinations of SP% can occur if the position of positive/negative spots on the same test card is ignored. The proportion of individuals in whom neoplasia was identified in Episode 2 was derived for each of the 55 spot combinations. In addition, the Episode 1 spot pattern was analysed for subjects with cancer detected in Episode 2. During Episode 2, 284,261 subjects completed gFOBt screening and colonoscopies were performed on 3891 (1.4%) of them. At colonoscopy, cancer was detected in 7.4% (n=286), and a further 39.8% (n=1550) had adenomas.
Cancer was detected in 21.3% of subjects with an abnormal first kit (five or six positive spots) and in 5.9% of those with a weak positive test result. The proportion of cancers detected was positively correlated with SP%, with a linear R² of 0.89. As SP% increased from 11% to 100%, the colorectal cancer (CRC) detection rate increased from 4% to 25%. At the lower SP% values, from 11% to 25%, the CRC risk was relatively static at ~4%; above an SP% of 25%, every 10-percentage-point increase in SP% was associated with an increase in cancer detection of 2.5 percentage points. This study demonstrated a strong correlation between SP% and cancer detection within the NHS BCSP. At the population level, subjects' cancer risk ranged from 4% to 25% and correlated with the gFOBt spot pattern. Some subjects with an SP% of 11% proceed to colonoscopy, whereas others with an SP% of 22% do not. Offering colonoscopy to patients with four positive spots in kit 1 (SP% 22%) would, we estimate, detect cancer in ~4% of cases and increase overall colonoscopy volume by 6%. This study also demonstrated how screening programme data can be used to guide ongoing implementation and inform other programmes.
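The SP% definition above is a simple ratio; a short helper (a hypothetical function, not from the study) makes the apparent anomaly concrete: four positive spots on kit 1 of a fully completed three-kit episode gives an SP% of ~22%, higher than a completed episode yielding an SP% of 11%, yet only the latter pattern currently triggers colonoscopy.

```python
# Hypothetical helper: spot positivity percentage (SP%) as defined in the
# study, i.e. positive spots as a share of all spots completed in one
# screening episode (up to 3 kits x 6 windows = 18 spots).
def spot_positivity_pct(positive_spots, total_spots):
    if not 0 < total_spots <= 18:
        raise ValueError("an episode comprises between 1 and 18 spots")
    if not 0 <= positive_spots <= total_spots:
        raise ValueError("positive spots cannot exceed spots completed")
    return 100 * positive_spots / total_spots

# Four positive spots across an 18-spot episode: the SP% 22% case above.
print(spot_positivity_pct(4, 18))
# Two positive spots across 18: the SP% 11% case that does reach colonoscopy.
print(spot_positivity_pct(2, 18))
```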
Project description: The National Toxicology Program is developing a high-throughput screening (HTS) program to set testing priorities for compounds of interest, to identify mechanisms of action, and potentially to develop predictive models for human toxicity. This program will generate extensive data on the activity of large numbers of chemicals in a wide variety of biochemical- and cell-based assays. The first step in relating patterns of response among batteries of HTS assays to in vivo toxicity is to distinguish between positive and negative compounds in individual assays. Here, the authors report on a statistical approach developed to classify compounds as positive or negative in an HTS cytotoxicity assay, based on data collected from screening 1353 compounds for concentration-response effects in 9 human and 4 rodent cell types. In this approach, the authors develop methods to normalize the data (removing bias due to the location of the compound on the 1536-well plates used in the assay) and to analyze for concentration-response relationships. Various statistical tests for identifying significant concentration-response relationships and for addressing reproducibility are developed and presented.
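The plate-position normalization step can be sketched with a common approach; the article's exact procedure is not reproduced here, so the median-centering scheme and all numbers below are assumptions for illustration.

```python
# Illustrative sketch of plate-position normalization (not the article's
# specific method): readings on a 1536-well plate are corrected by removing
# row and column medians, so spatial bias (e.g. edge effects) does not leak
# into the subsequent concentration-response analysis.
import numpy as np

rng = np.random.default_rng(1)
plate = rng.normal(100.0, 5.0, size=(32, 48))   # 32 x 48 = 1536 wells
plate[:, :4] += 20.0                            # artificial edge-column bias

row_med = np.median(plate, axis=1, keepdims=True)
col_med = np.median(plate, axis=0, keepdims=True)
# Median-centered residuals, shifted back to the plate's overall level.
normalized = plate - row_med - col_med + np.median(plate)
```

After correction, the column medians are nearly flat, so a compound's apparent activity no longer depends on where it sat on the plate.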
Project description: Health insurance is associated with increased utilization of cancer screening services. Data on breast, prostate and colorectal cancer screening were abstracted from the 2012 Behavioral Risk Factor Surveillance System. This data article includes two sets of analyses: (i) the use of cancer screening among individuals in the low-income bracket and (ii) determinants of each of the three approaches to colorectal cancer screening (fecal occult blood test, colonoscopy, and sigmoidoscopy plus fecal occult blood test). Covariates included educational attainment, residency, and access to a health care provider. The data supplement our original research article on the effect of Medicare eligibility on cancer screening utilization, "The impact of Medicare eligibility on cancer screening behaviors" [1].
Project description: Unbiased discovery approaches have the potential to uncover neurobiological insights into CNS disease and lead to the development of therapies. Here, we review lessons learned from imaging-based screening approaches and recent advances in these areas, including powerful new computational tools that synthesize complex data into more useful knowledge that can reliably guide future research and development.
Project description: Expression quantitative trait locus (eQTL) studies are a powerful tool for identifying genetic variants that affect messenger RNA levels. Since gene expression is controlled by a complex network of gene-regulating factors, one way to identify these factors is to search for interaction effects between genetic variants and the mRNA levels of transcription factors (TFs) and their respective target genes. However, identifying interaction effects in gene expression data poses a variety of methodological challenges, and it has become clear that such analyses should be conducted and interpreted with caution. Investigating the validity and interpretability of several interaction tests when screening for eQTL SNPs whose effect on target gene expression is modified by the expression level of a transcription factor, we characterized two important methodological issues. First, we stress the scale dependency of interaction effects and highlight that commonly applied transformations of gene expression data can induce or remove interactions, making interpretation of results more challenging. Second, we demonstrate that, for moderate to strong interaction effects of the order that may reasonably be expected in eQTL studies, standard interaction screening can be biased by the heteroscedasticity that true interactions themselves induce. Using simulation and real-data analysis, we outline a set of reasonable minimum conditions and sample size requirements for reliable detection of variant-by-environment and variant-by-TF interactions using the heteroscedasticity-consistent covariance-based approach.