Dataset Information

BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics.

ABSTRACT: Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings community effort to the task of automatically identifying and extracting protein-protein interactions (PPI) affected by mutations (PPIm), important in the precision medicine context for capturing individual genotype variation related to disease. We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient to recognise entities for these two tasks; the best existing mutation recognition tool achieves only 55% recall in the document triage training set, while relation extraction performance is limited by the low recall of gene entity recognition. We develop the models accordingly: for document triage, we develop term lists capturing interactions and mutations to complement BioNLP tools, and select effective features via a feature contribution study, whereas an ensemble of BioNLP tools is employed for relation extraction. Our best document triage model achieves an F-score of 66.77%, while our best model for relation extraction achieves an F-score of 35.09% over the final (updated post-task) test set. The characteristics of mutations differ statistically between the training and testing sets, which affects the document triage task. While a vital new direction for biomedical text mining research, this early attempt to tackle the problem of identifying genetic variation of substantial biological significance highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.

SUBMITTER: Chen Q

PROVIDER: S-EPMC6301335 | biostudies-other | 2018 Jan

REPOSITORIES: biostudies-other


Publications

BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics.

Chen Qingyu, Panyam Nagesh C, Elangovan Aparna, Verspoor Karin

Database: the journal of biological databases and curation, 2018-01-01


PMID: 30576491

Similar Datasets

Project description: The Precision Medicine Initiative is a multicenter effort that aims to formulate personalized treatments by leveraging individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of the BioCreative VI Precision Medicine Track was to extract this particular type of information. The track was organized in two tasks: (i) a document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations, and (ii) a relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods.
Given the level of participation and the individual team results, we find the precision medicine track to be successful in engaging the text-mining research community. In addition, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. This challenge also provided the first results of automatically identifying PubMed articles that describe PPIs affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
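
The F-scores quoted above combine precision and recall into a single number. As a minimal sketch, assuming the reported F-score is the balanced F1 (the harmonic mean of precision and recall), it could be computed as follows. The function name `f_score` and the `beta` parameter are illustrative and not taken from the track's evaluation scripts.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean (F1).

    Illustrative helper, not the official BioCreative VI evaluation code.
    Inputs are fractions in [0, 1], not percentages.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    denominator = beta ** 2 * precision + recall
    if denominator == 0.0:
        # Both precision and recall are zero; define the score as zero.
        return 0.0
    return (1.0 + beta ** 2) * precision * recall / denominator


# Hypothetical pairing of the best triage precision and best triage recall.
combined = f_score(0.6289, 0.98)
print(f"F1 if one system had both best values: {combined:.4f}")
```

Under the F1 assumption, pairing the best triage precision (62.89%) with the best triage recall (98.0%) would give a score above the reported best F-score of 69.06%, which suggests that the "best" figures come from different submissions rather than from a single system.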

Project description: Background: Due to various factors such as the increasing aging of the population and the upgrading of people's health consumption needs, the demand for rehabilitation medical care is expanding. Currently, China's rehabilitation medical care encounters several challenges, such as inadequate awareness and a scarcity of skilled professionals. Enhancing public awareness about rehabilitation and improving the quality of rehabilitation services are particularly crucial. Named entity recognition is an essential first step in information processing as it enables the automated extraction of rehabilitation medical entities. These entities play a crucial role in subsequent tasks, including information decision systems and the construction of medical knowledge graphs. Methods: To accomplish this objective, we construct the BERT-Span model to complete the Chinese rehabilitation medicine named entity recognition task. First, we collect rehabilitation information from multiple sources to build a corpus in the field of rehabilitation medicine, and fine-tune Bidirectional Encoder Representation from Transformers (BERT) with the rehabilitation medicine corpus. For the rehabilitation medicine corpus, we use BERT to extract the feature vectors of rehabilitation medicine entities in the text, and use the span model to complete the annotation of rehabilitation medicine entities. Results: Compared to existing baseline models, our model achieved the highest F1 value for the named entity recognition task in the rehabilitation medicine corpus. The experimental results demonstrate that our method outperforms the baselines in recognizing both long medical entities and nested medical entities in rehabilitation medical texts. Conclusion: The BERT-Span model can effectively identify and extract entity knowledge in the field of rehabilitation medicine in China, which supports the construction of the knowledge graph of rehabilitation medicine and the development of the decision-making system of rehabilitation medicine.

Project description: Esophagitis is a frequent condition that is poorly characterized at the molecular level, with diverse underlying etiologies and treatments. Correct diagnosis can be challenging due to partially overlapping histological features. By LC-MS/MS proteomic profiling of 55 biopsy specimens representing controls, reflux (GERD), eosinophilic (EoE), Crohn's (CD) and herpes simplex (HSV) esophagitis, as well as Candida albicans infection, we identified distinct signatures and functional networks. Our integrated AI-assisted morphoproteomic approach allows deeper insights into disease-specific molecular alterations and represents a promising tool in esophagitis-related precision medicine. The FFPE samples were further processed, including macrodissection, protein extraction, protein precipitation, protein digestion and peptide clean-up, according to a previously published and modified protocol (Buczak et al., 2020). For further details, please refer to the Materials and Methods section of the manuscript. For in-depth proteomic characterization of the samples, a label-free high-resolution LC-MS/MS approach using an Orbitrap Tribrid Fusion mass spectrometer (operated in DIA mode) was chosen. Tryptic peptides were loaded onto a µPAC Trapping Column (pillar diameter of 5 µm, inter-pillar distance of 2.5 µm, pillar length/bed depth of 18 µm, external porosity of 9%, bed channel width of 2 mm and length of 10 mm; pillars are superficially porous with a porous shell thickness of 300 nm and pore sizes in the order of 100 to 200 Å) at a flow rate of 10 µl per min in 0.1% trifluoroacetic acid in HPLC-grade water.
Peptides were eluted and separated on the PharmaFluidics µPAC nano-LC column (50 cm µPAC C18 with a pillar diameter of 5 µm, inter-pillar distance of 2.5 µm, pillar length/bed depth of 18 µm, external porosity of 59%, bed channel width of 315 µm and bed length of 50 cm; pillars are superficially porous with a porous shell thickness of 300 nm and pore sizes in the order of 100 to 200 Å) by a linear gradient from 2% to 30% of buffer B (80% acetonitrile and 0.08% formic acid in HPLC-grade water) in buffer A (2% acetonitrile and 0.1% formic acid in HPLC-grade water) at a flow rate of 300 nl per min. The remaining peptides were eluted by a short gradient of 10 minutes from 30% to 95% buffer B, followed by 25 minutes at 2% buffer B; the total gradient run was 120 min. Spectra were acquired in DIA mode using 50 variable-width windows over the mass range 350-1500 m/z. The Orbitrap was used for MS1 and MS2 detection, with the AGC target for MS1 set to 20 × 10^4 and the maximum injection time set to 100 ms. The MS2 scan range was set between 200 and 2000 m/z, with a minimum of 6 points across the peak. Orbitrap resolution for MS2 was set to 30K, the isolation window to 1.6, the AGC target to 50 × 10^4 and the maximum injection time to 54 ms. MS1 and MS2 data were acquired in centroid mode. To check for retention time (RT) stability, iRT standards (Biognosys) were spiked into each sample according to the manufacturer's recommendations. The 11 iRT peptide sequences were manually added to the database and used during the DIA-NN search to generate the precursor ion library used for MS data analysis. To reduce the possibility of carry-over and cross-contamination between the samples, two BSA washes were used between samples, and a trap column wash followed by two BSA washes was used every 10 samples. The above-mentioned workflow is schematically displayed in Fig. 1B. LC-MS/MS was performed at the Proteome Center Tuebingen (PCT).
Here, 250 ng of peptides were loaded onto an Easy-nLC 1200 system coupled to a quadrupole Orbitrap Exploris 480 mass spectrometer (all Thermo Fisher Scientific, Waltham, MA, USA) as previously described (Krauss et al., 2023). MS raw data files were analyzed using DIA-NN 1.8.1 (Demichev et al., 2020) in library-free mode against the human database (UniProt release March 2024, 20412 proteins). First, a precursor ion library was generated using FASTA digest for library-free search in combination with deep learning-based spectra prediction. An experimental library generated from the DIA-NN search was used for cross-run normalization and mass accuracy correction. Only high-accuracy spectra with a minimum precursor FDR of 0.01, and only tryptic peptides (2 missed tryptic cleavages), were used for protein quantification. The match-between-runs option was activated and no shared spectra were used for protein identification. Similarly, Normal, HSV and Candida samples were searched against reviewed entries of HSV1 (taxonomy id 10298, 125 entries), HSV2 (taxonomy id 10310, 95 entries) and C. albicans (taxonomy id 5476, 1412 entries) downloaded on 04.03.2024, in addition to the human database.

OmicsDI is part of the ELIXIR infrastructure
