Project description: This repository contains all the FASTQ files for the five data modalities (scRNA-seq, scATAC-seq, Multiome, CITE-seq+scVDJ-seq, and spatial transcriptomics) used in the article "An Atlas of Cells in The Human Tonsil," published in Immunity in 2024. Inspired by the TCGA barcodes, we have named each fastq file with the following convention:

[TECHNOLOGY].[DONOR_ID].[SUBPROJECT].[GEM_ID].[LIBRARY_ID].[LIBRARY_TYPE].[LANE].[READ].fastq.gz

This convention allows all metadata to be retrieved from the name itself. Here is a full description of each field:
- TECHNOLOGY: scRNA-seq, scATAC-seq, Multiome, CITE-seq+scVDJ-seq, and spatial transcriptomics (Visium). We also include the fastq files associated with the multiome experiments performed on two mantle cell lymphoma patients (MCL).
- DONOR_ID: identifier for each of the 17 patients included in the cohort. We provide the donor-level metadata in the file "tonsil_atlas_donor_metadata.csv", including the hospital, sex, age, age group, cause for tonsillectomy and cohort type for every donor.
- SUBPROJECT: each subproject corresponds to one run of the 10x Genomics Chromium™ Chip.
- GEM_ID: each run of the 10x Genomics Chromium™ Chip consists of up to 8 "GEM wells" (see https://www.10xgenomics.com/support/software/cell-ranger/getting-started/cr-glossary): a set of partitioned cells (Gel Beads-in-emulsion) from a single 10x Genomics Chromium™ Chip channel. We give a unique identifier to each of these channels.
- LIBRARY_ID: one or more sequencing libraries can be derived from a GEM well. For instance, multiome yields two libraries (ATAC and RNA) and CITE-seq+scVDJ yields 4 libraries (RNA, ADT, BCR, TCR).
- LIBRARY_TYPE: the type of library for each library_id. Note that we used cell hashing (10.1186/s13059-018-1603-1) for a subset of the scRNA-seq libraries, and thus the library_type can be "not_hashed", "hashed_cdna" (RNA expression) or "hashed_hto" (the hashtag oligonucleotides).
- LANE: to increase sequencing depth, each library was sequenced in more than one lane. Important: all lanes corresponding to the same sequencing library need to be input together to cellranger, because they come from the same set of cells.
- READ: for scATAC-seq we have three reads (R1, R2 or R3), see cellranger-atac's documentation.

While we find these names to be the most useful, they need to be changed to follow cellranger's conventions. We provide a code snippet in the README file of the GitHub repository associated with the tonsil atlas to convert between both formats (https://github.com/Single-Cell-Genomics-Group-CNAG-CRG/TonsilAtlas/).

Besides the fastq files, cellranger (and other mappers) require additional files, which we also provide in this repository:
- cell_hashing_metadata.csv: as mentioned above, we ran cell hashing (10.1186/s13059-018-1603-1) to detect doublets and reduce cost per cell. This file provides the sequence of the hashtag oligonucleotides in cellranger convention to allow demultiplexing.
- cite_seq_feature_reference.csv: similar to the previous file, this one links each protein surface marker to the hashtag oligonucleotide that identified it in the CITE-seq experiment.
- V10M16-059.gpr and V19S23-039.gpr: these correspond to the two slides of the two Visium experiments performed in the tonsil atlas. They are needed to run spaceranger.
- [GEM_ID]_[SLIDE]_[CAPTURE_AREA].jpg: 8 images associated with the Visium experiments. Here, GEM_ID refers to each of the 4 capture areas in each slide.
- [TECHNOLOGY]_sequencing_metadata.csv: the GEM-level metadata for each technology. It includes the relationship between subproject, gem_id, library_id, library_type and donor_id.
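The naming convention above lends itself to simple parsing. The following is a minimal sketch, not the snippet from the TonsilAtlas README: it assumes that no field contains a dot, and the example file names (donor, subproject, GEM and library identifiers, lane labels) are invented for illustration. It splits a name into its eight fields and groups files so that all lanes of one library can be passed to the mapper together.

```python
from collections import defaultdict
from typing import NamedTuple


class FastqName(NamedTuple):
    """The eight metadata fields encoded in a file name."""
    technology: str
    donor_id: str
    subproject: str
    gem_id: str
    library_id: str
    library_type: str
    lane: str
    read: str


def parse_fastq_name(filename: str) -> FastqName:
    """Split a file name into its eight dot-separated fields.

    Assumption: no individual field contains a dot.
    """
    suffix = ".fastq.gz"
    if not filename.endswith(suffix):
        raise ValueError(f"not a .fastq.gz file: {filename}")
    parts = filename[: -len(suffix)].split(".")
    if len(parts) != len(FastqName._fields):
        raise ValueError(f"expected 8 fields, got {len(parts)}: {filename}")
    return FastqName(*parts)


def group_lanes_by_library(filenames):
    """Group files by (gem_id, library_id, library_type) across lanes and reads."""
    groups = defaultdict(list)
    for name in filenames:
        meta = parse_fastq_name(name)
        groups[(meta.gem_id, meta.library_id, meta.library_type)].append(name)
    return {key: sorted(files) for key, files in groups.items()}


# Hypothetical file names, invented for illustration only.
example_files = [
    "scRNA-seq.DONOR1.SUBPROJ1.gem01.lib01.not_hashed.lane1.R1.fastq.gz",
    "scRNA-seq.DONOR1.SUBPROJ1.gem01.lib01.not_hashed.lane1.R2.fastq.gz",
    "scRNA-seq.DONOR1.SUBPROJ1.gem01.lib01.not_hashed.lane2.R1.fastq.gz",
    "scRNA-seq.DONOR1.SUBPROJ1.gem01.lib01.not_hashed.lane2.R2.fastq.gz",
]

for library, files in group_lanes_by_library(example_files).items():
    print(library, len(files), "files")
```

Grouping on the library fields rather than on the lane reflects the note above that all lanes of a library come from the same set of cells.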
These are the other repositories associated with the tonsil atlas:
- Expression and accessibility matrices: https://zenodo.org/records/10373041
- Seurat objects: https://zenodo.org/records/8373756
- HCATonsilData package: https://bioconductor.org/packages/release/data/experiment/html/HCATonsilData.html
- Azimuth: https://azimuth.hubmapconsortium.org/
- GitHub: https://github.com/Single-Cell-Genomics-Group-CNAG-CRG/TonsilAtlas
Project description: Background: The increasingly widespread use of next-generation sequencing protocols has created a need for user-friendly raw data processing tools. Here, we explore 2FAST2Q, a versatile and intuitive standalone program capable of extracting and counting feature occurrences in FASTQ files. Although 2FAST2Q was previously described as part of a CRISPRi-seq analysis pipeline, here we further elaborate on the program's functionality and its broader applicability. Methods: 2FAST2Q is built in Python, with published standalone executables for MS Windows, macOS, and Linux. It has a familiar user interface and uses an advanced custom sequence searching algorithm. Results: Using published CRISPRi datasets in which Escherichia coli and Mycobacterium tuberculosis gene essentiality, as well as host-cell sensitivity towards SARS-CoV-2 infectivity, were tested, we demonstrate that 2FAST2Q efficiently recapitulates published output in read counts per provided feature. We further show that 2FAST2Q can be used in any experimental setup that requires feature extraction from raw reads, being able to quickly handle Hamming distance-based mismatch alignments, nucleotide-wise Phred score filtering, custom read trimming, and sequence searching within a single program. Moreover, we exemplify how different FASTQ read filtering parameters impact downstream analysis, and suggest a default usage protocol. 2FAST2Q is easier to use and faster than currently available tools, efficiently processing not only CRISPRi-seq / random-barcode sequencing datasets on any up-to-date laptop, but also handling the advanced extraction of de novo features from FASTQ files. We expect that 2FAST2Q will be useful not only for people working in microbiology but also for other fields in which amplicon sequencing data is generated. 
2FAST2Q is available as an executable file for all current operating systems without installation and as a Python3 module on the PyPI repository (available at https://veeninglab.com/2fast2q).
| S-EPMC9615965 | biostudies-literature
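The kind of feature counting described in the 2FAST2Q entry above can be illustrated with a small sketch. This is not 2FAST2Q's own search algorithm or interface: the function names, the fixed read window (`start`, `length`), the default thresholds, and the toy reads are all assumptions made for illustration. It combines nucleotide-wise Phred filtering (Phred+33 encoding assumed) with Hamming distance-based matching against a set of known features.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_phred(quality: str, offset: int = 33) -> int:
    """Lowest Phred score in a quality string (Phred+33 assumed)."""
    return min(ord(c) - offset for c in quality)


def count_features(records, features, start, length,
                   max_mismatches=1, min_quality=30):
    """Count reads whose window [start, start+length) matches one feature.

    records:  iterable of (sequence, quality) tuples
    features: dict mapping feature name -> feature sequence
    A read is counted only if every base in the window passes the quality
    threshold and a single feature is the closest match within the
    allowed number of mismatches.
    """
    counts = Counter()
    for seq, qual in records:
        candidate = seq[start:start + length]
        if len(candidate) < length:
            continue  # read too short for the window
        if min_phred(qual[start:start + length]) < min_quality:
            continue  # at least one low-quality base in the window
        distances = {name: hamming(candidate, fseq)
                     for name, fseq in features.items()
                     if len(fseq) == length}
        if not distances:
            continue
        best = min(distances.values())
        closest = [name for name, d in distances.items() if d == best]
        if best <= max_mismatches and len(closest) == 1:
            counts[closest[0]] += 1
    return counts


# Toy data, invented for illustration only.
features = {"geneA": "ACGTACGT", "geneB": "TTTTCCCC"}
records = [
    ("GGACGTACGTAA", "IIIIIIIIIIII"),  # exact match to geneA
    ("GGACGTACGAAA", "IIIIIIIIIIII"),  # one mismatch to geneA
    ("GGTTTTCCCCAA", "II#IIIIIIIII"),  # low-quality base inside the window
    ("GGTTTTCCCCAA", "IIIIIIIIIIII"),  # exact match to geneB
]
counts = count_features(records, features, start=2, length=8)
print(dict(counts))
```

Discarding reads that are equally close to two features is one possible design choice; it trades a few lost reads for unambiguous assignments.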
Project description: Metagenomics sequencing raw fastq files
Project description: We apply hierarchical clustering (HC) of DNA k-mer counts on multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, whereas clustering by preparation group indicates the presence of batch effects. Hence, HC of DNA k-mer counts may serve as a diagnostic device. To provide a simple, easily applicable tool, we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. The approach is validated by analysis of Fastq file batches containing RNAseq data. Analysis of three Fastq batches downloaded from ArrayExpress indicated experimental effects. Analysis of RNAseq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced in our facility indicated the presence of batch effects. The observed batch effects were also present in reads mapped to the human genome and in reads filtered for high quality (Phred > 30). We propose that hierarchical clustering of DNA k-mer counts provides an unspecific diagnostic tool for RNAseq experiments. Further exploration is required once samples are identified as outliers in HC-derived trees.
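The idea of clustering samples by their k-mer content can be sketched in a few lines. This is an illustrative Python sketch, not the seqTools R package: the choice of k = 2, Euclidean distance on relative k-mer frequencies, average linkage, and the toy reads are all assumptions, and the sample names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(reads, k=2):
    """Relative frequency of every DNA k-mer over a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N etc.
                counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: dict mapping sample name -> k-mer profile
    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(euclidean(profiles[a], profiles[b])
                   for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        d, i, j = min((dist(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return merges


# Toy reads, invented for illustration: two AT-rich and two GC-rich samples.
samples = {
    "batch1_a": ["ATATATATAT", "AATTAATTAA"],
    "batch1_b": ["ATATTATATA", "TTAATTAATT"],
    "batch2_a": ["GCGCGCGCGC", "GGCCGGCCGG"],
    "batch2_b": ["CGCGCCGCGC", "CCGGCCGGCC"],
}
profiles = {name: kmer_profile(reads) for name, reads in samples.items()}
merges = average_linkage(profiles)
for left, right, distance in merges:
    print(left, "+", right, f"at distance {distance:.3f}")
```

In this toy example the samples sharing a base composition merge before the two groups are joined, which is the pattern that would flag a batch effect if the groups corresponded to preparation batches rather than experimental conditions.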
Project description: Background: High-throughput sequencing technologies have led to an unprecedented explosion in the amount of sequencing data available, which is typically stored in FASTA and FASTQ files. The literature describes several tools for processing and manipulating these types of files with the aim of transforming sequence data into biological knowledge. However, none of them is well suited to efficiently processing very large files, which are likely to reach the order of terabytes in the coming years, since they are based on sequential processing. Only some routines of the well-known seqkit tool are partly parallelized, and even then its scalability is limited to a few threads on a single computing node. Results: Our approach, BigSeqKit, takes advantage of a high-performance computing-Big Data framework to parallelize and optimize the commands included in seqkit with the aim of speeding up the manipulation of FASTA/FASTQ files. As a result, in most cases it is tens to hundreds of times faster than several state-of-the-art tools. At the same time, our toolkit is easy to use and install on any kind of hardware platform (local server or cluster), and its routines can be used as a bioinformatics library or from the command line. Conclusions: BigSeqKit is a very complete and ultra-fast toolkit to process and manipulate large FASTA and FASTQ files. It is publicly available at https://github.com/citiususc/BigSeqKit.
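The contrast between sequential and parallel processing drawn above can be illustrated with a chunked map-reduce sketch. This is not BigSeqKit's API or its underlying framework: all function names are hypothetical, and a standard-library thread pool stands in for a real parallel backend purely to show the pattern. In CPython, threads would not speed up this CPU-bound work; a process pool or a cluster framework would be needed for that.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fastq_records(lines):
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    for header, seq, _plus, qual in zip(*[iter(lines)] * 4):
        yield header.rstrip(), seq.rstrip(), qual.rstrip()


def chunked(records, size):
    """Split a stream of records into lists of at most `size` records."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def chunk_stats(chunk):
    """Map step: number of reads and bases in one chunk."""
    return Counter(reads=len(chunk),
                   bases=sum(len(seq) for _, seq, _ in chunk))


def parallel_stats(lines, chunk_size=2, workers=4):
    """Reduce step: sum the per-chunk statistics computed by the pool."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = chunked(fastq_records(lines), chunk_size)
        for partial in pool.map(chunk_stats, chunks):
            total += partial
    return total


# Toy FASTQ content, invented for illustration only.
example_lines = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGTAC", "+", "IIIIII",
    "@read3", "ACGTA", "+", "IIIII",
]
stats = parallel_stats(example_lines)
print(dict(stats))
```

Because each chunk is processed independently and the partial results are simply summed, the same structure carries over to backends that distribute chunks across processes or nodes.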