Determining soil particle-size distribution from infrared spectra using machine learning predictions: Methodology and modeling.
ABSTRACT: Accuracy of infrared (IR) models to measure soil particle-size distribution (PSD) depends on soil preparation, methodology (sedimentation, laser), settling times and relevant soil features. Compositional soil data may require isometric log-ratio (ilr) transformation to avoid numerical biases. Machine learning can relate numerous independent variables that may affect near-infrared (NIR) spectra to assess particle-size distribution. Our objective was to reach high infrared spectroscopy (IRS) prediction accuracy across a large range of PSD methods and soil properties. A total of 1298 soil samples from eastern Canada were IR-scanned. Spectra were processed by Stochastic Gradient Boosting (SGB) to predict sand, silt, clay and carbon. Slope and intercept of the log-log relationships between settling time and suspension density function (SDF) (R2 = 0.84-0.92) performed similarly to NIR spectra using either ilr-transformed (R2 = 0.81-0.93) or raw percentages (R2 = 0.76-0.94). Settling times of 0.67 min and 2 h were the most accurate for NIR predictions (R2 = 0.49-0.79). The NIR prediction of sand was more accurate for the sieving method (R2 = 0.66) than for the sedimentation method (R2 = 0.53). The NIR 2X gain was less accurate (R2 = 0.69-0.92) than the 4X gain (R2 = 0.87-0.95). The mid-infrared (MIR) spectra (R2 = 0.45-0.80) performed better than the NIR spectra (R2 = 0.40-0.71). Adding soil carbon, reconstituted bulk density, pH, red-green-blue color, oxalate and Mehlich3 extracts returned R2 values of 0.86-0.91 for texture prediction. In addition to slope and intercept of the SDF, 4X gain, method and pre-treatment classes, soil carbon and color appeared to be promising features for routine SGB-processed NIR particle-size analysis. Machine learning methods support cost-effective soil texture NIR analysis.
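To illustrate the ilr transformation mentioned in the abstract, the following is a minimal sketch for a three-part sand-silt-clay composition. The abstract does not state which orthonormal basis the authors used; the balances below (sand against silt and clay, then silt against clay) are an assumed choice, and the function names `ilr` and `ilr_inverse` are hypothetical.

```python
import math

def ilr(sand, silt, clay):
    """Map a 3-part composition to two ilr coordinates.

    Assumed basis (not taken from the source):
      z1 contrasts sand with the geometric mean of silt and clay,
      z2 contrasts silt with clay.
    """
    if min(sand, silt, clay) <= 0:
        raise ValueError("ilr requires strictly positive parts")
    z1 = math.sqrt(2 / 3) * math.log(sand / math.sqrt(silt * clay))
    z2 = math.sqrt(1 / 2) * math.log(silt / clay)
    return z1, z2

def ilr_inverse(z1, z2, total=100.0):
    """Back-transform ilr coordinates to parts that sum to `total`."""
    # Recover centred log-ratio (clr) values, which sum to zero.
    c1 = math.sqrt(2 / 3) * z1
    c2 = (-c1 + math.sqrt(2) * z2) / 2
    c3 = (-c1 - math.sqrt(2) * z2) / 2
    parts = [math.exp(c) for c in (c1, c2, c3)]
    scale = total / sum(parts)
    return tuple(p * scale for p in parts)

if __name__ == "__main__":
    z = ilr(60.0, 30.0, 10.0)          # percentages of sand, silt, clay
    print(z)
    print(ilr_inverse(*z))             # recovers the original percentages
```

A model would be trained on the two ilr coordinates and its predictions back-transformed, so that predicted sand, silt and clay always stay positive and sum to 100%.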
Project description:The use of wastewater irrigation for food crops can lead to presence of bioavailable phthalic acid esters (PAEs) in soils, which increase the potential for human exposure and adverse carcinogenic and non-cancer health effects. This study presents the first investigation of the occurrence and distribution of PAEs in a maize-wheat double-cropping system in a wastewater-irrigated area in the North China Plain. PAE levels in maize and wheat were found to be mainly attributed to PAE stores in soil coarse (250-2000 μm) and fine sand (53-250 μm) fractions. Soil particle-size fractions with higher bioavailability (i.e., coarse and fine sands) showed greater influence on PAE congener bioconcentration factors compared to PAE molecular structures for both maize and wheat tissues. More PAEs were allocated to maize and wheat grains with increased soil PAE storages from wastewater irrigation. Additional findings showed that levels of both non-cancer and carcinogenic risk for PAE congeners in wheat were higher than those in maize, suggesting that wheat food security should be prioritized. In conclusion, increased soil PAE concentrations specifically in maize and wheat grains indicate that wastewater irrigation can pose a contamination threat to food resources.
Project description:Mathematical descriptions of classical particle size distribution (PSD) data are often used to estimate soil hydraulic properties. Laser diffraction methods (LDM) now provide more detailed PSD measurements, but deriving a function to characterize the entire range of particle sizes is a major challenge. The aim of this study was to compare the performance of eighteen PSD functions for fitting LDM data sets from a wide range of soil textures. These models include five lognormal models, five logistic models, four van Genuchten models, two Fredlund models, a logarithmic model, and an Andersson model. The fits were evaluated using Akaike's information criterion (AIC), adjusted R2, and root-mean-square error (RMSE). The results indicated that the Fredlund models (FRED3 and FRED4) had the best performance for most of the soils studied, followed by one logistic growth function extension model (MLOG3) and three lognormal models (ONLG3, ORLG3, and SHCA3). The performance of most PSD models was better for soils with higher silt content and poorer for soils with higher clay and sand content. The FRED4 model best described the PSD of clay, silty clay, clay loam, silty clay loam, silty loam, loam, and sandy loam, whereas FRED3, MLOG3, ONLG3, ORLG3, and SHCA3 showed better performance for most soils studied.
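The three fit criteria named above (AIC, adjusted R2, RMSE) can be computed from observed and fitted cumulative fractions. The sketch below is a minimal illustration under stated assumptions: it uses the least-squares form of AIC, n ln(RSS/n) + 2k, and the adjusted R2 convention with n - k - 1 degrees of freedom; the abstract does not say which variants the study used, and `fit_metrics` is a hypothetical name.

```python
import math

def fit_metrics(observed, predicted, n_params):
    """Return (aic, adjusted_r2, rmse) for one fitted PSD function.

    Assumptions (not from the source):
      AIC = n * ln(RSS / n) + 2 * k      (least-squares form)
      adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
    """
    n = len(observed)
    if n != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    if n - n_params - 1 <= 0:
        raise ValueError("too few observations for this many parameters")
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    if rss == 0:
        raise ValueError("perfect fit: AIC is undefined in this form")
    mean_obs = sum(observed) / n
    tss = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - rss / tss
    aic = n * math.log(rss / n) + 2 * n_params
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)
    rmse = math.sqrt(rss / n)
    return aic, adjusted_r2, rmse

if __name__ == "__main__":
    obs = [1.0, 2.0, 3.0, 4.0]
    fit = [1.1, 1.9, 3.2, 3.8]
    print(fit_metrics(obs, fit, n_params=2))
```

Lower AIC and RMSE and higher adjusted R2 indicate a better fit; AIC and adjusted R2 both penalize models with more parameters, which matters when comparing three- and four-parameter functions.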
Project description:Machine learning has emerged as an invaluable tool in many research areas. In the present work, we harness this power to predict highly accurate molecular infrared spectra with unprecedented computational efficiency. To account for vibrational anharmonic and dynamical effects - typically neglected by conventional quantum chemistry approaches - we base our machine learning strategy on ab initio molecular dynamics simulations. While these simulations are usually extremely time consuming even for small molecules, we overcome these limitations by leveraging the power of a variety of machine learning techniques, not only accelerating simulations by several orders of magnitude, but also greatly extending the size of systems that can be treated. To this end, we develop a molecular dipole moment model based on environment dependent neural network charges and combine it with the neural network potential approach of Behler and Parrinello. Contrary to the prevalent big data philosophy, we are able to obtain very accurate machine learning models for the prediction of infrared spectra based on only a few hundreds of electronic structure reference points. This is made possible through the use of molecular forces during neural network potential training and the introduction of a fully automated sampling scheme. We demonstrate the power of our machine learning approach by applying it to model the infrared spectra of a methanol molecule, n-alkanes containing up to 200 atoms and the protonated alanine tripeptide, which at the same time represents the first application of machine learning techniques to simulate the dynamics of a peptide. In all of these case studies we find an excellent agreement between the infrared spectra predicted via machine learning models and the respective theoretical and experimental spectra.
Project description:Particulate phosphorus (PP) is often the largest component of the total phosphorus (P) load in stormwater. Fine-resolution measurement of particle sizes allows us to investigate the mechanisms behind the removal of PP in stormwater wetlands, since the diameter of particles influences the settling velocity and the amount of sorbed P on a particle. In this paper, we present a novel method to estimate PP, where we measure and count individual particles in stormwater and use the total surface area as a proxy for PP. Our results show a strong relationship between total particle surface area and PP, which we use to put forth a simple mechanistic model of PP removal via gravitational settling of individual mineral particles, based on a continuous particle size distribution. This information can help improve the design of stormwater best management practices to reduce PP loading in both urban and agricultural watersheds.
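How particle diameter drives both settling velocity and surface area can be sketched as follows. This is only an illustration: it assumes spherical particles and Stokes' law, neither of which the abstract confirms, and the default density and viscosity values (quartz-like mineral in water at about 20 °C) as well as the function names are assumptions.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def stokes_velocity(diameter_m, particle_density=2650.0,
                    fluid_density=1000.0, viscosity=1.0e-3):
    """Settling velocity (m/s) of a small sphere under Stokes' law.

    Densities in kg/m^3, dynamic viscosity in Pa*s (assumed defaults).
    """
    return (G * (particle_density - fluid_density) * diameter_m ** 2
            / (18.0 * viscosity))

def total_surface_area(diameters_m):
    """Summed surface area (m^2) of spheres with the given diameters."""
    return sum(math.pi * d ** 2 for d in diameters_m)

if __name__ == "__main__":
    for d_um in (2, 10, 50):
        d = d_um * 1e-6
        print(d_um, "um:", stokes_velocity(d), "m/s")
    print(total_surface_area([2e-6, 10e-6, 50e-6]), "m^2")
```

Under these assumptions both quantities scale with the square of the diameter, so fine particles settle slowly while contributing a large surface area per unit mass.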
Project description:The characteristics of particle size distribution (PSD) in newly formed coastal wetlands have seldom been studied. We applied fractal-scaling theory to assess soil PSD features of newly formed wetlands in the Yellow River Delta (YRD), China. The singular fractal dimension (D) values ranged from 1.82 to 1.90, the capacity dimension (D0) values ranged from 0.84 to 0.93, and the entropy dimension (D1) values ranged from 0.66 to 0.84. Constrained correspondence analysis revealed that 43.5% of the variance in soil PSD can be explained by environmental factors, including 14.7% by seasonal variation, 8.6% by soil depth, and 8.0% by vegetation type. The fractal dimensions D and D1 were sensitive to fine particles smaller than 126 μm, and D0 was sensitive to coarse particles between 126 μm and 2000 μm. Fractal analysis makes full use of soil PSD information and offers a useful approach to quantify and assess soil physical attributes in newly formed wetlands.
Project description:Lipoprotein profiling of human blood by 1H nuclear magnetic resonance (NMR) spectroscopy is a rapid and promising approach to monitor health and disease states in medicine and nutrition. However, lack of standardization of measurement protocols has prevented the use of NMR-based lipoprotein profiling in metastudies. In this study, a standardized NMR measurement protocol was applied in a ring test performed across three different laboratories in Europe on plasma and serum samples from 28 individuals. Data were evaluated in terms of (i) spectral differences, (ii) differences in lipoprotein distribution (LPD) predictions obtained using an existing prediction model, and (iii) agreement of predictions with cholesterol concentrations in high- and low-density lipoprotein (HDL and LDL) particles measured by standardized clinical assays. ANOVA-simultaneous component analysis (ASCA) of the ring test spectral ensemble that contains methylene and methyl peaks (1.4-0.6 ppm) showed that 97.99% of the variance in the data is related to subject, 1.62% to sample type (serum or plasma), and 0.39% to laboratory. This interlaboratory variation is in fact smaller than the maximum acceptable intralaboratory variation on quality control samples. It is also shown that the reproducibility between laboratories is good enough for the LPD predictions to be exchangeable when the standardized NMR measurement protocol is followed. With the successful implementation of this protocol, which results in reproducible prediction of lipoprotein distributions across laboratories, a step is taken toward bringing NMR more into scope of prognostic and diagnostic biomarkers, reducing the need for less efficient methods such as ultracentrifugation or high-performance liquid chromatography (HPLC).
Project description:Sedimentation has been a standard methodology for particle size analysis since the early 1900s. In recent years laser diffraction has begun to replace sedimentation as the preferred technique in some industries, such as marine sediment analysis. However, for the particle size analysis of soils, which have a diverse range of both particle size and shape, laser diffraction still requires evaluation of its reliability. In this study, the sedimentation-based sieve plummet balance method and the laser diffraction method were used to measure the particle size distribution of 22 soil samples representing four contrasting Australian Soil Orders. Initially, a precise wet riffling methodology was developed capable of obtaining representative samples within the recommended obscuration range for laser diffraction. It was found that repeatable results were obtained even if measurements were made at the extreme ends of the manufacturer's recommended obscuration range. Results from statistical analysis suggested that the use of sample pretreatment to remove soil organic carbon (and possible traces of calcium-carbonate content) made minor differences to the laser diffraction particle size distributions compared to no pretreatment. These differences were found to be marginally statistically significant in the Podosol topsoil and Vertosol subsoil. There are well known reasons why sedimentation methods may be considered to 'overestimate' plate-like clay particles, while laser diffraction will 'underestimate' the proportion of clay particles. In this study we used Lin's concordance correlation coefficient to determine the equivalence of laser diffraction and sieve plummet balance results. The results suggested that the laser diffraction equivalent thresholds corresponding to the sieve plummet balance cumulative particle sizes of < 2 μm, < 20 μm, and < 200 μm were < 9 μm, < 26 μm, and < 275 μm, respectively. The many advantages of laser diffraction for soil particle size analysis, and the empirical results of this study, suggest that deployment of laser diffraction as a standard test procedure can provide reliable results, provided consistent sample preparation is used.
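Lin's concordance correlation coefficient, used above to compare the two methods, can be computed as in the minimal sketch below. It uses the standard definition with population (1/n) variances; the function name `concordance_ccc` and the example values are hypothetical and are not data from the study.

```python
def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient between paired values.

    ccc = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    with population (1/n) variance and covariance.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)

if __name__ == "__main__":
    # Hypothetical clay percentages from two methods on the same samples.
    sedimentation = [12.0, 25.0, 40.0, 55.0]
    laser = [8.0, 20.0, 33.0, 47.0]
    print(concordance_ccc(sedimentation, laser))
```

Unlike Pearson correlation, the coefficient drops below 1 when one method is systematically offset from the other, which is why it suits questions of method equivalence.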
Project description:Analyzing the dynamics of soil particle size distribution (PSD) and erodibility is important for understanding the changes of soil texture and quality after cropland abandonment. This study aimed to determine how restoration age and latitude affect soil erodibility and the multifractal dimensions of PSD during natural recovery. We collected soil samples from grassland, shrubland, and forests with different restoration ages in the steppe zone (SZ), forest-steppe zone (FSZ), and forest zone (FZ). Various analyses were conducted on the samples, including multifractal analysis and erodibility analysis. Our results showed that restoration age had no significant effect on the multifractal dimensions of PSD (capacity dimension (D0), information dimension (D1), information dimension/capacity dimension ratio (D1/D0), correlation dimension (D2)) or on soil erodibility (K). Multifractal dimensions tended to increase, while soil erodibility tended to decrease, with restoration age. Latitude was negatively correlated with fractal dimensions (D0, D2) and positively correlated with K and D1/D0. During vegetation restoration, restoration age, precipitation, and temperature affect the development of vegetation, resulting in differences in soil organic carbon, total nitrogen, soil texture, and soil enzyme activity, which in turn alter soil structure and soil stability. This study revealed the impact of restoration age and latitude on soil erosion in the Loess Plateau.
Project description:Background: The use of visible-near infrared (vis-NIR) spectroscopy for rapid soil characterisation has gained a lot of interest in recent times. Soil spectral absorbance in the visible-infrared range can be calibrated using regression models to predict a set of soil properties. The accuracy of these regression models relies heavily on the calibration set. The optimum sample size and the overall sample representativeness of the dataset could further improve the model performance. However, there is no guideline on which sampling method should be used for datasets of different sizes. Methods: Here, we show that different sampling algorithms perform differently under different data sizes and different regression models (Cubist regression tree and Partial Least Squares Regression (PLSR)). We analysed the effect of three sampling algorithms, Kennard-Stone (KS), conditioned Latin Hypercube Sampling (cLHS) and k-means clustering (KM), against random sampling on the prediction of up to five different soil properties (sand, clay, carbon content, cation exchange capacity and pH) on three datasets. These datasets have different coverages: a European continental dataset (LUCAS, n = 5,639), a regional dataset from Australia (Geeves, n = 379), and a local dataset from New South Wales, Australia (Hillston, n = 384). Calibration sample sizes ranging from 50 to 3,000 were derived and tested for the continental dataset, and from 50 to 200 samples for the regional and local datasets. Results: Overall, PLSR gives better predictions than the Cubist model for the various soil properties. It is also less sensitive to the choice of sampling algorithm. The KM algorithm is more representative in the larger dataset up to a certain calibration sample size. The KS algorithm appears to be more efficient (as compared to random sampling) in small datasets; however, the prediction performance varied a lot between soil properties. The cLHS sampling algorithm is the most robust sampling method for multiple soil properties regardless of the sample size. Discussion: Our results suggested that the optimum calibration sample size relied on how much generalization the model had to create. The use of a sampling algorithm is more beneficial for larger datasets than for smaller datasets, where only small improvements can be made. KM is suitable for large datasets, KS is efficient in small datasets but results can be variable, while cLHS is less affected by sample size.
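Of the three sampling algorithms compared above, Kennard-Stone is the simplest to sketch. The version below is a minimal illustration of the usual max-min formulation with Euclidean distance on raw feature tuples; the study may have applied it to transformed spectra (for example principal component scores), which the abstract does not specify, and `kennard_stone` is a hypothetical name.

```python
import math

def kennard_stone(points, n_select):
    """Return indices of `n_select` calibration samples.

    Starts from the two most distant points, then repeatedly adds the
    point whose distance to its nearest selected point is largest.
    Builds a full distance matrix, so it suits small datasets only.
    """
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Seed with the most distant pair.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    # Distance from every point to its nearest selected point.
    nearest = [min(dist[k][first], dist[k][second]) for k in range(n)]

    while len(selected) < n_select:
        candidates = [k for k in range(n) if k not in selected]
        chosen = max(candidates, key=lambda k: nearest[k])
        selected.append(chosen)
        nearest = [min(nearest[k], dist[k][chosen]) for k in range(n)]
    return selected

if __name__ == "__main__":
    spectra = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]  # toy 1-D features
    print(kennard_stone(spectra, 3))
```

Because each new sample is the one farthest from those already chosen, the selection spreads across the feature space and favours extremes, which is consistent with the variable performance between soil properties reported above.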