Dataset Information

Crawling the German Health Web: Exploratory Study and Graph Analysis.

ABSTRACT:

BACKGROUND: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information as given in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer (1) to (3).

OBJECTIVE: This study demonstrates the suitability of a focused crawler for the acquisition of the German Health Web (GHW), which includes all health-related web content of the three predominantly German-speaking countries: Germany, Austria, and Switzerland. Based on the gathered data, we provide a preliminary analysis of the GHW's graph structure covering its size, its most important content providers, and the ratio of public to private stakeholders. In addition, we report our experiences in building and operating such a highly scalable crawler.

METHODS: A support vector machine classifier was trained on a large data set acquired from various German content providers to distinguish between health-related and non-health-related web pages. The classifier was evaluated using accuracy, recall, and precision on an 80/20 training/test split (TD1) and against a crowd-validated data set (TD2). To implement the crawler, we extended the open-source framework StormCrawler. The actual crawl was conducted for 227 days. The crawler was evaluated using harvest rate, and its recall was estimated using a seed-target approach.

RESULTS: In total, 22,405 seed URLs were collected from Curlie and a previous crawl, with the country-code top-level domains .de (85.36%, 19,126/22,405), .at (6.83%, 1530/22,405), and .ch (7.81%, 1749/22,405). The text classifier achieved an accuracy on TD1 of 0.937 (TD2=0.966), a precision on TD1 of 0.934 (TD2=0.954), and a recall on TD1 of 0.944 (TD2=0.989). The crawl yielded 13.5 million presumably relevant and 119.5 million nonrelevant web pages. The average harvest rate was 19.76%; recall was 0.821 (4105/5000 targets found). The resulting host-aggregated graph contains 215,372 nodes and 403,175 edges (network diameter=25; average path length=6.466; average degree=1.872; average in-degree=1.892; average out-degree=1.845; modularity=0.723). Among the 25 top-ranked pages for each country (according to PageRank), 40% (30/75) were websites published by public institutions, 25% (19/75) by nonprofit organizations, and 35% (26/75) by private organizations or individuals.

CONCLUSIONS: The results indicate that the presented crawler is a suitable method for acquiring a large fraction of the GHW. As desired, the computed statistical data allow for determining major information hubs and important content providers on the GHW. In the future, the acquired data may be used to assess important topics and trends, and also to build health-specific search engines.
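
The ratios reported in the results can be recomputed from the counts given in the abstract. The following is a minimal sketch for checking that arithmetic only; the variable and function names are illustrative assumptions, not part of the study's code. "Edges per node" is used here as a plain ratio, which happens to match the reported average degree.

```python
# Recompute ratios reported in the abstract from the counts it states.
# All names below are hypothetical and chosen for illustration.

def share(part: int, total: int) -> float:
    """Return part/total as a percentage, rounded to two decimals."""
    return round(100 * part / total, 2)

# Seed URLs per country-code top-level domain.
seed_total = 22_405
seeds = {".de": 19_126, ".at": 1_530, ".ch": 1_749}
seed_shares = {tld: share(count, seed_total) for tld, count in seeds.items()}

# Seed-target recall: targets found divided by targets sought.
recall = round(4_105 / 5_000, 3)

# Edges per node in the host-aggregated graph.
edges_per_node = round(403_175 / 215_372, 3)

print(seed_shares)
print(recall, edges_per_node)
```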

SUBMITTER: Zowalla R

PROVIDER: S-EPMC7414401 | biostudies-literature | 2020 Jul

REPOSITORIES: biostudies-literature


Publications

Crawling the German Health Web: Exploratory Study and Graph Analysis.

Richard Zowalla, Thomas Wetter, Daniel Pfeifer

Journal of Medical Internet Research, 2020 Jul 24, issue 7


PMID: 32706701

Similar Datasets

Project description: Background: The Dizziness Handicap Inventory (DHI) is a validated, self-report questionnaire that is widely used as an outcome measure. Previous studies supported the multidimensionality of the DHI, but not the original subscale structure. The objectives of this survey were to explore the dimensions of the Dizziness Handicap Inventory - German version, and to investigate the associations of the retained factors with items assessing functional disability and the Hospital Anxiety and Depression Scale (HADS). Second, we aimed to explore the retained factors according to the International Classification of Functioning, Disability and Health (ICF). Methods: Patients were recruited from a tertiary centre for vertigo, dizziness, or balance disorders. They filled in two questionnaires: (1) The DHI assesses precipitating physical factors associated with dizziness/unsteadiness and functional/emotional consequences of symptoms. (2) The HADS assesses non-somatic symptoms of anxiety and depression. In addition, patients answered the third question of the University of California Los Angeles-Dizziness Questionnaire, which covers the impact of dizziness and unsteadiness on everyday activities. Principal component analysis (PCA) was performed to explore the dimensions of the DHI. Associations were estimated by Spearman correlation coefficients. Results: One hundred ninety-four patients with dizziness or unsteadiness associated with a vestibular disorder, mean age (standard deviation) of 50.6 (13.6) years, participated. Based on eigenvalues greater than one or on the scree plot, we analysed several factor solutions. The 3-factor solution seems to be reliable and clinically relevant, and can partly be explained with the ICF. It explains 49.2% of the variance. Factor 1 comprises the effect of dizziness and unsteadiness on emotion and participation, factor 2 informs about specific activities or effort provoking dizziness and unsteadiness, and factor 3 focuses on self-perceived walking ability in relation to contextual factors. The first factor correlates moderately with disability and the HADS (values ≥0.6). The second factor is comparable with the original physical subscale of the DHI and factors retained in previous studies. Conclusions: The results of the present survey cannot support the original subscale structure of the DHI. Therefore, only the total scale should be used. We discuss a possible restructuring of the DHI.

Project description: BACKGROUND: Although searching for health information on the internet has offered clear benefits of rapid access to information for seekers such as patients, medical practitioners, and students, detrimental effects on seekers' experiences have also been documented. Health information overload is one such side effect, where an information seeker receives excessive volumes of potentially useful health-related messages that cannot be processed in a timely manner. This phenomenon has been documented among medical professionals, with consequences that include impacts on patient care. The use of the internet for health-related information, and particularly animal health information, among veterinary students has so far received far less research attention. OBJECTIVE: The purpose of this study was to explore veterinary students' internet search experiences to understand how students perceived the nature of Web-based information and how these perceptions influence their information management. METHODS: For this qualitative exploratory study, 5 separate focus groups and a single interview were conducted between June and October 2016 with a sample of 21 veterinary students in Ontario, Canada. RESULTS: Thematic analysis of focus group transcripts demonstrated one overarching theme, The Overwhelming Nature of the Internet, depicted by two subthemes: Volume and Type of Web-based Health Information and Processing, Managing, and Evaluating Information. CONCLUSIONS: Integrating electronic health information literacy training into human health sciences students' training has been shown to have positive effects on information management skills. Given a recent Association of American Veterinary Medical Colleges report that considers health literacy a professional competency, the results of this study point to a direction for future research and for institutions to contemplate integrating information literacy skills in veterinary curricula. Specifically, we propose that the information literacy skills should include knowledge about access, retrieval, evaluation, and timely application of Web-based information.

Project description: BACKGROUND: Patients diagnosed with melanoma frequently search the internet for treatment information, including novel and complex immunotherapy. However, health literacy is limited among half of the German population, and no assessment of websites on melanoma treatment has been performed so far. OBJECTIVE: The aim of this study was to identify and assess the most visible German-language websites on melanoma immunotherapy. METHODS: In accordance with the common Web-based information-seeking behavior of patients with cancer, Google, Yahoo, and Bing were searched for combinations of German synonyms for "melanoma" and "immunotherapy" in July 2017, and the first 20 hits of each search were retained. Websites that met our predefined eligibility criteria were considered for assessment. Three reviewers independently assessed their quality by using the established DISCERN tool and by checking the presence of quality certification. Usability and reliability were evaluated by the LIDA tool and understandability by the Patient Education Materials Assessment Tool (PEMAT). The Flesch Reading Ease Score (FRES) was calculated to estimate the readability. The ALEXA and SISTRIX tools were used to investigate the websites' popularity and visibility. The interrater agreement was determined by calculating Cronbach alpha. Subgroup differences were identified by t test, U test, or one-way analysis of variance. RESULTS: Of 480 hits, 45 single websites from 30 domains were assessed. Only 2 website domains displayed a German quality certification. The average assessment scores, mean (SD), were as follows: DISCERN, 48 (7.6); LIDA (usability), 40 (2.0); LIDA (reliability), 10 (1.6); PEMAT, 69% (16%); and FRES, 17 (14), indicating mediocre quality, good usability and understandability, but low reliability and very low readability of the included individual websites. SISTRIX scores ranged from 0 to 6872 and ALEXA scores ranged from 17 to 192,675, indicating heterogeneity of the visibility and popularity of German website domains providing information on melanoma immunotherapy. CONCLUSIONS: Optimization of the most accessible German websites on melanoma immunotherapy is desirable. In particular, the information needs to be made easier to read and brought further in line with reliability criteria to support the education of patients with melanoma and laypersons, and to enhance transparency.
