GeMTeX: German Medical Text Corpus

Executive Summary

Everyday clinical practice produces many texts, such as doctors’ letters and findings reports, that contain valuable information about a patient’s medical history, disease progression, and treatment. With the help of these texts, programs for automatic natural language processing (NLP) could support doctors and researchers in their work. However, the full potential of clinical documents cannot be exploited due to a lack of standardization. The German Medical Text Corpus (GeMTeX) method platform aims to close this gap by making medical texts from patient care available for research projects. The goal is to create the largest medical text corpus in the German language. Further details are available on the GeMTeX consortium website.

Project Partners


    Research Publications

    • Fox, S., Preiß, M., Borchert, F., Rasheed, A., Schapranow, M.-P.: HPIDHC at NTCIR-17 MedNLP-SC: Data Augmentation and Ensemble Learning for Multilingual Adverse Drug Event Detection. NTCIR 17 Conference: Proceedings of the 17th NTCIR Conference on Evaluation of Information Access Technologies. pp. 185–192. Tokyo, Japan (2023).
    • Borchert, F., Llorca, I., Schapranow, M.-P.: HPI-DHC @ BC8 SympTEMIST Track: Detection and Normalization of Symptom Mentions with SpanMarker and xMEN. In: Islamaj, R., Arighi, C., Campbell, I., Gonzalez-Hernandez, G., Hirschman, L., Krallinger, M., Lima-López, S., Weissenbacher, D., and Lu, Z. (eds.) Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models. New Orleans, LA (2023).
    • Borchert, F., Llorca, I., Roller, R., Arnrich, B., Schapranow, M.-P.: xMEN: A Modular Toolkit for Cross-Lingual Medical Entity Normalization. arXiv preprint arXiv:2310.11275. (2023).
    • Borchert, F., Llorca, I., Schapranow, M.-P.: Cross-Lingual Candidate Retrieval and Re-ranking for Biomedical Entity Linking. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Giachanou, A., Li, D., Aliannejadi, M., Vlachos, M., Faggioli, G., and Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. pp. 135–147. Springer Nature Switzerland, Cham (2023).
    • Llorca, I., Borchert, F., Schapranow, M.-P.: A Meta-dataset of German Medical Corpora: Harmonization of Annotations and Cross-corpus NER Evaluation. Proceedings of the 5th Clinical Natural Language Processing Workshop. pp. 171–181. Association for Computational Linguistics, Toronto, Canada (2023).
    • Kämmer, N., Borchert, F., Winkler, S., de Melo, G., Schapranow, M.-P.: Resolving Elliptical Compounds in German Medical Text. The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. pp. 292–305. Association for Computational Linguistics, Toronto, Canada (2023).



    BMBF Logo