Submission Date: 2018-03-03
Dear colleagues,

We are pleased to announce the 2018 recruitment of HEIBRiDS, a new research school in data science. Founded in 2018, HEIBRiDS is a cooperation between the Einstein Center Digital Future, involving four Berlin universities, and six Helmholtz Centers in the capital region (AWI, DESY, DLR, GFZ, HZB, and MDC). It offers projects in a collaborative research environment across disciplinary borders.

Doctoral training at the HEIBRiDS partner organizations covers the areas of:
- Machine Learning and Data Mining
- Modeling
- Novel Hardware Concepts
- Imaging
- Visualization

with applications including, but not limited to, genomics, structural biology, medicine, remote sensing, seismology, and earth and space sciences.

We are looking for highly motivated students (with a corresponding master's degree) with a background in the quantitative sciences (computer science, physics, statistics, or mathematics) or related applied subjects (bio- or geoinformatics, etc.).

We offer:
- Cutting-edge research in big data science under the close supervision of a team of scientific mentor and co-mentor
- An integrated training curriculum of scientific and professional skills, taught in English
- Financial support for conferences and collaborations
- Institutions that value diversity and promote an inclusive work environment
- Fully funded positions for 3 years, with the option of a 1-year extension
- Competitive salaries according to the German E13 pay scale

Interested candidates are invited to apply via our online application portal at https://heibrids.mdc-berlin.de/intern/start_start_for.php until March 3, 2018. Please distribute this announcement to potentially interested students or contacts of yours. Thank you very much in advance!

One example topic (for more, see the website):

End-to-End Management of Experimental Data Science Pipelines on Biomedical Molecular Data
Proposed by: Volker Markl, TU Berlin & Uwe Ohler, Helmholtz/MDC

1. PhD Topic & Short Description

Developing and applying data science methods typically involves specifying and executing complex data processing and analysis pipelines, comprising pre-processing steps, model building, and evaluation. Heterogeneous data sources and execution systems can introduce complex dependencies on data or even on processing architectures. When training a neural network, for example, sample transformations and preprocessing steps may be carried out with custom scripts, while the actual training may be executed on state-of-the-art systems such as TensorFlow or MXNet, or on scalable systems such as Spark and Flink.

In order to simplify and automate the data analysis process, including interactive and iterative data selection and hyperparameter tuning, it is imperative to declaratively specify such pipelines and to map them to potentially changing target systems and data sets. A declarative specification could enable automation and reproducibility of a data analysis process, and even help with detecting and validating properties of responsible data management [AMS16], such as fairness, transparency, or diversity.

Current data analysis pipelines lack holistic, declarative, end-to-end specifications, preventing automatic reproducibility, comparability, re-use of previous results and models, and testing of experiments for properties of responsible data management [SVK17, VBP12, BCF17]. Training performance and prediction quality critically depend on configuration parameters and hyperparameters, but metadata, lineage information, and results of experiments are not systematically tracked and stored in a structured manner. Rather, these parameters are determined ad hoc, or by using heuristics or explorative grid search anew for each pipeline.
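To make the idea of a declarative specification concrete, a pipeline could be expressed as plain data rather than as imperative code, so that it can be stored, versioned, re-executed on a different backend, and queried later. The following Python sketch is a minimal illustration only; all step names, parameters, and file names are hypothetical and not part of the proposal:

```python
# A pipeline declared as plain data (hypothetical schema for illustration).
import json

pipeline_spec = {
    "name": "rbp-binding-prediction",
    "data": {"source": "clip_seq_samples.tsv", "version": "2018-01"},
    "steps": [
        {"op": "normalize", "params": {"method": "log1p"}},
        {"op": "split", "params": {"test_fraction": 0.2, "seed": 42}},
        {"op": "train", "params": {"model": "logistic_regression",
                                   "hyperparameters": {"C": 1.0}}},
        {"op": "evaluate", "params": {"metric": "auroc"}},
    ],
}

# Because the specification is data, it round-trips through a serialized
# form unchanged -- the basis for reproducibility and later querying.
serialized = json.dumps(pipeline_spec, sort_keys=True)
restored = json.loads(serialized)
assert restored == pipeline_spec
```

A runner could then map such a specification to whichever target system (TensorFlow, Spark, Flink, custom scripts) is available, while the specification itself stays stable.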
In order to overcome these deficiencies and challenges, we propose the introduction of truly declarative specifications of such pipelines and the creation of a repository of declarative descriptions of machine learning experiments, together with their corresponding evaluation data, in an experiment database. We further propose to research and evaluate the optimization and automation of the data science process, both in multi-tenant environments and in the continuous deployment of machine learning pipelines, through several or all of the following aspects:

- Design an experiment database that contains detailed information about datasets, available machine learning algorithms, hyperparameters, and past experiments, including their lineage and results.
- Utilize the experiment database to optimize workload execution by querying and reusing "similar" past operations, automatically selecting the best hyperparameters for new pipelines, leveraging database optimization techniques such as caching, lazy evaluation, and multi-query optimization, as well as determining novel, workflow-specific optimizations.
- Utilize the experiment database to test for properties of responsible data management, such as transparency, fairness, diversity, and privacy, in order to prevent biases or other fallacies during interactive data science experimentation.
- Apply these concepts to community benchmarking efforts in which many algorithms are applied in a controlled setting, which will, for instance, allow for evaluating the influence of pre- or post-processing decoupled from the machine learning algorithms themselves. Examples include the biomedical "challenges" DREAM [dreamchallenges.org] and CAGI [genomeinterpretation.org].

2. Interdisciplinarity

Complex data processing pipelines arise in virtually all application domains of data science. In this work, we focus on the end-to-end management of machine learning tasks involving high-throughput molecular data.
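As a minimal illustration of the experiment-database idea proposed above, past runs could be recorded in a relational store and queried to reuse the best known configuration for a dataset instead of searching from scratch. The schema, table, and all recorded values below are hypothetical illustrations, not a proposed design:

```python
# Sketch of an experiment database: record past runs, then query them
# to reuse the best-performing hyperparameters (illustrative schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE experiments (
    id INTEGER PRIMARY KEY,
    dataset TEXT,
    algorithm TEXT,
    hyperparameters TEXT,   -- JSON-encoded configuration
    auroc REAL)""")

# Hypothetical past runs on the same dataset.
runs = [
    ("clip_seq_v1", "logistic_regression", '{"C": 0.1}', 0.81),
    ("clip_seq_v1", "logistic_regression", '{"C": 1.0}', 0.87),
    ("clip_seq_v1", "random_forest", '{"n_trees": 100}', 0.84),
]
conn.executemany(
    "INSERT INTO experiments (dataset, algorithm, hyperparameters, auroc) "
    "VALUES (?, ?, ?, ?)", runs)

# Reuse: fetch the configuration that achieved the best recorded score
# for this dataset (SQLite returns the row attaining MAX(auroc)).
best = conn.execute(
    "SELECT algorithm, hyperparameters, MAX(auroc) FROM experiments "
    "WHERE dataset = ?", ("clip_seq_v1",)).fetchone()
print(best)  # ('logistic_regression', '{"C": 1.0}', 0.87)
```

The same store could hold lineage and metadata columns, enabling the caching, multi-query optimization, and responsible-data-management checks described above to be phrased as ordinary queries.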
Tackling the problems outlined above with optimization and automation techniques inspired by database systems research represents a truly interdisciplinary endeavor: it requires understanding the requirements and challenges of a data science problem in one or more domains in order to generalize, abstract, and implement an experiment database, a metadata model, and potential optimizations and tools for responsible data management. In turn, the application domain benefits from rigorous experimental planning and from gaining information about domain-relevant performance aspects across multiple experiments and approaches.

3. Important publications related to the topic

[AKK15] A. Alexandrov, A. Kunft, A. Katsifodimos, Felix S., L. Thamsen, O. Kao, T. Herb, V. Markl, Implicit Parallelism through Deep Language Embedding, SIGMOD Conference 2015
[AMS16] S. Abiteboul, G. Miklau, J. Stoyanovich, G. Weikum, Data, Responsibly (Dagstuhl Seminar 16291), Dagstuhl Reports 6(7): 42-71, 2016
[BCF17] D. Baylor, E. Breck, H. Cheng, N. Fiedel, et al., The Anatomy of a Production-Scale Continuously-Training Machine Learning Platform, KDD 2017
[SVK17] E. R. Sparks, S. Venkataraman, T. Kaftan, M. J. Franklin, B. Recht, KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics, ICDE 2017
[VBP12] J. Vanschoren, H. Blockeel, B. Pfahringer, G. Holmes, Experiment Databases: A New Way to Share, Organize and Learn from Experiments, Machine Learning 87(2): 127-158, 2012
[GWCV+17] M. Gönen, B. A. Weir, G. S. Cowley, et al., A Community Challenge for Inferring Genetic Predictors of Gene Essentialities through Analysis of Cancer Cell Lines, Cell Syst, in press
[DWBB+17] R. Daneshjou, Y. Wang, et al., Working toward Precision Medicine: Predicting Phenotypes from Exomes in the Critical Assessment of Genome Interpretation (CAGI) Challenges, Hum Mutat 38(9), 2017
[GFYW+17] B. A. Grüning, J. Fallmann, D. Yusuf, S. Will, et al.,
The RNA Workbench: Best Practices for RNA and High-Throughput Sequencing Bioinformatics in Galaxy, Nucleic Acids Res, in press
[CMWU16] L. Calviello, N. Mukherjee, E. Wyler, U. Ohler, Detecting Actively Translated Open Reading Frames in Ribosome Profiling Data, Nature Methods, 2016
[KOD16] W. Kassuhn, U. Ohler, P. Drewe, Cseq-Simulator: A Data Simulator for Clip-seq Experiments, Pac Symp Biocomput 21: 433-44, 2016