Ioannis Koumarelas, PhD

Machine Learning Engineer
PhD in Data Quality

Mission Statement

Data scientist (PhD) with 5+ years of experience building and deploying production ML systems at scale in the life sciences domain. Proven track record designing scalable pipelines that process billions of data points and transforming research prototypes into production-grade systems. Deep expertise in data quality, entity resolution, and duplicate detection. Actively building skills in LLMs and agentic AI to deliver intelligent automation.

Experience

Senior Data Scientist / Data Scientist

Veeva Systems – Link Product

Senior Data Scientist (Mar 2024 – Feb 2026) · Data Scientist (Dec 2021 – Feb 2024)

  • Built scalable ML models for clustering medical activities into expert profiles, applying entity resolution and duplicate detection at production scale – processing billions of activity pairs per run and generating millions of automated profiles across US, EU, LATAM, and APAC regions.
  • Transformed exploratory Jupyter Notebook prototypes into production-ready PySpark and Airflow pipelines on AWS EMR, using MLflow for experiment tracking and model deployment and Docker and Kubernetes for containerized services; covered testing, monitoring, and CI/CD integration while collaborating with cross-functional engineering teams.
  • Assessed data quality using precision–recall metrics with threshold-based quality tiers, ensuring very high precision while substantially reducing manual curation costs.
  • Organized Data Science meetups, technical talks, and team activities to promote knowledge sharing and strengthen engineering culture.

Full-Stack Engineer / Data Engineer & Technical Team Co-Leader

HPI Schul-Cloud – Dataport

Full-Stack Engineer (Jan 2021 – Nov 2021) · Data Engineer / Technical Team Co-Leader (Apr 2020 – Dec 2020)

  • Built and maintained data pipelines for 300k+ educational assets, improving structure, reliability, and discoverability for end users.
  • Implemented systematic data preparation, cleaning workflows, and duplicate-detection methods to ensure data quality at scale.
  • Contributed across the full stack (Python, Vue.js, PostgreSQL, Docker, Kubernetes) to maintain and scale the educational platform.
  • Led technical requirements clarification, team operations, and onboarding during a multi-month organizational transition.

Research Consultant

SAP & SAP Concur

  • Developed 3 novel ML pipelines in Python and Java to improve duplicate detection, increasing matching success by 18%.
  • Delivered on-site technical tutorials at SAP Concur Seattle (USA) on data matching classification and pipeline optimization.

Technical Skills

Programming Languages

Python, SQL, Java, JavaScript, C/C++

ML & Data

PySpark, scikit-learn, PyTorch, Pandas, MLflow, Apache Spark, Apache Airflow

Infrastructure & DevOps

AWS (EMR, S3), Docker, Kubernetes, CI/CD, FastAPI, Git, pytest

Databases

PostgreSQL, MongoDB

Data Quality

Entity Resolution, Duplicate Detection, Record Linkage, Data Cleaning, Data Preparation

AI & LLMs

Large Language Models, LangChain, Agentic AI

Certificates

AI & LLM Engineering (Udemy)
Udemy ∙ October 2025

Completed a comprehensive series of courses on modern AI engineering practices between July and October 2025.

Generative AI with Large Language Models
Coursera ∙ July 2025

Three-week course covering the complete LLM lifecycle: Transformer architecture and pretraining, fine-tuning techniques including Parameter-Efficient Fine-Tuning (PEFT) with LoRA and soft prompts, and Reinforcement Learning from Human Feedback (RLHF). Explored Chain-of-Thought reasoning and the ReAct framework that underlies modern agentic AI systems.

Deep Learning Specialization
Coursera ∙ January 2021

Foundational Coursera specialization on deep learning, comprising the following courses:

  1. Neural Networks and Deep Learning
  2. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
  3. Structuring Machine Learning Projects
  4. Convolutional Neural Networks
  5. Sequence Models

This specialization gave me a holistic refresher on, and further expanded my knowledge of, core deep learning fundamentals and models.

Education

Intensive German Course – Levels A2.2, B1.1, B1.2

Die Neue Schule, Berlin
Intensive German language course in Berlin, progressing through levels A2.2, B1.1, and B1.2.

PhD in Computer Science – Data Preparation & Domain-Agnostic Duplicate Detection

Hasso Plattner Institute
Thesis on Data Preparation and Domain-Agnostic Duplicate Detection, supervised by Prof. Felix Naumann. Defended with distinction (Magna cum Laude). Published 7 papers in top-tier journals and conferences. Organized 6 project seminars on Duplicate Detection, Data Preparation, Blockchain, Text Mining, and Recommender Systems.

MSc Computer Science – Theta-Joins on MapReduce

Aristotle University of Thessaloniki
Implemented thesis in Python, Java, and Hadoop; published in a top-tier conference. Awarded a State Scholarship Foundation scholarship. Vice Chair of the local ACM Student Chapter. Participated in the ACM SIGMOD 2013 Programming Contest (streaming system in C++).

BSc Computer Science – Recommender System on MapReduce

Aristotle University of Thessaloniki
Implemented thesis in Java and Hadoop; published in a top-tier journal. Interned at the IT Center, performing system and database administration.
Read thesis (in Greek)

Languages

🇬🇷 Greek – Native
🇬🇧 English – Fluent
🇩🇪 German – Intermediate (B1)