Ioannis Koumarelas, PhD

Machine Learning Engineer
PhD in Data Quality

Veeva Systems

Mission Statement

Machine learning engineer (PhD) with 5+ years of experience building and deploying production ML systems at scale. Deep expertise in data quality, entity resolution, and duplicate detection. Actively exploring LLMs and agentic AI to deliver intelligent automation.

Experience

Data Scientist / Senior Data Scientist

Veeva Systems

Creating and updating profiles of medical experts (Health Care Professionals – HCPs) while continuously monitoring and ensuring high data quality.

  • Processing millions of HCP activities (e.g., publications) to create and update millions of professional profiles.
  • Analyzing data to assess quality and support decisions, training machine learning models on large-scale curated datasets, and deploying them to production.
  • Turning ideas into experiments in Jupyter Notebooks and building production-grade, resilient PySpark pipelines. Tech stack includes AWS, Apache Airflow, Kubernetes, Docker, and more.

Data Engineer / Full-Stack Engineer

HPI Schul-Cloud

Managed data workflows and improved content quality for the HPI Schul-Cloud platform, which delivers content to schools in multiple German states.

  • Imported and scraped data using Scrapy; managed the platform’s database and content infrastructure based on Tomcat, PostgreSQL, and Elasticsearch.
  • Improved data quality through validation, enrichment, and consistency checks.
  • Supported multiple teams, broadening my full-stack expertise across DevOps (Docker, Kubernetes), backend development (JavaScript, later TypeScript), and front-end development (Vue.js, Next.js), and wrote unit and end-to-end tests using Cucumber (Gherkin).

Research Consultant

SAP Concur

During the first three years of my PhD, in collaboration with SAP and in particular SAP Concur, we developed approaches for data cleaning and deduplication on hotel datasets provided by our partners at Concur. Of the several methodologies developed, the most notable led to two publications. Pursuing an applied PhD was a valuable way to gain hands-on experience.

Technical Skills

Programming Languages

Python, SQL, Java, JavaScript, C/C++

ML & Data

PySpark, scikit-learn, PyTorch, Pandas, MLflow, Apache Spark, Apache Airflow

Infrastructure & DevOps

AWS (EMR, S3), Docker, Kubernetes, CI/CD, FastAPI, Git, pytest

Databases

PostgreSQL, MongoDB

Data Quality

Entity Resolution, Duplicate Detection, Record Linkage, Data Cleaning, Data Preparation

AI & LLMs

Large Language Models, LangChain, Agentic AI

Certificates

AI & LLM Engineering (Udemy)
Udemy ∙ October 2025

Between July and October 2025, completed a comprehensive series of courses covering modern AI engineering practices.

Generative AI with Large Language Models
Coursera ∙ July 2025
Three-week course covering the complete LLM lifecycle: Transformer architecture and pretraining, fine-tuning techniques including Parameter-Efficient Fine-Tuning (PEFT) with LoRA and soft prompts, and Reinforcement Learning from Human Feedback (RLHF). Explored Chain-of-Thought reasoning and the ReAct framework that underlies modern agentic AI systems.

Deep Learning Specialization
Coursera ∙ January 2021

Foundational Coursera specialization on Deep Learning, comprising the following courses:

  1. Neural Networks and Deep Learning
  2. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
  3. Structuring Machine Learning Projects
  4. Convolutional Neural Networks
  5. Sequence Models

It provided a holistic refresher and deepened my knowledge of core Deep Learning fundamentals and models.

Education

PhD in Information Systems

Hasso Plattner Institute
Thesis on Data Preparation and Domain-Agnostic Duplicate Detection. Supervised by Prof. Felix Naumann. Published 7 papers in top-tier journals and conferences.

MSc Computer Science

Aristotle University of Thessaloniki
Specialized in Data Engineering and efficient calculation of Theta-Joins on large-scale data using Apache MapReduce.

BSc Computer Science

Aristotle University of Thessaloniki
Thesis on Recommender Systems on large-scale data using Apache MapReduce.
Read thesis (in Greek)

Languages

🇬🇷 Greek Native
🇬🇧 English Fluent
🇩🇪 German Beginner