Data Scientist | Multilingual NLP & Machine Learning
Data Scientist with hands-on experience in multilingual NLP, language identification, sentiment and emotion analysis, and robust model evaluation. I focus on building practical, well-evaluated machine learning systems with strong attention to error analysis, data quality, and real-world deployment challenges.
Built and evaluated a language identification system across 16 languages, focusing on balanced training data, tokenizer experimentation (SentencePiece), and robustness evaluation on code-switched and out-of-distribution text.
Designed synthetic code-switched datasets with controlled language-pair combinations and switching frequency to stress-test multilingual NLP models under realistic scenarios.
Worked on sentiment analysis beyond polarity, including detection of emotional and social cues such as empathy and politeness, using both traditional machine learning and transformer-based approaches in applied NLP settings.
Applied topic modelling techniques such as LDA and contextual embeddings to discover latent themes in large text corpora, with emphasis on topic interpretability, coherence, and handling noisy real-world data.
Built end-to-end machine learning pipelines covering data cleaning, preprocessing, feature engineering, model training, and evaluation, with a focus on reproducibility and industry best practices.
Developed a production-style machine learning pipeline for customer churn prediction, including preprocessing, feature engineering, model training, and evaluation.
Analyzed public sentiment in YouTube comments using NLP techniques and exploratory data analysis.
Implemented a traditional retrieval-augmented generation (RAG) approach to analyze and summarize RBI financial reports obtained from official sources.
📧 Email: er.preethypjohny@gmail.com
💼 LinkedIn: linkedin.com/in/preethy-p-johny