Preethy P Johny

About Me

Data Scientist with hands-on experience in multilingual NLP, language identification, sentiment and emotion analysis, and robust model evaluation. I focus on building practical, well-evaluated machine learning systems with strong attention to error analysis, data quality, and real-world deployment challenges.

Professional Projects

Multilingual Language Identification (Professional Project)

Built and evaluated a language identification system across 16 languages, focusing on balanced training data, tokenizer experimentation (SentencePiece), and robustness evaluation on code-switched and out-of-distribution text.

Code-Switched Dataset Creation (Professional Project)

Designed synthetic code-switched datasets with controlled language-pair combinations and switching frequency to stress-test multilingual NLP models under realistic scenarios.

Sentiment & Emotion Analysis (Professional Project)

Worked on sentiment analysis beyond polarity, including detection of finer-grained signals such as empathy and politeness, using both traditional machine learning and transformer-based approaches.

Topic Modelling (Professional Project)

Applied topic modelling techniques to discover latent themes in large text corpora using methods such as LDA and contextual embeddings, with emphasis on topic interpretability, coherence, and robustness to noisy real-world data.

End-to-End Machine Learning Pipeline (Professional Project)

Built end-to-end machine learning pipelines covering data cleaning, preprocessing, feature engineering, model training, and evaluation, with attention to reproducibility.

Selected GitHub Projects

End-to-End Customer Churn Prediction (GitHub)

Developed a production-style machine learning pipeline for customer churn prediction, including preprocessing, feature engineering, model training, and evaluation.

Sentiment Analysis on YouTube Comments (GitHub)

Analyzed public sentiment in YouTube comments using NLP techniques and exploratory data analysis.

Traditional RAG-based RBI Report Analysis (GitHub)

Implemented a traditional retrieval-augmented generation (RAG) approach to analyze and summarize RBI financial reports obtained from official sources.