
Anurag Pandey

@anurag_pandey

Databricks Certified Data Engineer Professional | Big Data & Cloud Solutions Expert

Bengaluru

https://www.linkedin.com/in/anurag-pandey-2524aba5/

Enquero | Databricks

Data engineer with over 9 years of experience building scalable solutions and optimizing data workflows on Databricks, AWS, and Snowflake. Passionate about data-driven innovation, with a record of projects that improved performance and significantly reduced costs. Committed to delivering tailored, high-impact solutions and refining processes in dynamic environments.

Experience

Senior Data Engineer

Enquero

Present

Client: Atlassian

• Developed a configuration-driven Python framework to automate Databricks job execution, dynamically processing SQL files per table for flexible (incremental/overwrite) ingestion into Delta/Parquet.
• Improved performance and scalability through environment-aware storage management, automated freshness validation, Z-ordering, and vacuuming.
• Built custom Airflow operators to streamline workflow automation and managed multiple Airflow DAGs orchestrating scheduled and backfill Databricks jobs.
• Developed a Python QA script in Databricks that connects via JDBC to compare table data between MySQL/HANA and Databricks using primary-key joins or set-difference logic, generating Excel reports of per-column discrepancies and cutting manual verification effort by 30%.
• Optimized Spark jobs by addressing data skew and recursive-query inefficiencies: filtered nulls and applied key salting to prevent task hang-ups, and introduced checkpointing to break lineage in recursive queries, reducing execution time from 6 hours to 15 minutes and cutting costs by up to 90%.
• Built a utility function that streams real-time task status to a streaming endpoint, improving monitoring and operational insight.
• Integrated Capterra and Impact data into existing pipelines via REST APIs, automating ingestion and improving data accessibility.
• Enhanced CI/CD by integrating Bitbucket with AWS S3 and Databricks; added functionality to trigger single-task Databricks jobs directly from Bitbucket via dynamic payload generation from configuration files, reducing staging-deployment complexity.
• Commended by Atlassian's Head of Data Engineering for contributions to the C360 CAR project: a successful beta launch, 95% of critical issues resolved, and a 20% faster release timeline.
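
The key-salting technique mentioned above can be sketched without Spark: split a skewed key into N synthetic keys so its records spread across tasks, aggregate the salted partials, then merge them back to the logical key. The keys, data, and bucket count below are hypothetical, purely for illustration:

```python
import random
from collections import defaultdict

SALT_BUCKETS = 8  # hypothetical fan-out for a hot key

def salt_key(key, hot_keys, buckets=SALT_BUCKETS):
    """Spread records of a skewed key across `buckets` synthetic keys."""
    if key in hot_keys:
        return f"{key}#{random.randrange(buckets)}"
    return key

def unsalt(key):
    """Strip the salt suffix so partial aggregates can be merged."""
    return key.split("#", 1)[0]

# Simulated skewed dataset: one key dominates the row count.
rows = [("us", 1)] * 1000 + [("de", 1)] * 10 + [("fr", 1)] * 5
salted = [(salt_key(k, hot_keys={"us"}), v) for k, v in rows]

# Stage 1: aggregate by salted key (in Spark, this work is spread
# across many tasks instead of one straggler).
partial = defaultdict(int)
for k, v in salted:
    partial[k] += v

# Stage 2: merge the partial aggregates back to the logical key.
final = defaultdict(int)
for k, v in partial.items():
    final[unsalt(k)] += v
```

The two-stage aggregation trades one extra (cheap) merge pass for even task sizes, which is why it removes the hang-ups on skewed joins and group-bys.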

Data Engineer

EPAM Systems

Client: Clarivate

• Migrated Snowflake stored procedures to PySpark SQL, reducing transformation costs and enabling aggregated consumption tables for end-user analytics.
• Developed ETL pipelines to ingest CSV data from FTP sources, cleanse it, and build reliable source-of-truth base tables.
• Implemented HIPAA-compliant data masking using hashing techniques to secure sensitive information.
• Used Docker and Kubernetes for containerization and orchestration, streamlining deployments and enabling scalable application infrastructure.
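
Hashing-based masking of the kind described above could look like this sketch (pure Python with `hashlib`; the pepper value and field names are hypothetical, and in PySpark the built-in `sha2` function serves the same purpose):

```python
import hashlib

SECRET_PEPPER = "example-pepper"  # hypothetical; keep in a secrets manager in practice

def mask_pii(value: str) -> str:
    """One-way hash of a PII field. Identical inputs map to identical
    tokens, so the masked column still supports joins and deduplication
    while the raw value cannot be recovered."""
    return hashlib.sha256((SECRET_PEPPER + value).encode("utf-8")).hexdigest()

record = {"patient_id": "12345", "diagnosis": "A10.9"}
masked = {**record, "patient_id": mask_pii(record["patient_id"])}
```

Peppering (a secret prefix) matters because bare hashes of low-entropy identifiers are trivially reversible by brute force.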

Senior Data Engineer

Enquero

India

Client: VMware

• Migrated SQL Server/HANA stored procedures to PySpark, improving query response efficiency by 30% by redesigning normalized tables as denormalized ones in Hadoop to minimize data shuffling.
• Crafted and refined complex PySpark scripts to replicate PostgreSQL procedures, including cursor handling and iterative logic for accurate transformations.
• Engineered an automated Python data-migration framework enabling seamless transfer of source data to Hadoop (ORC format via Sqoop) with minimal configuration.
• Optimized Sqoop and PySpark jobs by tuning mappers and partitioning, and applied salting, broadcasting, caching, and DISTRIBUTE BY to enhance performance.
• Conducted impact analyses to identify bottlenecks and implement targeted performance improvements, ensuring reliable benchmarking and system stability.
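
The broadcasting mentioned above can be illustrated without Spark: when a dimension table is small, shipping it whole to every worker and joining via an in-memory lookup (a map-side join) avoids shuffling the large fact table. Table names and data here are hypothetical:

```python
# Hypothetical small dimension table. In Spark this would be wrapped
# in broadcast(dim_df); here a plain dict plays the broadcast variable.
dim_products = {"p1": "laptop", "p2": "mouse"}

# Large fact table: (product_id, quantity) rows.
fact_sales = [("p1", 2), ("p2", 5), ("p1", 1)]

def map_side_join(facts, dim):
    """Join each fact row against the in-memory dimension lookup.
    No repartitioning or shuffle of the fact rows is required."""
    return [(pid, qty, dim.get(pid, "unknown")) for pid, qty in facts]

joined = map_side_join(fact_sales, dim_products)
```

The trade-off is memory: the dimension table must fit on every worker, which is why this only pays off for small lookup tables.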

Senior Developer

Mu Sigma Inc.

Bengaluru, India

• Developed a comprehensive framework to migrate upstream data from HANA to Hadoop using Spark and JDBC connectors, creating Hive external tables on top of the ingested data.
• Converted HANA stored procedures to PySpark, creating denormalized consumption tables that improved data-processing efficiency by 40% and streamlined end-user analytics.
• Replaced legacy middleware with a Python-based solution that dynamically generates queries from UI filter selections.
• Built a high-performance solution to precalculate 12,000 dynamic queries using Python multiprocessing and Spark, storing results as JSON for near-instantaneous UI responses and offloading live queries from HANA, reducing HANA load.
• Worked in an agile environment, continuously refining data flows, optimizing Sqoop and PySpark jobs, and conducting impact analyses to resolve performance bottlenecks.
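
The precalculation pattern above (fan a batch of pre-generated queries across a worker pool, cache results as JSON so the UI never hits the live database) can be sketched as follows. The query runner is a stub, the query count is reduced, and a thread pool stands in for the multiprocessing pool described in the original for portability:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def run_query(query_id: int) -> tuple[int, dict]:
    """Stand-in for executing one pre-generated query; the real job
    dispatched Spark SQL. Here it returns a deterministic fake result."""
    return query_id, {"rows": query_id % 10, "status": "ok"}

def precalculate(query_ids, workers=4):
    # Fan the query list out across the worker pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(run_query, query_ids))
    # Serialize once so the UI layer can serve cached JSON
    # instead of issuing live queries against HANA.
    return json.dumps(results)

payload = precalculate(range(12))  # 12 queries stand in for 12,000
cache = json.loads(payload)
```

Because the results are precomputed offline, UI latency becomes a JSON lookup rather than a live query, which is what offloads the source system.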

SQL Developer

IBM

Client: COOP Bank

• Worked on two marketing projects at COOP Bank, helping to improve campaign performance.
• Provided production support and built over 100 Oracle SQL and PL/SQL programs to keep systems running smoothly.

Education

Databricks

Databricks Certified Data Engineer Professional

Data Engineering

Databricks

Databricks Certified Data Engineer Associate

Data Engineering

Oracle

Oracle PL/SQL Developer Certified Associate

Database Development

Oracle

Oracle SQL 11g

Database Development

Licenses & Certifications

Databricks Certified Data Engineer Professional

Databricks

• No expiration

Databricks Certified Data Engineer Associate

Databricks

• No expiration

Oracle PL/SQL Developer Certified Associate

Oracle

• No expiration

Oracle SQL 11g

Oracle

• No expiration

Skills

PySpark
Python
Databricks
Kubernetes
Hive
Airflow
Docker
Delta Table
Query Optimization
AWS
S3
EC2
RDS
Presto
atomic
Unix Shell Scripting
Bitbucket
Git
JIRA
Sqoop
Data Warehousing
Data Modelling