data engineer

  • Built web crawlers to extract large volumes of real-estate and construction data
  • Developed an operational monitoring and alerting system to ensure scraper data quality
  • Mentored junior engineers, supporting them through pair programming
  • Performed code reviews to ensure quality and adherence to standards

senior data engineer

  • Analyzed mission data to predict future manpower requirements for operations using SQL, Python, and regression analysis (see the sketch below).
  • Analyzed existing data and created a new schema based on the new design.
  • Wrote stored procedures and triggers in SQL.
  • Deployed the application to test and production servers.
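
  A minimal sketch of this kind of regression analysis, assuming a hypothetical
  missions.csv extract with invented column names; this is not the production model:

      import pandas as pd
      from sklearn.linear_model import LinearRegression

      df = pd.read_csv("missions.csv")               # hypothetical extract
      X = df[["mission_count", "avg_duration_hrs"]]  # assumed predictor columns
      y = df["manpower_required"]                    # assumed target column

      model = LinearRegression().fit(X, y)
      print(dict(zip(X.columns, model.coef_)), model.intercept_)
      print(model.predict(X.tail(4)))                # forecast for recent periods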

data engineer

  • Developed data warehouse for Online Sales Europe Division
  • Automated SQL and Unix code generation
  • Participated in regular client visits and interactions in the retail fashion domain
  • Worked with K-Nearest Neighbors and Apriori algorithms for product recommendations, covering both content-based and collaborative filtering (see the sketch below)
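
  An illustrative item-based KNN recommender over a toy user-item rating matrix,
  using scikit-learn; the real pipeline and data are not shown here:

      import numpy as np
      from sklearn.neighbors import NearestNeighbors

      # rows = items, columns = users (toy ratings; 0 = not rated)
      ratings = np.array([
          [5, 3, 0, 1],
          [4, 0, 0, 1],
          [1, 1, 0, 5],
          [0, 0, 5, 4],
      ])

      knn = NearestNeighbors(metric="cosine").fit(ratings)

      # items most similar to item 0 by co-rating pattern (collaborative signal)
      dist, idx = knn.kneighbors(ratings[0:1], n_neighbors=3)
      print(idx[0][1:])  # skip the item itself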

data engineer

  • Wrote SQL queries to prepare managed datasets for modeling.
  • Implemented several natural language processing mechanisms for spam filtering and chatbots (a minimal spam-filter sketch follows this list).
  • Worked with NLTK, SciPy, and Polyglot for various NLP tasks.
  • Worked with Amazon Redshift clients such as SQL Workbench/J, pgAdmin, and DBHawk.
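
  A minimal bag-of-words spam-filter sketch using scikit-learn, with toy examples
  standing in for the real labeled corpus:

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline

      texts = ["win cash now", "meeting at noon", "free prize claim", "lunch tomorrow?"]
      labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

      clf = make_pipeline(CountVectorizer(), MultinomialNB())
      clf.fit(texts, labels)
      print(clf.predict(["claim your free cash"]))  # -> [1]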

data engineer

  • Fixed and resolved errors.
  • Used a high-correlation filter, a low-variance filter, and random forest importances for feature selection (see the sketch after this list).
  • Worked with time series forecasting to predict sales from historical data, feeding product recommendations.
  • Worked with various classification algorithms including Naïve Bayes, random forest, support vector machines, and logistic regression.
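
  A sketch of the three feature-selection filters named above, run on synthetic
  data; the 0.01 variance and 0.9 correlation thresholds are illustrative, not
  the project's actual values:

      import numpy as np
      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier

      rng = np.random.default_rng(0)
      X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
      X["f"] = X["a"] * 0.99 + rng.normal(scale=0.01, size=200)  # near-duplicate feature
      y = (X["a"] + X["b"] > 0).astype(int)

      X = X.loc[:, X.var() > 0.01]                 # low-variance filter

      corr = X.corr().abs()                        # high-correlation filter:
      upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
      X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

      rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
      print(sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1]))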

data engineer / machine learning engineer

  • Collected data from various sources, including an Oracle database server and the customer support department, and integrated them into a single dataset.
  • Used Python statistics and ML libraries including NumPy, pandas, scikit-learn, and Seaborn.
  • Applied data preprocessing techniques: checked whether data was normally distributed and applied log, Box-Cox, cube-root, and square-root transformations (see the sketch after this list).
  • Detected outliers and missing values using boxplots and built-in pandas functions, and treated them.
  • Used the Java machine learning library WEKA for data mining, data analysis, and predictive modeling.
  • Performed container-based deployments with Docker, working with Docker images and registries.
  • Worked with various dimensionality reduction techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Singular Value Decomposition (SVD), and Factor Analysis.
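
  A sketch of the normality check, transformations, and boxplot/IQR-style outlier
  treatment described above, run on synthetic skewed data rather than project data:

      import numpy as np
      import pandas as pd
      from scipy import stats

      s = pd.Series(np.random.default_rng(1).lognormal(size=500))
      print("skew before:", s.skew())

      log_t = np.log1p(s)              # log transform
      bc_t, lam = stats.boxcox(s)      # Box-Cox (requires positive values)
      print("skew after Box-Cox:", pd.Series(bc_t).skew(), "lambda:", lam)

      # IQR rule, the same fences a boxplot draws
      q1, q3 = s.quantile([0.25, 0.75])
      iqr = q3 - q1
      s_treated = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)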

data engineer

  • Participated in onsite-offshore meetings and discussions
  • Provided production support, ensuring loads completed on time and met SLAs
  • Designed and implemented a full cycle of data services
  • Analyzed user requirements and designed the database changes accordingly.

data engineer

  • Created Perl scripts to automate downloading production data files from FTP to data-processing machines.
  • Prepared test data files with Linux shell and Perl scripts per business requirements.
  • Developed reporting systems, tools and applications to facilitate management of content. 
  • Managed content processing jobs and data extracts for distribution.
  • Reported and logged bugs, following the bug life cycle and tracking issues.
  • Worked with users to review results and obtain user acceptance.
  • Enhanced existing functionality to meet users' changing needs.

data engineer

  • Handled mid-level data engineer responsibilities
  • Supported mid-level data engineers
  • Identify ways to improve data reliability, efficiency, and quality 
  • Align architecture with business requirements 
  • Develop, construct, test and maintain architectures 
  • Identify, troubleshoot and resolve complex production data integrity and performance issues 
  • Drive engineering best practices and set standards 

data engineer

  • Managed end-to-end process for updating and verifying special orders data
  • Analyzed inventory usage reports to avoid backordering
  • Developed mass update system to avoid manual updates to data warehouse. Trained other users on the program 
  • Expanded and modified the system to serve new purposes and improve workflow

data engineer

  • Collected, transformed, analyzed, and refined operational and customer data.
  • Built-out data structures designed to efficiently answer business questions.
  • Assisted in evolving data structures from an MSSQL Server footprint into a data lake environment.
  • Developed, implemented and tuned ETL processes.
  • Created pipelines from internal and external data sources to AWS using custom Python scripts (see the sketch below).
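
  The rough shape of such a pipeline script, assuming boto3; the bucket, key,
  and file names are placeholders:

      import boto3
      import pandas as pd

      def push_extract(csv_path: str, bucket: str, key: str) -> None:
          """Light transform, then land the file in S3 for downstream ETL."""
          df = pd.read_csv(csv_path)
          df.columns = [c.strip().lower() for c in df.columns]  # normalize headers
          df.to_csv("/tmp/clean.csv", index=False)
          boto3.client("s3").upload_file("/tmp/clean.csv", bucket, key)

      # push_extract("daily_orders.csv", "example-data-lake", "raw/orders/2020-01-01.csv")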

data engineer

  • Collaborated with software developers to design the core engine of recommender system for claim of service/warranty for an electronic manufacturing client.
  • Prototyped a tool to mitigate production risks considering employee attrition and retention steps. 
  • Ingested structured and unstructured data.
  • Built standard processes for data governance, data dictionaries, and data flows.

senior data engineer

  • Migrated SQL Server data to Hive using Sqoop and NiFi.
  • Set up a data pipeline for incremental imports from MSSQL Server into Hive with a 24-hour lag, processing 200 GB of data daily.
  • Installed a 40-node Cloudera cluster to accommodate 80 TB of historical data on Google Cloud infrastructure.
  • Installed a 10-node Apache Hadoop cluster to demonstrate a POC of Hadoop-stack technologies.
  • Used Python to analyze CSV data and provided reports to management (see the sketch below).
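
  A sketch of the CSV-analysis-to-report step; the file and column names are
  invented for illustration:

      import pandas as pd

      df = pd.read_csv("usage_export.csv", parse_dates=["event_date"])  # hypothetical file
      report = (df.groupby(df["event_date"].dt.to_period("M"))["bytes_processed"]
                  .agg(["sum", "mean", "count"]))
      report.to_csv("monthly_usage_report.csv")  # handed to management
      print(report.tail())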

data engineer

  • Identified the system data, hardware, and software components required to meet product requirements.
  • Worked with data labeling/annotation tools (Prodigy, Dataturks, Label Studio) and built a custom in-house annotation tool.
  • Worked on Google Cloud Platform with Ubuntu virtual machines.
  • Applied proficiency in SQL and relational DBMSs.

data engineer

  • Manage Snowflake Data Warehouse for Gartner Digital Markets team.
  • Developed a high-frequency data integration with BigQuery for ingesting sessions and hits data.
  • Provide advice on cost optimizations and process changes.
  • Set up Airflow as the ETL tool, migrating off the existing ETL tool in an effort to reduce costs by up to $25,000 (see the DAG sketch below).
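
  A hedged sketch of an Airflow DAG standing in for a migrated ETL job; the DAG
  id, schedule, and callables are placeholders, not the actual pipeline:

      from datetime import datetime
      from airflow import DAG
      from airflow.operators.python import PythonOperator

      def extract():  # placeholder extract step
          ...

      def load():     # placeholder load step
          ...

      with DAG("sessions_etl", start_date=datetime(2021, 1, 1),
               schedule_interval="@hourly", catchup=False) as dag:
          t1 = PythonOperator(task_id="extract", python_callable=extract)
          t2 = PythonOperator(task_id="load", python_callable=load)
          t1 >> t2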

senior data engineer

  • Involved in requirement gathering, analysis, design, coding, and deployment for business objects
  • Wrote complex queries in SQL Server, Netezza, and Snowflake per business requirements
  • Developed processes to download and process files from SFTP and Amazon S3
  • Built processes to download files via APIs
  • Developed ETL packages using SQL Server Integration Services (SSIS) to load files of different formats into the data warehouse
  • Built different reports in SQL Server Reporting Services (SSRS) for data visualization
  • Performed end-to-end testing for the system including unit and system testing

data engineer

  • Built a data warehouse solution for the new BSCS system.
  • Prepared and consolidated monthly post-paid and pre-paid revenue reports and related data in an Oracle environment.
  • Supported operational jobs, making sure all report figures were in range and jobs completed successfully.
  • Worked with NLTK, TensorFlow, spaCy, scikit-learn, Keras, and OpenCV.

data engineer

  • Used Microsoft Excel to create and analyze data.
  • Boosted chatbot responses by 10%.
  • Identified the major issues in the dataset.
  • Performed ad-hoc analysis and presented results clearly.

data engineer

  • Streamed data from assets continuously using Spark Streaming, which provides highly scalable, fault-tolerant stream processing (see the sketch after this list).
  • Processed the asset data using Spark's in-memory cluster-computing engine.
  • Sent processed data to a rule engine, validated it against customer-configured rules, and alerted the customer.
  • Stored processed data for further analysis and predictions using ML algorithms.
  • Sent the streaming data to a real-time dashboard for continuous monitoring.
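
  A sketch of this read-process-alert flow using Spark Structured Streaming; the
  broker, topic, schema, and threshold rule are illustrative assumptions:

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, from_json
      from pyspark.sql.types import DoubleType, StringType, StructType

      spark = SparkSession.builder.appName("asset-stream").getOrCreate()
      schema = StructType().add("asset_id", StringType()).add("temperature", DoubleType())

      raw = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
             .option("subscribe", "asset-telemetry")            # placeholder topic
             .load())

      events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")
      alerts = events.filter(col("temperature") > 90)  # stand-in for the rule engine

      alerts.writeStream.format("console").start().awaitTermination()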

data engineer

  • Designed Greenplum queries to be consumed in BI reports and Live Office.
  • Labeled data and performed feature selection for training machine learning models.
  • Built a custom Python AI package, NAIL (Nirveda AI Library), providing a supervised, structured, fast, and robust approach to building datasets and handling pre-processing, annotation, post-processing, and text verification.
  • Performed data mining and web scraping.

data engineer

  • Develop SQL scripts for a variety of reports, data corrections, and data migrations
  • Responsible for troubleshooting various computer issues and implementing solutions
  • Work closely with project manager to develop work plan for Data Warehouse projects as well as implementations
  • Develop framework, metrics and reporting to ensure progress can be measured, evaluated and continually improved
  • Support the development of performance dashboards that encompass key metrics to be reviewed with senior leadership and sales management
  • Work with application developers and DBAs to diagnose and optimize query performance
  • Build shell scripts to automate tasks

data engineer

  • Designed and implemented complete end-to-end big data pipelines using Sqoop, Flume, Spark, Kudu, and HDFS, and built real-time streaming using StreamSets.
  • Distributed, stored, and processed data in the Umniah Hadoop cluster; processed and queried structured and unstructured data using Spark SQL (see the sketch after this list); used Spark Streaming to process live data.
  • Enhanced the Umniah data warehousing system, built on Microsoft PDW technology (APS), creating efficient, flexible, and scalable ETL and ELT processes to handle data movement.
  • Extracted and transformed large datasets with Python scripts and loaded them with PowerShell.
  • Developed on-premises and cloud Power BI reports, migrated existing DataZen dashboards, reports, slicers, and KPIs into Power BI, and configured the gateway and Power BI Report Server.
  • Developed a Python tool to decode Huawei CDRs, making the decoding process more cost-effective.
  • Collected data requirements from the business team, translated them into datasets, and gathered reporting needs.
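
  An illustrative Spark SQL aggregation over structured data; the table, columns,
  and paths are invented for the example:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("cdr-aggregation").getOrCreate()
      spark.read.parquet("/data/cdrs/").createOrReplaceTempView("cdrs")  # placeholder path

      daily = spark.sql("""
          SELECT call_date, COUNT(*) AS calls, SUM(duration_s) AS total_seconds
          FROM cdrs
          GROUP BY call_date
      """)
      daily.write.mode("overwrite").parquet("/marts/daily_calls/")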

data engineer

  • Worked for an Australian telecommunications client on planning and budgeting reports using the Oracle Hyperion suite of applications.
  • Performed data transformation, loading, and reporting using ETL tools.
  • Configured and developed Live Office content based on BI reports.
  • Developed, enhanced, and debugged BO Universes per user requirements.

data engineer

  • Nirveda Cognition Inc. (Direct Hire) http://www.nirvedacognition.ai
  • Data Scientist | Data Annotator | ML Engineer | Artificial Intelligence
  • Python 3.5+, including NumPy, Pandas, and other mathematics and scientific libraries
  • ETL (Extract, Transform, Load) of large scale data sets
  • Web scraping (Selenium, Beautiful Soup, etc.)
  • Experience with Natural Language Processing, Machine Learning, Deep Learning, Computer Vision, etc.
  • Building large data sets by combining internal and client data with third-party or synthetic data

data engineer

  • Developed an Enterprise Data Warehouse (EDW) from the ground up, gathering data across different products and data sources.
  • Identified system data, hardware and software components required to meet the product requirements.
  • Played multiple roles, including data architect and ETL/BI developer.
  • Improved report execution by performance-tuning SQL queries and database tables and by aggregating the data.
  • Designed and developed an event-based scheduling service to deliver automated reports to users using MicroStrategy and Bash scripting.
  • Implemented self-serve analytics for business users, enabling them to quickly create and share dashboards.

data engineer

  • Work with project management to provide timely estimates, updates, and status.
  • Collect, handle, and document data for insights.
  • Coordinate with the support team and support management in data services.
  • Resolve clients' technical issues.

data engineer

  • Built a recommendation engine for video content types such as linear TV and VOD in Scala using Spark MLlib, applying collaborative filtering (including matrix factorization), content-based filtering, and RNNs, which initially improved click-through rate (CTR) by 0.5%; services include data ingestion (to HBase through Kafka), model training, and serving (through Redis). See the PySpark sketch after this list.
  • Built a churn prediction system that estimates each user's churn probability for upcoming months from video usage patterns, reducing churn by roughly 3%.
  • Instrumental in developing an end-to-end video analytics system for YuppTV and OTT tenants: collected raw data from multiple clients, processed it, and loaded it into the appropriate tables using Kafka, Elasticsearch, and Redshift (ELK architecture).
  • Served various analytics requests through intuitive dashboards for different types of users.
  • Gave tenants the flexibility to build their own dynamic dashboards (similar to a BI tool).
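
  A Python (PySpark) analogue of the Scala MLlib work above: ALS matrix
  factorization for collaborative filtering on toy (user, item, rating) triples;
  all parameters are illustrative:

      from pyspark.sql import SparkSession
      from pyspark.ml.recommendation import ALS

      spark = SparkSession.builder.appName("als-demo").getOrCreate()
      ratings = spark.createDataFrame(
          [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 3.0)],
          ["userId", "itemId", "rating"],
      )

      als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
                rank=8, maxIter=5, coldStartStrategy="drop")
      model = als.fit(ratings)
      model.recommendForAllUsers(2).show(truncate=False)  # top-2 items per user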

data engineer

  • AWS Certified Developer
  • As part of the data engineering team, worked on multiple projects and designed and proposed architectures for a couple of clients.
  • Hands-on experience with many AWS and GCP services and with multiple Apache and other open-source projects.
  • Was responsible for code development, review and deployment.

data engineer

  • Currently setting up a framework to enable customer segmentation, so the marketing team can make smart, segment-based decisions.
  • Responsible for the architecture and development of a data pipeline using Apache Kafka that consumes a huge volume of clickstream messages (see the consumer sketch after this list). Contributed to setting up the data warehouse and developed efficient applications for migrating application data into it. Set up monitoring of key metrics for deployed systems to guarantee high availability. Developed denormalized datasets for analysis using Apache Spark.
  • Designed and developed the architecture for the BookMyShow PWA. Optimized build times and load times for the mobile web app. Evaluated new technologies and applied them to solve performance bottlenecks. Implemented website features end to end, from developing React components to integrating them with APIs; developed APIs in Node.js; used Docker for automated deployment.
  • Contributed to developing the BookMyShow desktop website: built end-to-end features, from gathering requirements with product owners to building the UI in collaboration with designers, and integrated server-side APIs in core PHP.
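
  The minimal shape of such a clickstream consumer, assuming the kafka-python
  client; the broker, topic, and group id are placeholders:

      import json
      from kafka import KafkaConsumer  # kafka-python

      consumer = KafkaConsumer(
          "clickstream-events",                        # placeholder topic
          bootstrap_servers="broker:9092",             # placeholder broker
          group_id="warehouse-loader",                 # placeholder group
          value_deserializer=lambda v: json.loads(v.decode("utf-8")),
          auto_offset_reset="earliest",
      )

      for msg in consumer:
          event = msg.value
          # here: batch events and stage them for the warehouse load
          print(event.get("page"), event.get("user_id"))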

data engineer

  • Designed the data pipeline.
  • Chose the optimal distributed data store architecture.
  • Designed and converted about 30 BO WebI Reports from BOXI 3.1 to BI 4.0, connecting to Teradata.
  • Prepared report design documentation for reports to be developed and performed UAT/QA of BI reports.