You will help develop our batch data processing pipeline, built on Spark. This system (Redshift, Aurora, Scala Spark on EMR) is closely integrated with, and complementary to, our real-time processing system.
One major task will be working on our data deduplication algorithm, which processes billions of records: optimizing it and finding new methods to improve data quality. You will do this work in close collaboration with our data science team. Integration with our real-time processes will also be part of your assignment.
If you’re passionate about software development and want to join a dynamic company, please send your resume to firstname.lastname@example.org.