PySpark Developer - Complex XML Data Processing

5 - 9 years

7.0 - 11.0 Lacs P.A.

Chennai, Pune, Delhi, Mumbai, Bengaluru, Hyderabad, Kolkata

Posted: 2 months ago | Platform: Naukri


Skills Required

Computer Science, Orchestration, Memory Management, XML, Machine Learning, Schema, Scala, Data Processing, JSON, Analytics

Work Mode

Work from Office

Job Type

Full Time

Job Description

Senior PySpark Developer - Complex XML Data Processing

Key Responsibilities:

- Design and develop scalable PySpark pipelines to ingest, parse, and process XML datasets with extreme hierarchical complexity.
- Implement efficient XPath expressions, recursive parsing techniques, and custom schema definitions to extract data from nested XML structures.
- Optimize Spark jobs through partitioning, caching, and parallel processing to handle terabytes of XML data efficiently.
- Transform raw hierarchical XML data into structured DataFrames for analytics, machine learning, and reporting use cases.
- Collaborate with data architects and analysts to define data models for nested XML schemas.
- Troubleshoot performance bottlenecks and ensure reliability in distributed environments (e.g., AWS, Databricks, Hadoop).
- Document parsing logic, data lineage, and optimization strategies for maintainability.

Qualifications:

- 5+ years of hands-on experience with PySpark and Spark XML libraries (e.g., `spark-xml`) in production environments.
- Proven track record of parsing XML data with 20+ levels of nesting using recursive methods and schema inference.
- Expertise in XPath, XQuery, and DataFrame transformations (e.g., `explode`, `struct`, `selectExpr`) for hierarchical data.
- Strong understanding of Spark optimization techniques: partitioning strategies, broadcast variables, and memory management.
- Experience with distributed computing frameworks (e.g., Hadoop, YARN) and cloud platforms (AWS, Azure, GCP).
- Familiarity with big data file formats (Parquet, Avro) and orchestration tools (Airflow, Luigi).
- Bachelor's degree in Computer Science, Data Engineering, or a related field.

Preferred Skills:

- Experience with schema evolution and versioning for nested XML/JSON datasets.
- Knowledge of Scala or Java for extending Spark XML libraries.
- Exposure to Databricks, Delta Lake, or similar platforms.
- Certifications in AWS/Azure big data technologies.
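For candidates unfamiliar with the workflow this role describes, here is a minimal sketch of the kind of pipeline involved: reading nested XML with `spark-xml`, declaring an explicit schema rather than relying on inference, flattening one level with `explode`, and applying the partition/cache optimizations the posting mentions. The file paths, tag names, field names, package version, and partition count below are illustrative assumptions, not part of the job description.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from pyspark.sql.types import (
    ArrayType, DoubleType, StringType, StructField, StructType,
)

spark = (
    SparkSession.builder
    .appName("xml-flatten-sketch")
    # spark-xml is an external package; the version here is an assumption
    # and only takes effect when the session is first created.
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0")
    .getOrCreate()
)

# Explicit schema for a hypothetical <order> element containing a repeated
# <item> child. On deeply nested documents (20+ levels), a declared schema
# avoids the cost and fragility of schema inference over the raw XML.
item_schema = StructType([
    StructField("sku", StringType()),
    StructField("price", DoubleType()),
])
order_schema = StructType([
    StructField("orderId", StringType()),
    StructField("items", StructType([
        StructField("item", ArrayType(item_schema)),
    ])),
])

orders = (
    spark.read.format("xml")
    .option("rowTag", "order")                 # each <order> becomes one row
    .schema(order_schema)
    .load("s3://example-bucket/orders/*.xml")  # illustrative path
)

# Flatten one level of nesting: one output row per <item>,
# keeping the parent key alongside the exploded struct fields.
flat = (
    orders
    .select("orderId", explode(col("items.item")).alias("item"))
    .select("orderId", "item.sku", "item.price")
)

# Repartition on a downstream join/write key and cache before reuse, in the
# spirit of the optimization bullet; the column and count are assumptions.
flat = flat.repartition(200, "orderId").cache()
flat.write.mode("overwrite").parquet("s3://example-bucket/orders_flat/")
```

In practice, each additional nesting level is handled the same way: select into the struct, `explode` the repeated element, and carry parent keys forward so lineage survives the flattening.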

Technology / Software Development
