Download PDF of Apache Spark Interview Questions

51. Define Spark architecture

Ans: Spark uses a master/worker architecture. There is a driver that talks to a single coordinator called master that manages workers in which executors run. The driver and the executors run in their own Java processes.

Premium Training : Spark Full Length Training : with Hands On Lab

52. What is the purpose of Driver in Spark Architecture?

Ans: A Spark driver is the process that creates and owns an instance of SparkContext. It is your Spark application that launches the main method in which the instance of SparkContext is created.

· Drive splits a Spark application into tasks and schedules them to run on executors.

· A driver is where the task scheduler lives and spawns tasks across workers.

· A driver coordinates workers and overall execution of tasks.

Premium : Cloudera Hadoop and Spark Developer Certification Material

53. Can you define the purpose of master in Spark architecture?

Ans: A master is a running Spark instance that connects to a cluster manager for resources. The master acquires cluster nodes to run executors.

 

54. What are the workers?

Ans: Workers or slaves are running Spark instances where executors live to execute tasks. They are the compute nodes in Spark. A worker receives serialized/marshalled tasks that it runs in a thread pool.

Premium Training : Spark Full Length Training : with Hands On Lab

55. Please explain, how worker’s work, when a new Job submitted to them?

Ans: When SparkContext is created, each worker starts one executor. This is a separate java process or you can say new JVM, and it loads application jar in this JVM. Now executors connect back to your driver program and driver send them commands, like, foreach, filter, map etc. As soon as the driver quits, the executors shut down

 Databricks Spark 2.x Developer Certification    Databricks PySpark 2.x (Python Spark) Certification Exam     Oreilly Databricks Spark Certification     Hortonworks HDPCD Spark Certification     Cloudera CCA175 Hadoop and Spark Developer Certifications     MapR V2 Spark Developer Certification Exam

 

56. Please define executors in detail?

Ans: Executors are distributed agents responsible for executing tasks. Executors provide in-memory storage for RDDs that are cached in Spark applications. When executors are started they register themselves with the driver and communicate directly to execute tasks. [112]

 

57. What is DAGSchedular and how it performs?

Ans: DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling, i.e. after an RDD action has been called it becomes a job that is then transformed into a set of stages that are submitted as TaskSets for execution.

Premium : Hortonworks Spark Developer Certification Material (HDPCD:Spark)

DAGScheduler uses an event queue architecture in which a thread can post DAGSchedulerEvent events, e.g. a new job or stage being submitted, that DAGScheduler reads and executes sequentially.

 

58. What is stage, with regards to Spark Job execution?

Ans: A stage is a set of parallel tasks, one per partition of an RDD, that compute partial results of a function executed as part of a Spark job.

 

59. What is Task, with regards to Spark Job execution?

Ans: Task is an individual unit of work for executors to run. It is an individual unit of physical execution (computation) that runs on a single machine for parts of your Spark application on a data. All tasks in a stage should be completed before moving on to another stage.

· A task can also be considered a computation in a stage on a partition in a given job attempt.

· A Task belongs to a single stage and operates on a single partition (a part of an RDD).

· Tasks are spawned one by one for each stage and data partition.

 

60. What is Speculative Execution of a tasks?

Ans: Speculative tasks or task strugglers are tasks that run slower than most of the all tasks in a job.

 

Speculative execution of tasks is a health-check procedure that checks for tasks to be speculated, i.e. running slower in a stage than the median of all successfully completed tasks in a taskset . Such slow tasks will be re-launched in another worker. It will not stop the slow tasks, but run a new copy in parallel.

Premium Training : Spark Full Length Training : with Hands On Lab

Spark Professional Training   Spark SQL Hands Training   PySpark : HandsOn Professional Training    Apache NiFi (Hortonworks DataFlow) Training   Hadoop Professional Training   Cloudera Hadoop Admin Training Course-1  HBase Professional Traininghttp  SAS Base Certification Hands On Training OOzie Professional Training     AWS Solution Architect : Training Associate

AWS Exam Prepare : Kinesis Data Stream   Free Core Java 1Z0-808 Training   Scala Professional Training   Python Professional Training  Read Spark SQL Fundamental and Cookbookhttps://sites.google.com/training4exam.com/spark-sql-2-x-fundamentals/  Book : AWS Solution Architect Associate : Little Guide  NiFi CookBook By HadoopExam  AWS Security Specialization Certification: Little Guide SCS-C01     Spark Interview Questions