How to show the execution progress bar in Spark 2.3 and 2.4



The original text is reproduced below.

Original source:

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sparkcontext-ConsoleProgressBar.html

ConsoleProgressBar

ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr. It uses SparkStatusTracker to poll the status of stages periodically and print out active stages with more than one task. It keeps overwriting the same line, showing at most the first 3 concurrent stages at a time.

[Stage 0:====>          (316 + 4) / 1000][Stage 1:>                (0 + 0) / 1000][Stage 2:>                (0 + 0) / 1000]

The progress line includes the stage id and the numbers of completed, active, and total tasks.
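As an illustration, one such progress cell can be rendered from those four numbers. The following is a hypothetical sketch in the spirit of the output shown above, not Spark's actual implementation:

```scala
// Hypothetical sketch: render one progress cell in the style of
// ConsoleProgressBar: [Stage <id>:====>   (<completed> + <active>) / <total>]
def renderStage(stageId: Int, completed: Int, active: Int, total: Int, width: Int = 40): String = {
  val header = s"[Stage $stageId:"
  val tailer = s" ($completed + $active) / $total]"
  // Remaining space in the cell is filled by the bar itself.
  val barWidth = width - header.length - tailer.length
  val filled = if (total > 0) barWidth * completed / total else 0
  val arrow = if (completed < total) ">" else ""
  header + ("=" * filled) + arrow + (" " * math.max(0, barWidth - filled - arrow.length)) + tailer
}

println(renderStage(0, 316, 4, 1000))
```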

Tip

ConsoleProgressBar may be useful when you ssh to workers and want to see the progress of active stages.



ConsoleProgressBar is created when SparkContext starts with spark.ui.showConsoleProgress enabled and the logging level of the org.apache.spark.SparkContext logger at WARN or higher (i.e. fewer messages are printed out, so there is "space" for ConsoleProgressBar).

import org.apache.log4j._
Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)

To print the progress nicely, ConsoleProgressBar uses the COLUMNS environment variable to determine the width of the terminal. When COLUMNS is not set, it assumes 80 columns.
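That width lookup can be sketched as a tiny pure function (an illustrative helper based on the description above, not Spark's actual code):

```scala
// Resolve the terminal width as described: use the COLUMNS environment
// variable when it is set to a valid number, otherwise fall back to 80.
def terminalWidth(env: Map[String, String] = sys.env): Int =
  env.get("COLUMNS")
    .flatMap(s => scala.util.Try(s.trim.toInt).toOption)
    .getOrElse(80)

println(terminalWidth(Map.empty))                  // falls back to 80
println(terminalWidth(Map("COLUMNS" -> "120")))    // honors the variable
```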

The progress bar prints out the status after a stage has run for at least 500 milliseconds, and then refreshes every spark.ui.consoleProgress.update.interval milliseconds.
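If the refresh rate needs tuning, the interval can be overridden at launch. This is a sketch assuming a standard Spark distribution; spark.ui.consoleProgress.update.interval is the property named in the text above, with the value in milliseconds:

```shell
# Enable the bar and refresh it once per second instead of the default interval.
./bin/spark-shell \
  --conf spark.ui.showConsoleProgress=true \
  --conf spark.ui.consoleProgress.update.interval=1000
```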

Note

The initial delay of 500 milliseconds before ConsoleProgressBar shows the progress is not configurable.

See the progress bar in Spark shell with the following:

$ ./bin/spark-shell --conf spark.ui.showConsoleProgress=true  (1)

scala> sc.setLogLevel("OFF")  (2)

scala> import org.apache.log4j._
import org.apache.log4j._

scala> Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)  (3)

scala> sc.parallelize(1 to 4, 4).map { n => Thread.sleep(500 + 200 * n); n }.count  (4)
[Stage 2:>                                                          (0 + 4) / 4]
[Stage 2:==============>                                            (1 + 3) / 4]
[Stage 2:=============================>                             (2 + 2) / 4]
[Stage 2:============================================>              (3 + 1) / 4]
  1. Make sure spark.ui.showConsoleProgress is true. It is by default.

  2. Disable (OFF) the root logger (which includes Spark's logger).

  3. Make sure the org.apache.spark.SparkContext logger is at least WARN.

  4. Run a job with 4 tasks with 500ms initial sleep and 200ms sleep chunks to see the progress bar.

In short:

1. If you write your code in an IDE such as IntelliJ IDEA or Eclipse, two steps are needed:

1) Before creating the SparkSession / SparkConf, add the following code:

import org.apache.log4j._
Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)

2) When creating the SparkSession, set spark.ui.showConsoleProgress to true.
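Putting the two steps together, a minimal application sketch could look like this (assuming Spark 2.3/2.4 and log4j on the classpath; the object and app names are illustrative):

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object ProgressBarDemo {
  def main(args: Array[String]): Unit = {
    // Step 1: quiet the SparkContext logger BEFORE the session is created,
    // so the console has a free line for the progress bar.
    Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)

    // Step 2: enable the console progress bar explicitly.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("console-progress-bar-demo")
      .config("spark.ui.showConsoleProgress", "true")
      .getOrCreate()

    // A job whose stage runs longer than the 500 ms initial delay,
    // so the bar actually appears on stderr.
    spark.sparkContext
      .parallelize(1 to 4, 4)
      .map { n => Thread.sleep(500 + 200 * n); n }
      .count()

    spark.stop()
  }
}
```

Run it with spark-submit (or directly from the IDE) and watch stderr for the [Stage …] line.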

2. If you use spark-shell, do the following:

$ ./bin/spark-shell --conf spark.ui.showConsoleProgress=true  (1)

scala> sc.setLogLevel("OFF")  (2)

scala> import org.apache.log4j._
import org.apache.log4j._

scala> Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)  (3)

scala> sc.parallelize(1 to 4, 4).map { n => Thread.sleep(500 + 200 * n); n }.count  (4)
[Stage 2:>                                                          (0 + 4) / 4]
[Stage 2:==============>                                            (1 + 3) / 4]
[Stage 2:=============================>                             (2 + 2) / 4]
[Stage 2:============================================>              (3 + 1) / 4]