
performance problem when getting schema from collection #163

Open
cyjj opened this issue Sep 27, 2016 · 1 comment

Comments

@cyjj

cyjj commented Sep 27, 2016

When loading data from Mongo into Spark, I noticed a weird performance issue in the Spark UI. The job is split into two stages: the first is flatMap at MongodbSchema.scala:41, the second is aggregate at MongodbSchema.scala:47. My problem is that the first stage always gets one task on one executor, which is painfully slow on a big collection. Sometimes the flatMap stage takes an hour while the next one takes only a few seconds. The source code is below:

 override def schema(): StructType = {
    // Sample the collection unless the sampling ratio is effectively 1.0
    val schemaData =
      if (samplingRatio > 0.99) rdd
      else rdd.sample(withReplacement = false, samplingRatio, 1)

    // Map every document to (field name, data type) pairs, merge the types
    // per field with compatibleType, then fold the results into StructFields
    val structFields = schemaData.flatMap {
      dbo => {
        val doc: Map[String, AnyRef] = dbo.seq.toMap
        val fields = doc.mapValues(f => convertToStruct(f))
        fields
      }
    }.reduceByKey(compatibleType).aggregate(Seq[StructField]())(
        (fields, newField) => fields :+ StructField(newField._1, newField._2),
        (oldFields, newFields) => oldFields ++ newFields)
    StructType(structFields)
  }

It looks like this just grabs the schema from the collection. I don't know why this stage is limited to one executor. Is this normal, or is there something I can do to increase the number of executors and get better performance? I am working with Spark 1.6.2 and Stratio 0.11.0.
Thanks.
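
One possible workaround, as an untested sketch only (it assumes you can patch or subclass MongodbSchema, and the partition count 64 is a hypothetical value to tune for your cluster), is to repartition the sampled RDD before the flatMap, so schema inference is spread across more tasks:

 override def schema(): StructType = {
    // Repartitioning is the only change from the original method; it
    // forces the flatMap stage to run on more than one task
    val schemaData =
      (if (samplingRatio > 0.99) rdd
       else rdd.sample(withReplacement = false, samplingRatio, 1))
        .repartition(64) // hypothetical count; tune to your executor cores

    val structFields = schemaData.flatMap {
      dbo => {
        val doc: Map[String, AnyRef] = dbo.seq.toMap
        doc.mapValues(f => convertToStruct(f))
      }
    }.reduceByKey(compatibleType).aggregate(Seq[StructField]())(
        (fields, newField) => fields :+ StructField(newField._1, newField._2),
        (oldFields, newFields) => oldFields ++ newFields)
    StructType(structFields)
  }

Note that repartition itself triggers a shuffle of the sampled documents, so whether this helps depends on how expensive the per-document type conversion is relative to moving the raw documents across the cluster.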

@bbnsumanth

I'm facing a similar problem; can someone give an update on this? To reduce the fetch time, I'm trying to load the Mongo collection using a split key on an ISODate field, without any success.
I'm using the following config to load the data:
val mongoConfig = MongodbConfigBuilder(
  Map(
    Credentials -> List(slaveCredentials),
    Host -> mongoHost,
    Database -> mongoDatabase,
    Collection -> mongoCollection,
    SamplingRatio -> 1.0,
    WriteConcern -> "normal",
    SplitSize -> "10",
    SplitKey -> "created_at",
    SplitKeyMin -> "2016-11-20T10:01:32.239Z",
    SplitKeyMax -> "2016-11-23T10:01:32.239Z",
    SplitKeyType -> "isoDate"
  )
).build()

val mongoDF = spark.sqlContext.fromMongoDB(mongoConfig)

But what I found was that this query fetches all the data into Spark, which is very slow because of the single executor in the flatMap stage.
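
As a diagnostic and fallback, here is a sketch using only the standard Spark 1.6 DataFrame API (the column name created_at comes from the config above; whether the connector actually pushes the predicate down to MongoDB is an assumption to verify):

    import org.apache.spark.sql.functions.col

    // Filter on the split-key column after loading; if the connector
    // supports predicate pushdown this avoids fetching the whole collection
    val filteredDF = mongoDF.filter(
      col("created_at") >= "2016-11-20T10:01:32.239Z" &&
      col("created_at") <= "2016-11-23T10:01:32.239Z")

    // Check how many partitions (and therefore tasks) the scan produces
    println(mongoDF.rdd.partitions.length)

If the partition count printed here is 1, the split-key options are not taking effect and the whole collection is being read by a single task.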
