
performance problem when getting schema from collection #163

Open
cyjj opened this issue Sep 27, 2016 · 1 comment

Comments

@cyjj

cyjj commented Sep 27, 2016

When loading data from Mongo into Spark, I noticed a weird performance issue in the Spark UI. The job is split into two stages: the first is flatMap at MongodbSchema.scala:41, the second is aggregate at MongodbSchema.scala:47. My problem is that the first stage always gets one task on one executor, which is painfully slow on a big collection. Sometimes the flatMap stage takes an hour while the next one takes only a few seconds. The source code is below:

 override def schema(): StructType = {
    // Sample the collection unless the sampling ratio is effectively 1.0
    val schemaData =
      if (samplingRatio > 0.99) rdd
      else rdd.sample(withReplacement = false, samplingRatio, 1)

    // Map every document to (field name, data type) pairs, merge the types
    // per field with compatibleType, then fold the results into StructFields
    val structFields = schemaData.flatMap {
      dbo => {
        val doc: Map[String, AnyRef] = dbo.seq.toMap
        val fields = doc.mapValues(f => convertToStruct(f))
        fields
      }
    }.reduceByKey(compatibleType).aggregate(Seq[StructField]())(
        (fields, newField) => fields :+ StructField(newField._1, newField._2),
        (oldFields, newFields) => oldFields ++ newFields)
    StructType(structFields)
  }

It looks like this just grabs the schema from the collection. I don't know why this stage is limited to one executor. Is this normal, or is there something I can do to increase the number of executors and get better performance? I am working with Spark 1.6.2 and Stratio 0.11.0.
Thanks.
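
One possible workaround, as an untested sketch only (it assumes you can patch or subclass MongodbSchema, and the partition count 64 is a hypothetical value to tune for your cluster), is to repartition the sampled RDD before the flatMap, so schema inference is spread across more tasks:

 override def schema(): StructType = {
    // Repartitioning is the only change from the original method; it
    // forces the flatMap stage to run on more than one task
    val schemaData =
      (if (samplingRatio > 0.99) rdd
       else rdd.sample(withReplacement = false, samplingRatio, 1))
        .repartition(64) // hypothetical count; tune to your executor cores

    val structFields = schemaData.flatMap {
      dbo => {
        val doc: Map[String, AnyRef] = dbo.seq.toMap
        doc.mapValues(f => convertToStruct(f))
      }
    }.reduceByKey(compatibleType).aggregate(Seq[StructField]())(
        (fields, newField) => fields :+ StructField(newField._1, newField._2),
        (oldFields, newFields) => oldFields ++ newFields)
    StructType(structFields)
  }

Note that repartition itself triggers a shuffle of the sampled documents, so whether this helps depends on how expensive the per-document type conversion is relative to moving the raw documents across the cluster.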

@bbnsumanth

I'm facing a similar problem; can someone give an update on this? To reduce the fetch time, I'm trying to load the Mongo collection using a split key on an ISODate field, without any success.
I'm using the following config to load the data:
val mongoConfig = MongodbConfigBuilder(
  Map(
    Credentials -> List(slaveCredentials),
    Host -> mongoHost,
    Database -> mongoDatabase,
    Collection -> mongoCollection,
    SamplingRatio -> 1.0,
    WriteConcern -> "normal",
    SplitSize -> "10",
    SplitKey -> "created_at",
    SplitKeyMin -> "2016-11-20T10:01:32.239Z",
    SplitKeyMax -> "2016-11-23T10:01:32.239Z",
    SplitKeyType -> "isoDate"
  )
).build()

val mongoDF = spark.sqlContext.fromMongoDB(mongoConfig)

But what I found was that this query fetches all the data into Spark, which is very slow because of the single executor in the flatMap stage.
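
As a diagnostic and fallback, here is a sketch using only the standard Spark 1.6 DataFrame API (the column name created_at comes from the config above; whether the connector actually pushes the predicate down to MongoDB is an assumption to verify):

    import org.apache.spark.sql.functions.col

    // Filter on the split-key column after loading; if the connector
    // supports predicate pushdown this avoids fetching the whole collection
    val filteredDF = mongoDF.filter(
      col("created_at") >= "2016-11-20T10:01:32.239Z" &&
      col("created_at") <= "2016-11-23T10:01:32.239Z")

    // Check how many partitions (and therefore tasks) the scan produces
    println(mongoDF.rdd.partitions.length)

If the partition count printed here is 1, the split-key options are not taking effect and the whole collection is being read by a single task.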
