MongoDB导出场景查询优化

引言

前段时间遇到一个类似导出数据场景,观察下来发现速度会越来越慢,导出100万数据需要耗费40-60分钟,从日志观察发现,耗时也是越来越高。

原因

从代码逻辑上看,这里采取了分批次导出的方式,类似前端的分页,具体是通过skip+limit的方式实现的,那么采用这种方式会有什么问题呢?我们google一下这两个接口的文档:

The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.

简单来说,随着页数的增长,skip()会变得越来越慢,但是具体就我们这里导出的场景来说,按理说应该没必要每次都去重复计算,做一些无用功,我的理解应该可以拿到一个指针,慢慢遍历,简单google之后,我们发现果然是可以这样做的。

我们可以在持久层新增一个方法,返回一个cursor专门供上层去遍历数据,这样就不用再去遍历已经导出过的结果集,从O(N2)优化到了O(N),这里还可以指定一个batchSize,设置一次从MongoDB中抓取的数据量(元素个数),注意这里最大是4M.

/**
     * <p>Limits the number of elements returned in one batch. A cursor 
     * typically fetches a batch of result objects and store them
     * locally.</p>
     *
     * <p>If {@code batchSize} is positive, it represents the size of each batch of objects retrieved. It can be adjusted to optimize
     * performance and limit data transfer.</p>
     *
     * <p>If {@code batchSize} is negative, it will limit of number objects returned, that fit within the max batch size limit (usually
     * 4MB), and cursor will be closed. For example if {@code batchSize} is -10, then the server will return a maximum of 10 documents and
     * as many as can fit in 4MB, then close the cursor. Note that this feature is different from limit() in that documents must fit within
     * a maximum size, and it removes the need to send a request to close the cursor server-side.</p>
*/

比如说我这里配置的8000,那么mongo客户端就会去默认抓取这么多的数据量:

经过本地简单的测试,我们发现性能已经有了飞跃的提升,导出30万数据,采用之前的方式,翻页到后面平均要500ms,总耗时60039ms。而优化后的方式,平均耗时在100ms-200ms之间,总耗时16667ms(中间包括业务逻辑的耗时)。

使用

DBCursor cursor = collection.find(query).batchSize(8000);
while (dbCursor.hasNext()) {
  DBObject nextItem = dbCursor.next();
  //业务代码
  ... 
  //
}

那么我们再看看hasNext内部的逻辑好吗?好的.

    @Override
    public boolean hasNext() {
        if (closed) {
            throw new IllegalStateException("Cursor has been closed");
        }

        if (nextBatch != null) {
            return true;
        }

        if (limitReached()) {
            return false;
        }

        while (serverCursor != null) {
            //这里会向mongo发送一条指令去抓取数据
            getMore();
            if (nextBatch != null) {
                return true;
            }
        }

        return false;
    }


    private void getMore() {
        Connection connection = connectionSource.getConnection();
        try {
            if(serverIsAtLeastVersionThreeDotTwo(connection.getDescription()){
                try {
//可以看到这里其实是调用了`nextBatch`指令        
initFromCommandResult(connection.command(namespace.getDatabaseName(),
                                                             asGetMoreCommandDocument(),
                                                             false,
                                                             new NoOpFieldNameValidator(),
                                                             CommandResultDocumentCodec.create(decoder, "nextBatch")));
                } catch (MongoCommandException e) {
                    throw translateCommandException(e, serverCursor);
                }
            } else {
                initFromQueryResult(connection.getMore(namespace, serverCursor.getId(),
                                                       getNumberToReturn(limit, batchSize, count),
                                                       decoder));
            }
            if (limitReached()) {
                killCursor(connection);
            }
        } finally {
            connection.release();
        }
    }

最后initFromCommandResult 拿到结果并解析成Bson对象

总结

我们平常写代码的时候,最好都能够针对每个方法、接口甚至是更细的粒度加上埋点,也可以设置成debug级别,这样利用log4j/logback等日志框架动态更新级别,可以随时查看耗时,从而更能够针对性的优化,对于本文说的这个场景,我们首先看看是不是代码的逻辑有问题,然后看看是不是数据库的问题,比如说没建索引、数据量过大等,再去想办法针对性的优化,而不要上来就撸代码。

标签:Mongodb 发布于:2019-11-13 02:13:02