MongoDB Best Practice For Pulling Large Amounts Of Data

In historical data migration or large-collection cases, there is a need to pull a vast amount of data from MongoDB into any of the optional target data warehouses available in the Rivery console.

This guide walks you through the best practices for connecting your MongoDB to Rivery in an efficient way.

Source configuration

Extract Method:
The best practice for pulling data from your collection is to use the “Incremental” extract method. This mode pulls data by an incremental value that acts as an index, fetching only the data that has changed since the last run (a conceptual sketch of such a pull is shown after the options below). The Incremental extract method provides the following additional options:

  • Incremental Attributes:
    Set the name of the incremental attribute to run by. e.g. UpdateTime.
  • Incremental Type:
    Timestamp | Epoch. Choose the type of incremental attribute.
  • Date Range:
    • For Timestamp, choose the date increment range to run by.
    • For Epoch, choose the value increment range to run by.
    • If no End Date / Value is entered, the default is the last date/value in the table. The increment will be managed automatically by Rivery.
  • Include End Value:
    Determines whether the increment process includes the end value of the run.
  • Interval Chunks:
    The chunks by which the data will be pulled. Split the data by minutes, hours, days, months, or years. It is recommended to split the data into chunks when a large amount of data is returned, in order to avoid a call timeout on the source side.
  • Max Rows In Batch:
    When the data is parsed in the file zone from BSON to a JSON file, it is a best practice to set a maximum rows-per-batch limit (the number of rows read in a single batch before moving on to the next one). To determine an appropriate batch size, you can follow this principle:

mongo_best_1.png

After every 50% reduction, test the river and check whether it runs successfully. Follow this process until you reach a batch size that enables the river to run as expected.
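For reference, here is a minimal sketch (using pymongo) of what an incremental, chunked pull corresponds to conceptually. The connection string, database, collection, incremental field, and chunk size below are illustrative assumptions only; Rivery manages the increment, chunking, and batching for you.

# Conceptual sketch of an incremental, chunked pull.
# All names and values are hypothetical; Rivery handles this automatically.
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
collection = client["mydb"]["orders"]               # hypothetical database / collection

start = datetime(2023, 1, 1)    # last incremental value stored from the previous run
end = datetime(2023, 1, 8)      # end value (defaults to the last value in the collection)
chunk = timedelta(days=1)       # "Interval Chunks" of one day

current = start
while current < end:
    upper = min(current + chunk, end)
    cursor = (
        collection
        .find({"UpdateTime": {"$gte": current, "$lt": upper}})   # "Include End Value" off
        .batch_size(1000)                                        # analogous to "Max Rows In Batch"
    )
    for doc in cursor:
        pass   # each document is parsed from BSON to JSON and written to the file zone
    current = upper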

  • Filter (Optional):
    Filter the data that will be pulled. The filter format is MongoDB Extended JSON, e.g. {"account":{"$oid":"1234567890abcde"},"datasource": "google", "is_deleted": {"$ne": true}}. A sketch of applying such a filter is shown after this list.

  • Limit (Optional):
    Limit the number of rows to pull.

  • Auto Mapping: Rivery will determine your table / query metadata. You can choose the set of columns you want to fetch from the source, add a new column that Rivery did not detect, or leave the mapping as is. If you do not use Auto Mapping, Rivery will fetch all of the columns in the table.
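If you want to sanity-check a filter before setting it in the river, one option is to parse the same Extended JSON with pymongo's bson.json_util and run it directly against the collection. This is only a sketch; the connection string and collection name are hypothetical, and the filter is a simplified version of the example above.

# Parse a MongoDB Extended JSON filter and apply it with a row limit.
from bson.json_util import loads
from pymongo import MongoClient

filter_json = '{"datasource": "google", "is_deleted": {"$ne": true}}'
query = loads(filter_json)   # Extended JSON -> native query document

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
collection = client["mydb"]["events"]               # hypothetical collection

# The .limit() call mirrors the optional "Limit" setting.
for doc in collection.find(query).limit(100):
    print(doc["_id"])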

Schema configuration

  • Converting Entire Key Data As a STRING (Optional):
    In some cases, the data is not in the form the target expects, for example key names starting with numbers, or flexible and inconsistent object data. A possible way to overcome this is to convert such attributes to a STRING format by setting their data type in the mapping section to STRING.
    An example of this is the following key:value combination -
    {
    "name": "myname",
    "id": 1,
    "sites": {"www.google.com": {"count": 3},
    "www.yahoo.com": {"count": 4},
    "www.amazon.com": {"count": 4}}
    }
    The mapping, in this case, would have the following structure:

mongo_best_2.png

As seen, the sites key in this case has inconsistent, changing values, and the values themselves (the site names) are mapped as columns. To deal with this situation, the mapping solution is to define the sites data type as STRING and parse this column using an open JSON function on the target's side:

mongo_best_3.png

Note:

The conversion applies to any value under the selected key.
Arrays and objects will be converted to JSON strings.
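To illustrate the conversion, here is a short sketch of what the STRING mapping does to the example above. The field names come from that example; the code itself is only illustrative.

import json

document = {
    "name": "myname",
    "id": 1,
    "sites": {
        "www.google.com": {"count": 3},
        "www.yahoo.com": {"count": 4},
        "www.amazon.com": {"count": 4},
    },
}

# With "sites" mapped as STRING, the loaded row effectively contains one JSON string:
sites_as_string = json.dumps(document["sites"])
print(sites_as_string)
# {"www.google.com": {"count": 3}, "www.yahoo.com": {"count": 4}, "www.amazon.com": {"count": 4}}

# On the target side, an "open JSON" function (or any JSON parser) recovers the structure:
parsed = json.loads(sites_as_string)
print(parsed["www.google.com"]["count"])   # 3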

  • Preserve The Collection’s Raw Data (For Snowflake Users Only):
    In cases where a JSON object or array is composed of schemaless keys that may change from row to row, or contains special characters or blanks in its keys, Snowflake users can rely on Snowflake’s built-in ability to query complex JSON objects and arrays via its semi-structured query mechanism. In this case, it is recommended to retype the whole column as a VARIANT column and query the needed fields in the Snowflake console, or in a Logic River, using any of Snowflake’s “open JSON” functions.

To do that, activate the “Preserve Raw Data” toggle (found under the Target tab → File Settings). An array will then automatically be added to the column mapping and, by default, be loaded as a VARIANT column, in order to keep the granularity of the row itself.
mongob_1.png

The result will be reflected in the mapping in the following way:
mongo_best_4.png
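For Snowflake users, the following is a minimal sketch of querying the preserved raw VARIANT column with Snowflake's semi-structured access syntax, via the Snowflake Python connector. The table name, the VARIANT column name (raw_data), and the connection details are all assumptions; adjust them to your own schema.

# Query a VARIANT column produced by "Preserve Raw Data" in Snowflake.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",       # hypothetical connection details
    user="my_user",
    password="...",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)

# The colon / dot notation is Snowflake's semi-structured access syntax.
query = """
    SELECT
        raw_data:"name"::string                          AS name,
        raw_data:"sites"."www.google.com"."count"::int   AS google_count
    FROM my_collection_raw
"""
for name, google_count in conn.cursor().execute(query):
    print(name, google_count)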

It is also possible to use the “expression” column mapping option in order to pull a specific field from the raw JSON data to serve as a key for the Upsert-Merge loading mode:

mongo_best_practice_key.png

