February 26, 2020
Softeq helped a major US telecom company create a custom media streaming solution for digital signage. The solution allows retailers and advertising agencies to broadcast promotional content on digital displays installed in public places.
Besides digital content distribution, the system helps evaluate campaign performance. For this, the devices collect ad impression data for particular campaigns and content types and send the metrics to the back end every hour. An advertiser can request campaign performance data for a given period and view the metrics via the admin portal.
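To make the data flow concrete, here is a minimal sketch of what one hourly metrics payload might look like. The field names and identifier formats are illustrative assumptions, not taken from the actual system:

```python
# Hypothetical shape of one hourly impression report sent by a device;
# every field name here is an assumption for illustration only.
impression_report = {
    "device_id": "display-0042",
    "campaign_id": "spring-sale-2020",
    "content_type": "video",
    "window_start": "2020-02-26T10:00:00Z",
    "window_end": "2020-02-26T11:00:00Z",
    "impressions": 137,
}

def validate(report: dict) -> bool:
    """Minimal sanity check before the payload is sent to the back end."""
    required = {"device_id", "campaign_id", "content_type",
                "window_start", "window_end", "impressions"}
    return required <= report.keys() and report["impressions"] >= 0

print(validate(impression_report))  # True
```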
To store the data, we initially chose MongoDB, which assigns a unique ID to every document and organizes documents into collections, and implemented the reporting functionality using the MongoDB Aggregation Pipeline. The solution worked well until the database grew to 20 million records.
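An aggregation pipeline for this kind of reporting could look roughly like the sketch below. It is expressed as plain Python data in the shape pymongo expects; the collection and field names are assumptions, not the project's actual schema:

```python
# A pymongo-style aggregation pipeline similar in spirit to the reporting
# queries described above; field names are illustrative assumptions.
def campaign_report_pipeline(campaign_id, start, end):
    return [
        # Keep only this campaign's metrics within the requested period.
        {"$match": {
            "campaign_id": campaign_id,
            "window_start": {"$gte": start, "$lt": end},
        }},
        # Sum impressions per content type.
        {"$group": {
            "_id": {"content_type": "$content_type"},
            "impressions": {"$sum": "$impressions"},
        }},
        # Most-viewed content types first.
        {"$sort": {"impressions": -1}},
    ]

pipeline = campaign_report_pipeline("spring-sale-2020",
                                    "2020-02-01", "2020-03-01")
# With pymongo this would be executed as: db.metrics.aggregate(pipeline)
```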
A round of interim optimizations stabilized the system's performance, but only until it became necessary to scale back-end capacity to support the growing number of devices. To keep query speeds acceptable, we needed a more scalable back-end solution.
When considering an alternative to MongoDB, we were looking for a back-end solution that would help the team overcome these limitations.
Eventually, we chose Amazon Redshift, a fully managed, petabyte-scale cloud data warehouse service that is cost-efficient, delivers stable performance, and is DevOps-friendly.
An Amazon Redshift cluster contains a single leader node, which parses queries and distributes work across compute nodes. After the compute nodes complete their tasks, the leader node merges the results and returns the data to the client. Instead of B-tree indexes, Redshift uses sort keys to narrow down the range of data blocks a query has to scan; it also stores data in columns rather than rows and compresses each column, which saves both storage space and I/O.
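The benefit of sort keys can be illustrated with a toy model: Redshift keeps min/max metadata (zone maps) for each 1 MB block of a column and skips blocks whose range cannot match the predicate. The sketch below mimics that pruning with tiny four-row "blocks"; the numbers are invented purely for illustration:

```python
# Toy model of zone-map pruning: count how many "blocks" a range filter
# actually has to scan, given per-block min/max metadata.
def blocks_scanned(values, lo, hi, block_size=4):
    """Count blocks whose [min, max] range overlaps the filter [lo, hi]."""
    scanned = 0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        if min(block) <= hi and max(block) >= lo:
            scanned += 1
    return scanned

sorted_col = list(range(16))  # sort-key column: 4 blocks of 4 rows
shuffled_col = [3, 14, 1, 8, 0, 13, 5, 10, 2, 15, 4, 9, 6, 12, 7, 11]

# Filter: values between 4 and 7 (e.g. a narrow time range).
print(blocks_scanned(sorted_col, 4, 7))    # 1  (only one block overlaps)
print(blocks_scanned(shuffled_col, 4, 7))  # 4  (every block overlaps)
```

When the data is sorted on the filtered column, one block covers the whole range and the rest are skipped; unsorted data forces a scan of every block.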
Amazon Redshift also integrates with Amazon S3. Thus, we can simply export the data from MongoDB as JSON files, upload them to S3, and load them from S3 into Redshift. In the same way, we can offload retired data to S3 after a specified retention period.
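The S3-to-Redshift load is done with Redshift's COPY command. The sketch below builds such a statement; COPY ... IAM_ROLE ... JSON 'auto' is real Redshift syntax, but the table, bucket, and role names are assumptions:

```python
# Build a Redshift COPY statement for loading JSON files from S3.
# Table, bucket, and IAM role names below are illustrative assumptions.
def copy_statement(table, s3_prefix, iam_role):
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "JSON 'auto' GZIP;"
    )

print(copy_statement(
    "ad_metrics",
    "s3://example-bucket/metrics/2020/02/",
    "arn:aws:iam::123456789012:role/redshift-load",
))
```

`JSON 'auto'` maps JSON keys to column names automatically, and `GZIP` lets the files stay compressed in S3, which keeps both storage and transfer costs down.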
The media streaming solution still uses MongoDB (and Postgres) as the operational stores for ad performance metrics. ETL jobs batch-load the metrics from MongoDB/Postgres into Redshift; thanks to the modified_date_time column, we know which records have changed since the last run. We implemented the ETL jobs with Spring Boot and the Redshift JDBC driver. For report generation, the application constructs queries and sends them directly to Redshift over the same driver.
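The incremental-load logic behind such ETL jobs can be sketched as follows: each run pulls only rows whose modified_date_time is newer than the high-water mark recorded by the previous run. The modified_date_time column comes from the article; the table name and the Python form are assumptions (the real jobs run in Spring Boot over JDBC):

```python
# Sketch of watermark-based incremental extraction. Only modified_date_time
# is from the actual system; the table name is an assumption.
def incremental_extract_query(table, last_watermark):
    return (
        f"SELECT * FROM {table} "
        f"WHERE modified_date_time > '{last_watermark}' "
        "ORDER BY modified_date_time;"
    )

q = incremental_extract_query("ad_metrics", "2020-02-26 10:00:00")
```

After a successful load, the job would persist the maximum modified_date_time it saw as the next run's watermark. (A production job would also use bind parameters instead of string interpolation, which JDBC's PreparedStatement provides.)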
We also benchmarked the solution and found that, for our use case, a two-node cluster can store up to 2 billion records. At that volume, it takes the system approximately five minutes to produce a report.
To compare the performance of MongoDB and Redshift, consider the case when the database contains 25 million records. MongoDB took 12 minutes to generate a report in response to a user's query; Amazon Redshift does the job in approximately 20 seconds. Thus, we now do the same amount of data processing and analytics 36 times faster.
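The quoted speedup follows directly from the two figures:

```python
# 12 minutes (MongoDB) versus 20 seconds (Redshift) for the same report.
mongo_seconds = 12 * 60       # 720 seconds
redshift_seconds = 20
print(mongo_seconds / redshift_seconds)  # 36.0
```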