Sandfly Scaling Guide

Introduction

Most multi-tier applications benefit from scaling to allow them to function more effectively and reliably, and Sandfly is no exception. An overloaded Node will eventually prevent scans from completing, and an under-scoped Server could cause degraded or complete loss of service. We highly encourage Sandfly users to scope and build out their deployments sufficiently to accommodate their operating environment and technical requirements. The information below will aid you in accomplishing this.

Scaling Goals

In general, the overall goal is to avoid any loss of service or data. While Sandfly is designed to be robust, service-impacting situations can occur, such as overwhelming a node's ability to clear its queue due to insufficient node resources, not having enough CPU and RAM on the server to process and/or store results data, or the container partition filling up due to the quantity of results relative to the data retention period, to name a few examples. These and other cases can be minimized by architectural choices.

What that means from a quantitative perspective depends on various factors, which predominantly consist of:

  • How many active hosts are available to scan?
    • Each active host is essentially a multiplier of processing and data storage.
  • How many enabled Schedules exist and what percent of hosts are included?
    • The more scans that are run, the more data that will be produced, processed, and stored.
  • How many Sandflies run per Schedule on average and of which Types?
    • Some Sandflies take longer to run and/or produce more data than others.
    • See the Sandfly Types documentation for details about the different types.
  • Are scanned hosts on the same network segment, shared storage, and/or physical hardware?
    • Scans occurring simultaneously on endpoints with shared resources, especially filesystem-based ones, will increase resource use on the common component.
  • Are scanned hosts physically located together or are they geographically and/or service provider distributed such as in multiple data centers?
    • Named Queues benefit these environments, but require additional nodes.
    • See the Named Queues documentation for additional details.

The answers to those questions should be used in the scaling calculations and configuration decisions.

TIP: Run Only the Sandfly Stack on the Server for Security

For the security of the scanned systems, in addition to the reliability and scaling of Sandfly, we highly recommend that the Sandfly hosts run only the Sandfly stack and no other applications.

Server

Sandfly uses a single server which supplies the web interface, REST API, and database.

An under-powered server can experience database timeouts. Similarly, if the User Interface is taking a long time to load data, the server may have too few resources and need more RAM and/or additional CPU cores.

Server Resource Requirements

The Sandfly stack consists of two Docker containers, one for the API and web interface and one for the local database.
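
A quick way to confirm both containers are up is with standard Docker tooling (shown as an example; container names other than sandfly-postgres, which appears later in this guide, vary by deployment):

  docker ps --format "table {{.Names}}\t{{.Status}}"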

The absolute minimum Server system resource requirements:

  • 4GB of RAM
  • 2 dedicated CPU cores
  • 20 GB of disk space (includes the host Operating System (OS))
    • SSD drives are recommended for better application performance.

Each system resource will need to be scaled up independently, where and when it is appropriate for your unique environment.

Server Scaling

The main focus when scaling the Server is to prevent the database from becoming overloaded or entering a blocking state. In most cases scaling will be done at the host level, so running the server in a Virtual Machine (VM) allows for easy system resource changes as needs arise or grow.

Server scaling areas to take into account:

  • For database performance
    • Server RAM size should be at least as large as the index size of the results_json GIN index, ideally with generous headroom.
      • Shell command to quickly determine the index size:
        docker exec -it sandfly-postgres psql -U sandfly -c "SELECT pg_size_pretty(pg_indexes_size('results_json'));"
    • Additional RAM is typically more important, as it keeps data cached, but the CPU is used heavily during active work. Monitor both and upgrade whichever reaches its limits.
  • For processing incoming results
    • CPU is important, along with enough RAM to hold large result sets as they arrive, before they are saved into the database. In other words, as you scale nodes or the number of scanned hosts, Sandfly will reach the point where the server needs more CPU cores and RAM.
  • For disk space
    • Partition sizing applies to the location that contains the Docker data.
      • By default that path is: /var/lib/docker/
    • The vast majority of the database size will come from scan results data.
    • A single scan pair (Sandfly + os_identify) generates approximately 2 KB of data.
    • The aggregated size varies greatly, depending on these two factors (a rough sizing sketch follows this list):
      • Scan schedules, which includes:
        • The run frequency of enabled schedules
        • The percentage, types, and quantity of selected sandflies
        • The quantity of included, active hosts per scan
      • Data retention period:
        • Between 1 and 31 days, depending on the license and settings
        • Data deletion applies to all scan results after the defined period
    • Should longer periods of data retention be required at any level of detail, we recommend forking the data into an external data store via any of our available replication / forwarding methods.
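
As a rough sizing sketch using the approximate figures above (illustrative numbers only): 1,000 active hosts × 100 Sandflies per scan × 2 KB per scan pair ≈ 200 MB per full scan run; multiply by runs per day and by retention days for an aggregate estimate. The commands below help verify actual usage; the database container name matches the earlier example, and the path assumes the default Docker data location:

  docker exec -it sandfly-postgres psql -U sandfly -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
  df -h /var/lib/docker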

Server Schedules

Sandfly Schedules are the largest source of data generation as they drive automated scans.

For both scaling and functional awareness, note that scan schedules will always run at their scheduled time. Thus, if you have two 500-host schedules that trigger at the same time, you will end up with 1,000 hosts in the task queue. However, hosts are always de-duplicated (e.g. if schedule 1 adds host A to the queue, then when schedule 2 runs and tries to also add host A, schedule 2 will simply skip host A for that run).

Because of those points, it is important to plan your Schedules with sufficient forethought (staggering the start times of large schedules, for example) to minimize the chance of scans not completing and to avoid overloading the Nodes.

Nodes

Sandfly Nodes allow the application to connect to Linux hosts over SSH to perform agentless scans and report the results back to the server. Each Node container has the capacity for 500 scanning threads and can easily queue and scan many times this number of hosts during operation.

Node Resource Requirements

The scanning nodes consist of one or more Docker containers. Should the host's available memory allow it, you can, and in most cases should, run multiple Node containers on a single system instance.

The absolute minimum Node system resource requirements:

  • 2GB of RAM
  • 1 dedicated CPU core
  • 5 GB of disk space (includes the host Operating System (OS))
    • A SSD drive is recommended for the best performance

That configuration provides the capacity to run 2 Node containers simultaneously.

  • Each container provides 500 scanning threads.
  • Provides a total of 1,000 scanning threads, which is sufficient for smaller deployments that do not require host-level redundancy.
  • After reserving the first GB of memory for the base OS, 2 containers per additional GB of memory is a reasonable ratio for scaling (see the example below).
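
As a worked example of that ratio (a rule of thumb, not a hard limit): a host with 4 GB of RAM supports roughly (4 - 1) × 2 = 6 Node containers, or 3,000 scanning threads. The same arithmetic in shell form:

  # Rule-of-thumb Node container capacity from total host RAM (GB).
  ram_gb=4
  containers=$(( (ram_gb - 1) * 2 ))   # reserve 1 GB for the OS, then 2 containers per GB
  threads=$(( containers * 500 ))      # each container provides 500 scanning threads
  echo "${containers} containers / ${threads} threads"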

Node Scaling

The main Node scaling goal is to ensure that the overall rate of scan additions to the task queue does not overwhelm the ability of the node(s) to clear the queue. Thus, the ratio of Node threads to scanned hosts does not have to be 1:1.

In fact, there is nothing wrong with having 800 tasks in the queue but only 500 threads of node capacity available; as each task finishes, the next waiting task will be picked up by the next node thread that becomes available.

This is an important architectural consideration: running fewer threads than queued tasks may be gentler on your endpoint hosts and network, especially if they are all running on the same VM servers, shared storage, etc.

Sandfly Nodes can be scaled in 3 different ways, which are largely influenced by your environment and technical requirements:

  • Vertical - The most basic form of Node scaling involves adding more memory to allow additional Node containers to run on a single system. Vertical scaling alone is the least redundant approach; however, it is usually sufficient for low-impact and/or small deployments.
  • Horizontal - For reliability and redundancy we recommend having 2 (or more, as appropriate) Node systems available, preferably on different infrastructures. Should a Node fail for whatever reason, the remaining Node(s) can continue scanning. These nodes can benefit from Vertical scaling as well; just keep the points above in mind so as not to over- or under-architect.
  • Named Queues - This scaling direction is beneficial for distributed, segmented, and/or large environments. Sandfly can deploy scanning nodes inside multiple isolated networks that communicate with the central server. This configuration allows you to deploy Sandfly across multiple cloud providers, remote offices, internal networks, and other layouts while controlling all nodes from a central point.
    • Should your environment fall within this last layer of architecture, we recommend the use of Horizontal and Vertical scaling for each Named Queue.

Node Charts

Conceptual Minimal Scaling per Node Host (without container redundancy)

Scanned Hosts | Minimal CPU Cores | Minimal Total Memory (GB) | Minimum Node Containers | Total Scanning Threads
100           | 1                 | 2                         | 1                       | 500
1,000         | 2                 | 4                         | 2                       | 1,000
10,000        | 4                 | 8                         | 6                       | 3,000

CHART NOTES:

  • Not suggested for critical infrastructure; the data is provided mainly for scaling considerations.
  • Does not consider the use of Named Queues.

Starting Conceptual Minimums (provides both container and host-level redundancy)

Scanned Hosts | Node Hosts | Minimal CPU Cores | Minimal Total Memory (GB) | Minimum Node Containers | Total Scanning Threads
100           | 2          | 1                 | 2                         | 1                       | 500
1,000         | 2          | 2                 | 6                         | 2                       | 1,000
10,000        | 3          | 4                 | 24                        | 8                       | 4,000

CHART NOTES:

  • Should one of the multiple nodes fail, the remaining node(s) need to be able to maintain the full load.
  • Does not consider the use of Named Queues.

Measuring

Finally, using system monitoring tools that regularly collect system resource data, such as CPU load, memory use, disk I/O, and network use, on all of your hosts will aid your ability to scale, fine-tune your configuration, and track down application-affecting issues.

'Measure twice, cut once'

Aside from being alerted on maxed-out resources, which will cause problems on any host, the same data can be analyzed when less impactful situations occur, such as Sandfly scan timeouts. Additionally, this data creates baselines that are useful for before-and-after comparisons, such as changes caused by increasing Sandfly quantities or adding additional Schedules. A few quick spot checks are shown below.
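
If a full monitoring stack is not yet in place, a few standard commands provide a quick point-in-time view of the same resources (examples only; any equivalent tooling works):

  docker stats --no-stream     # per-container CPU and memory use
  free -h                      # host memory and swap
  df -h /var/lib/docker        # disk space on the Docker data partition
  uptime                       # CPU load averages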

Conclusion

By applying the above best practices and scaling information, Sandfly can be set up to run reliably in any size of environment.

Should you require further assistance with scaling and/or architectural configurations, please reach out to your Sandfly representative.