Data protection in clouds and digital platforms is our core mission at Alice&Bob.Company. We try to secure every aspect of your cloud infrastructure so that you can deliver your applications quickly, reliably and as securely as possible. One of the most important steps toward these goals is to start with the mindset of everyone involved. Security should not be thought of as a state but as a continuous process. Even when following all best practices and having the most sophisticated intrusion prevention system in place, a system can still succumb to unknown 0-day exploits, fatal bugs or simple human oversight. A strict and well-maintained security posture might minimize possible damage, but having a plan for dealing with breaches is nevertheless absolutely necessary (albeit hopefully never needed!).
So, having acknowledged that breaches can happen, we now have to plan ahead for how to respond and shut them down as fast as possible. If you take a look around, you will notice that the average discovery and response time in the industry is often measured in days, which is very scary! For everyone else, this might have a plethora of more or less understandable reasons, but since we are cloud native and have a well-equipped tool belt at our disposal, anything longer than minutes is not acceptable.
We will use several tools and APIs offered by AWS to build our own automated incident response process. We will automatically detect breaches, isolate compromised instances, save the current state as evidence, shut down the instance, spin up a dedicated forensics machine and attach the infected volumes for further investigation.
The first step is detection, powered by GuardDuty. GuardDuty offers a central place to manage and detect a broad range of threats, from spam activities to crypto mining. GuardDuty scans CloudTrail logs, VPC Flow Logs and DNS logs without any configuration necessary on your side. If any of your instances produces a finding, GuardDuty can send an event to CloudWatch Events.
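To route findings to a handler, a CloudWatch Events rule can match on GuardDuty's event source. The following is a minimal sketch of one possible setup, not necessarily the configuration used for this article. It assumes `events_client` is a `boto3.client('events')` object; the rule name, the target ID `incident-response-lambda` and the function ARN are hypothetical placeholders.

```python
import json

# Event pattern that matches every GuardDuty finding delivered to CloudWatch Events.
GUARDDUTY_FINDING_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
}


def create_finding_rule(events_client, rule_name, lambda_arn):
    """Create a rule for GuardDuty findings and point it at a Lambda function.

    events_client is assumed to be boto3.client('events'); rule_name and
    lambda_arn are placeholders supplied by the caller.
    """
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(GUARDDUTY_FINDING_PATTERN),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "incident-response-lambda" is an arbitrary target ID chosen for this sketch.
        Targets=[{"Id": "incident-response-lambda", "Arn": lambda_arn}],
    )
```

Note that the Lambda function additionally needs a resource-based permission that allows `events.amazonaws.com` to invoke it; that step is left out here.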
The main step is now to act on this intelligence. We will use an AWS Lambda function to automatically respond to this incident.
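Before acting, the function has to work out which instance a finding refers to. The sketch below shows one way to read that from the event, assuming the finding is delivered under the `detail` key and describes an EC2 instance. The helper name `extract_instance_id` and the severity threshold of 4.0 are our own assumptions for illustration.

```python
def extract_instance_id(event, min_severity=4.0):
    """Return the EC2 instance ID from a GuardDuty finding event, or None.

    None is returned when the finding is not about an EC2 instance or when
    its severity is below min_severity (an assumed threshold for this sketch).
    """
    detail = event.get("detail", {})
    resource = detail.get("resource", {})
    if resource.get("resourceType") != "Instance":
        return None
    if detail.get("severity", 0) < min_severity:
        return None
    return resource.get("instanceDetails", {}).get("instanceId")


# Trimmed-down example event; real findings carry many more fields.
sample_event = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {
        "type": "CryptoCurrency:EC2/BitcoinTool.B!DNS",
        "severity": 8,
        "resource": {
            "resourceType": "Instance",
            "instanceDetails": {"instanceId": "i-0123456789abcdef0"},
        },
    },
}

print(extract_instance_id(sample_event))
```

Returning `None` instead of raising lets the handler quietly skip findings it should not act on, such as those concerning IAM access keys.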
First, we isolate the VM by stripping all security groups from it and assigning it a security group that allows no outbound or inbound traffic, except from the sources we need for forensic purposes:
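import time

import boto3

region = 'eu-central-1'
ec2 = boto3.client('ec2', region_name=region)
ec2_res = boto3.resource('ec2', region_name=region)
security_group_id = '<your isolation SG>'

def lambda_handler(event, context):
    # Obtain the instance ID from the event message (assumes an EC2 finding from GuardDuty)
    instance_id = event['detail']['resource']['instanceDetails']['instanceId']
    instance = ec2_res.Instance(instance_id)
    # Isolate the instance: replace all of its security groups with the isolation SG
    sg_resp = instance.modify_attribute(Groups=[security_group_id])
    print("Assigned isolation SG {} to {}".format(security_group_id, instance_id))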
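Replacing the security groups discards the record of which groups the instance had before, which investigators may want later. As a sketch of one way to keep that information, the helper below writes the original group IDs to a tag before isolating. It assumes `instance` is a boto3 `ec2.Instance` resource; the helper name and the tag key `forensics:original-security-groups` are made up for this example.

```python
def isolate_instance(instance, isolation_sg_id):
    """Swap all security groups for the isolation group and return the old IDs.

    instance is assumed to be a boto3 ec2.Instance resource. The tag key
    used here is a hypothetical naming choice, not an AWS convention.
    """
    original_ids = [sg["GroupId"] for sg in instance.security_groups]
    # Record the previous groups on the instance itself before changing anything.
    instance.create_tags(
        Tags=[{
            "Key": "forensics:original-security-groups",
            "Value": ",".join(original_ids),
        }]
    )
    # Passing a single group replaces the whole list of attached groups.
    instance.modify_attribute(Groups=[isolation_sg_id])
    return original_ids
```

Inside the handler this could be called as `isolate_instance(ec2_res.Instance(instance_id), security_group_id)`.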
The next step is to crash the machine immediately to prevent any further modification, especially by any process or mechanism that might try to cover its tracks as soon as it loses contact with its C&C server or detects that the host is shutting down. With only slight modifications, we can prepare an instance to react to this sudden crash in the way we need: capturing the current state while preventing any further manipulation. The AWS API offers the option to send an interrupt to initiate the crash. During all of this, we can already start to spin up a replacement instance to keep operations running.
    # Send a diagnostic interrupt to crash the instance and trigger the dump
    diag_resp = ec2.send_diagnostic_interrupt(InstanceId=instance_id)
    # Allow some grace period for writing the kernel and memory dump
    time.sleep(30)
    stop_resp = ec2.stop_instances(InstanceIds=[instance_id])
    print("Stopping VM {}, preventing reboot".format(instance_id))
    instance.wait_until_stopped()
This sends a non-maskable interrupt (NMI) to crash the instance. The detailed behaviour is configurable, but the most important part is to have a tool that captures the current state. kdump is capable of doing this. It allows us to load an additional dump-capture kernel at boot time, which takes over when the main OS crashes. This is a very useful debugging tool in general, but it is especially helpful for us because we can crash the system and still write the current state to disk. For further details about the diagnostic interrupt, we recommend the AWS blog.
Now the compromised instance is shut down and isolated. The time from the incident notification to the isolation and shutdown of the instance was merely seconds. The next step is to learn how to prevent this event from ever happening again. We will prepare the machine and the data to be analyzed by a forensics team. By first creating snapshots of the compromised instance's EBS volumes, we make sure to secure all evidence for investigation. The original volumes can then be detached from the VM:
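The interrupt only yields a dump if the instance was prepared beforehand. As a rough sketch, the check below looks for two settings commonly needed on Linux: a `crashkernel=` memory reservation on the kernel command line, and the `kernel.unknown_nmi_panic` sysctl set to 1 so that the NMI actually causes a panic. The function names are our own, and the exact requirements depend on your distribution, so treat this as a starting point rather than a complete readiness test.

```python
def kdump_prerequisites_missing(kernel_cmdline, unknown_nmi_panic):
    """Return a list of human-readable problems; an empty list means none found.

    kernel_cmdline is the content of /proc/cmdline, unknown_nmi_panic the
    content of /proc/sys/kernel/unknown_nmi_panic.
    """
    problems = []
    # kdump needs memory reserved for the dump-capture kernel at boot time.
    if not any(arg.startswith("crashkernel=") for arg in kernel_cmdline.split()):
        problems.append("no crashkernel= reservation on the kernel command line")
    # Without this setting an unknown NMI is typically logged, not treated as a crash.
    if unknown_nmi_panic.strip() != "1":
        problems.append("kernel.unknown_nmi_panic is not set to 1")
    return problems


def check_local_host():
    """Run the check against the live values of the Linux host it runs on."""
    with open("/proc/cmdline") as f:
        cmdline = f.read()
    with open("/proc/sys/kernel/unknown_nmi_panic") as f:
        nmi_panic = f.read()
    return kdump_prerequisites_missing(cmdline, nmi_panic)
```

This does not verify that the kdump service itself is installed and running, which has to be checked separately.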
    # Create snapshots of all attached EBS volumes
    volumes = []
    ebs_list = instance.block_device_mappings
    for ebs_dict in ebs_list:
        vol_id = ebs_dict['Ebs']['VolumeId']
        volumes.append(vol_id)
        snapshot = ec2_res.create_snapshot(VolumeId=vol_id)
        print("Created snapshot {} of volume {}".format(snapshot.id, vol_id))
    # Detach the volumes from the compromised instance
    for v in volumes:
        vol = ec2_res.Volume(v)
        vol.detach_from_instance()
    print("Detached volumes {} from {}".format(volumes, instance_id))
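Both snapshot creation and detaching are asynchronous, so later steps can run before they have finished. The sketch below is one possible variant that tags each snapshot with its source instance and waits for both operations using the EC2 client's built-in waiters. It assumes `ec2_client` is a `boto3.client('ec2')` object; the function name and the tag key `forensics:source-instance` are hypothetical.

```python
def snapshot_and_detach(ec2_client, instance_id, volume_ids):
    """Snapshot each volume, wait for completion, then detach and wait again.

    ec2_client is assumed to be boto3.client('ec2'). Returns the snapshot IDs.
    """
    if not volume_ids:
        return []

    snapshot_ids = []
    for vol_id in volume_ids:
        snap = ec2_client.create_snapshot(
            VolumeId=vol_id,
            Description="Forensic evidence from {}".format(instance_id),
            TagSpecifications=[{
                "ResourceType": "snapshot",
                # Hypothetical tag key, used to find the evidence again later.
                "Tags": [{"Key": "forensics:source-instance", "Value": instance_id}],
            }],
        )
        snapshot_ids.append(snap["SnapshotId"])

    # Block until every snapshot reports the 'completed' state.
    ec2_client.get_waiter("snapshot_completed").wait(SnapshotIds=snapshot_ids)

    for vol_id in volume_ids:
        ec2_client.detach_volume(VolumeId=vol_id, InstanceId=instance_id)
    # A volume must be 'available' before it can be attached to another instance.
    ec2_client.get_waiter("volume_available").wait(VolumeIds=volume_ids)
    return snapshot_ids
```

For large volumes the snapshot waiter may run longer than a Lambda invocation is allowed to, in which case the waiting would have to move into a separate step.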
Finally, we can spin up a dedicated forensics VM, preloaded with all the tools needed, and attach the volume.
    ami = '<your forensics ami>'
    # run_instances expects 'Arn' or 'Name' for the instance profile, not 'Id'
    iamRole = {'Arn': 'arn:aws:iam::0123456789:instance-profile/ForensicsProfile'}
    # Launch in the compromised instance's subnet: a volume can only be
    # attached to an instance in the same Availability Zone
    forensic_vm = ec2.run_instances(
        ImageId=ami, InstanceType='t3.micro', MinCount=1, MaxCount=1,
        SecurityGroupIds=[security_group_id], SubnetId=instance.subnet_id,
        IamInstanceProfile=iamRole)
    forensic_vm_id = forensic_vm['Instances'][0]['InstanceId']
    fm = ec2_res.Instance(forensic_vm_id)
    fm.wait_until_running()
    fm.attach_volume(Device='/dev/sdf', VolumeId=volumes[0])
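The snippet above attaches only the first volume. If the compromised instance had several, each one needs its own device name. The sketch below pairs volumes with names from `/dev/sdf` to `/dev/sdp`, which to our knowledge is the range AWS suggests for additional EBS volumes on Linux. The helper names are our own, and `forensic_instance` is assumed to be a boto3 `ec2.Instance` resource.

```python
import string

# Device names /dev/sdf through /dev/sdp (11 names in total).
DATA_DEVICE_NAMES = ["/dev/sd" + letter for letter in string.ascii_lowercase[5:16]]


def plan_attachments(volume_ids):
    """Pair each volume ID with a distinct device name, in order."""
    if len(volume_ids) > len(DATA_DEVICE_NAMES):
        raise ValueError("more volumes than available device names")
    return list(zip(DATA_DEVICE_NAMES, volume_ids))


def attach_all(forensic_instance, volume_ids):
    """Attach every volume to the forensics instance and return the plan used."""
    plan = plan_attachments(volume_ids)
    for device, vol_id in plan:
        forensic_instance.attach_volume(Device=device, VolumeId=vol_id)
    return plan
```

In the handler, `attach_all(fm, volumes)` could then take the place of the single `attach_volume` call.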
Forensics experts can now log in to the machine, mount the attached volumes and start their investigation.
To sum up: we used a small set of ready-to-use AWS services, built-in integrations and only minor preparation of instances to:
- detect compromised instances
- isolate compromised instances
- prevent any further manipulation
- securely store all evidence
- spin up a dedicated forensics instance to evaluate the captured evidence
- spin up a replacement instance to continue operation
in a mere couple of minutes.