A cyber attack is unauthorized access to a computer information system or infrastructure by individuals, organizations, or nations acting with malicious intent: to steal sensitive documents, compromise the network, vandalize resources, or use those resources for further malicious actions.
For this discussion, we define a high-performance computing (HPC) environment as a compute resource running the Linux operating system (OS) with at least 500 to 1000 compute nodes and approximately 4000 to 12000 compute cores. Such systems invariably have a high-speed, low-latency, high-bandwidth interconnect fabric such as InfiniBand, along with attached storage arrays on the order of hundreds of terabytes to petabytes. At any given time these resources are used simultaneously by an average of 200 to 300 users, mostly from academic research environments in universities, although the total number of account holders may exceed 1000. The complexity of securing the system increases with the number of users, because there will be more cases of lost or compromised passwords. In HPC clusters usually only a few login nodes are open to the public network, while the compute nodes sit on a private network. Security breaches involving malware usually associated with the Windows OS, such as Trojans or computer worms, will not be discussed here. We will discuss only proactive mitigating steps that minimize interruptions in the operation of the resource.
The expectation in an HPC environment is that the research being done is mostly open, that the resources should be easily accessible, and that policy should accommodate researchers who collaborate around the globe. Security therefore has to be balanced against convenience. Because of the convenience factor, intrusion prevention is somewhat harder on HPC systems, and they are more vulnerable as a result. On the other hand, operating an HPC environment in a university has real advantages over operating the compute environment of a financial institution. Typical researchers are not expected to store personal information such as social security numbers or private medical data on these systems. The biggest worry is that hackers who find no useful data may vandalize the system, or use the resource to stage criminal activities such as launching a distributed denial of service (DDoS) attack against someone else. Users whose research involves private medical data are required to do that work in a HIPAA-compliant compute environment. We will not address how to set up a HIPAA-compliant system in this write-up, as it brings the additional complexity of encrypting all the research data. HPC sites typically do not have to worry about being the target of denial-of-service attacks, as these are usually aimed at high-volume web portals such as news organizations or government web sites.
Protecting passwords and disabling unencrypted network protocols:
In the 1990s, research compute environments used protocols such as telnet and ftp, in which data between remote computers is transmitted in clear text. It was therefore easy for anybody with reasonable expertise to intercept the communication and read its contents, including eavesdropping on the traffic to an open port and recording users' keystrokes. None of the HPC sites that we know of run these protocols anymore. Traffic among HPC systems connected through public or private networks now travels exclusively over encrypted protocols such as ssh, sftp, and https, typically implemented with OpenSSH and OpenSSL. Since almost all HPC resources run some version of the Linux operating system, they invariably run an iptables-based firewall at the host level, which is the primary tool for restricting access to service ports from outside networks. Many sites open only a few ports, such as port 22 for ssh. Iptables also helps keep the system operating when there are known zero-day vulnerabilities, by isolating the resource from outside networks.
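As an illustration of the "open only a few ports" policy described above, the following Python sketch builds the iptables commands for a default-deny inbound policy on a login node. It only produces the command strings and does not execute them. The function name `build_login_node_rules` and the choice of port 22 as the sole open port are assumptions made for this example, not a recommended production rule set.

```python
def build_login_node_rules(allowed_tcp_ports=(22,)):
    """Return iptables commands (as strings) for a default-deny inbound policy.

    Hypothetical helper for illustration; it does not touch the firewall.
    """
    rules = [
        # Keep local loopback traffic working.
        "iptables -A INPUT -i lo -j ACCEPT",
        # Allow replies belonging to connections that are already established.
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in allowed_tcp_ports:
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        # Open one service port to the outside network (22 is ssh).
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    # Set the default policy last, so the ACCEPT rules are in place
    # before everything else inbound is dropped.
    rules.append("iptables -P INPUT DROP")
    return rules


rules = build_login_node_rules()
for rule in rules:
    print(rule)
```

Generating the rules as text first makes it possible to review them before an administrator applies them as root. Isolating a node during a zero-day, as mentioned above, would correspond to calling the function with an empty tuple of allowed ports.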
HPC systems are typically compromised through users who either fail to protect their passwords or choose passwords that are easy to guess, such as ‘test123’. By virtue of the design of the Linux OS, an exploit at the user level is often contained to that particular user, because regular users do not have elevated privileges and cannot access the files of other users or of users from a different group. Although a breach through a compromised password is usually contained within one user's environment, it escalates when there is a flaw in the Linux kernel itself that allows the hacker to trigger a local root exploit and elevate privileges. In that situation the OS needs to be reinstalled with an updated kernel. Linux kernels in the 2000s had frequent flaws and were susceptible to memory corruption (buffer overflows), which is becoming very rare these days. A different kind of problem arises when the security package itself has a flaw, such as the Heartbleed bug (heartbleed.com) in OpenSSL, which was disclosed in 2014 even though it had been present in released versions since 2012.
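To make the point about easily guessed passwords concrete, here is a minimal Python sketch of the kind of check a site might run when a user sets a password. The word list, the 12-character minimum, and the function name `weak_password_reasons` are all assumptions for illustration; a real deployment would rely on a much larger dictionary and on the site's own policy.

```python
import re

# Tiny illustrative word list (an assumption); real checks use large dictionaries.
COMMON_WORDS = {"test", "password", "admin", "letmein", "qwerty", "welcome"}


def weak_password_reasons(password, username="", min_length=12):
    """Return a list of reasons the password looks weak (empty list if none)."""
    reasons = []
    lowered = password.lower()
    if len(password) < min_length:
        reasons.append("too short")
    # Strip trailing digits and symbols so that 'test123' reduces to 'test'.
    stem = re.sub(r"[\d\W_]+$", "", lowered)
    if stem in COMMON_WORDS:
        reasons.append("based on a common word")
    if username and username.lower() in lowered:
        reasons.append("contains the username")
    return reasons


print(weak_password_reasons("test123"))
print(weak_password_reasons("correct-horse-battery-7"))
```

The first call reports two problems for ‘test123’ (length and a common stem), while the longer passphrase passes these particular checks. Passing the checks does not prove that a password is strong; it only means that none of the simple patterns were found.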
When accounts are hacked, their owners are often unaware that they have been compromised. In our experience of running an HPC cluster for the past 12 years, it is usually the activity of the hackers that alerts the system administrators to a possible compromise; if somebody merely logs in to the system and does nothing, the intrusion is often overlooked. But as soon as the impostor starts using the resources, the monitoring tools that come with the Linux OS can record the unusual behavior of the system, and an alert system administrator can take remedial action. The behavior of hackers is almost always completely different from that of the account owner. Sudden bursts of network activity, increased network latency, heavy CPU load, and unauthorized jobs that bypass the job scheduler are all good indicators of a possible compromise. In such scenarios the hackers are typically exposed within 8 to 10 hours. The affected systems are then often quarantined from outside networks for forensic work, and all the logs are examined to trace the details of the attack: time, frequency, source host, source port, destination host, destination port, and the protocol or application used. The system is put back into service after remedial actions have been taken, such as notifying the appropriate authorities if necessary, upgrading or removing the faulty application or kernel, and applying any other needed upgrades.
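The log examination described above can be sketched in Python. The example below parses successful ssh logins from syslog-style lines, extracts the time, user, source host, and source port, and flags logins from a source address not previously seen for that user. The sample log lines, the `KNOWN_SOURCES` baseline, and the helper names are hypothetical; the regular expression assumes the traditional syslog timestamp and the usual OpenSSH "Accepted ... for ... from ... port ..." message, and would need adjusting for other log formats.

```python
import re
from collections import Counter

# Matches lines such as:
# "Mar  3 02:14:07 login1 sshd[4121]: Accepted password for alice from 198.51.100.23 port 50122 ssh2"
ACCEPTED = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s[\d:]{8})\s+(?P<host>\S+)\s+sshd\[\d+\]:\s+"
    r"Accepted\s+(?P<method>\S+)\s+for\s+(?P<user>\S+)\s+from\s+(?P<src>\S+)\s+"
    r"port\s+(?P<port>\d+)"
)


def parse_accepted_logins(lines):
    """Return one dict per successful login; other lines are ignored."""
    return [m.groupdict() for m in map(ACCEPTED.match, lines) if m]


def logins_from_new_sources(events, known_sources):
    """Keep events whose source address is not in the user's known set."""
    return [e for e in events if e["src"] not in known_sources.get(e["user"], set())]


def login_frequency(events):
    """Count successful logins per (user, source address) pair."""
    return Counter((e["user"], e["src"]) for e in events)


# Hypothetical sample data using documentation-reserved IP addresses.
SAMPLE_LOG = [
    "Mar  3 02:14:07 login1 sshd[4121]: Accepted password for alice from 198.51.100.23 port 50122 ssh2",
    "Mar  3 02:15:40 login1 sshd[4188]: Failed password for bob from 203.0.113.9 port 41810 ssh2",
    "Mar  3 09:02:11 login1 sshd[5302]: Accepted publickey for alice from 192.0.2.10 port 60344 ssh2",
]
KNOWN_SOURCES = {"alice": {"192.0.2.10"}}

events = parse_accepted_logins(SAMPLE_LOG)
suspicious = logins_from_new_sources(events, KNOWN_SOURCES)
for e in suspicious:
    print(e["time"], e["user"], e["src"], e["port"], e["method"])
```

A login from an unfamiliar address is only a hint, since researchers travel and collaborate globally; in practice it would be weighed together with the other indicators listed above, such as unusual CPU load or jobs that bypass the scheduler.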
© HPC Today 2023 - All rights reserved.