Thoughts about Threads

By John Kirkley | August 09, 2017

Moore’s Law, which for decades has reliably predicted the doubling of transistor density, and with it processor performance, roughly every two years, is navigating a new set of hurdles. Conventional processors are running into limits at the quantum physics level and struggling with the amount of heat generated at these faster speeds. In response, processor manufacturers like Intel and AMD have turned to multicore technology and parallelization to boost processor power.

In the not too distant past, software engineers did not have to be concerned with creating complex multithreaded or multicore code – they only had to write single-threaded code and then rely on the hardware vendor to refresh the architecture every two to three years. Applications ran faster and the engineer didn’t have to lift a finger.

Focus on Parallelization
All that has changed with new, powerful architectures such as the Intel Xeon Phi series. To take advantage of the additional cores, processor manufacturers have had to add parallelization to both the hardware and software mix. One of the benefits is higher speed, made possible by lowering latency and raising throughput. This is especially important to developers focused on cutting the wall clock time for a single job (a telephone switch application, for example) and to those concentrating on throughput, where the goal is to process the most jobs in a given amount of time.

For developers, this shift means they have to start thinking about how to run applications in parallel from the earliest stages of the design process. Fortunately, a new set of options exists, including exploiting multithreading or configuring multiple processors to run side by side. A multicore system might have individual processors devoted to specific applications. For example, a Purchase Order Processing system might have one processor or a single thread dedicated to running Word-based applications. Another might handle Excel spreadsheets, and yet another might take on the tasks being processed by Internet Explorer. All these individual applications can run in parallel, and they may need interprocess communication to coordinate.
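As a rough illustration of the "one dedicated worker per kind of task" idea, here is a minimal Python sketch. It uses threads and in-memory queues as a stand-in for separate processors and interprocess communication. The names worker and run_dedicated_workers, the job labels, and the upper-casing "processing" step are all hypothetical, chosen only for this example.

```python
import queue
import threading


def worker(kind, inbox, results):
    """Handle jobs of one kind until a None sentinel arrives."""
    while True:
        job = inbox.get()
        if job is None:  # sentinel: no more work for this worker
            break
        # Stand-in for real processing of the job.
        results.put((kind, job.upper()))


def run_dedicated_workers(jobs_by_kind):
    """Start one worker thread per job kind and collect every result."""
    results = queue.Queue()
    inboxes = {kind: queue.Queue() for kind in jobs_by_kind}
    threads = [
        threading.Thread(target=worker, args=(kind, inbox, results))
        for kind, inbox in inboxes.items()
    ]
    for t in threads:
        t.start()

    # Route each job to the worker dedicated to its kind, then signal the end.
    for kind, jobs in jobs_by_kind.items():
        for job in jobs:
            inboxes[kind].put(job)
        inboxes[kind].put(None)

    for t in threads:
        t.join()

    # All workers have finished, so the results queue can be drained safely.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected)
```

Each worker only ever touches its own inbox and the shared results queue, so the workers never share mutable state directly. That is one common way to keep the communication between parallel tasks explicit.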

If the developer decides to build the parallel code on the multithreading capabilities of the system, life can become complicated. Learning how to write applications for multithreaded systems is widely regarded as one of the most difficult tasks in software programming, if not the most difficult. Previously, programmers had to deal with a single thread. Now a change to one part of a complex application designed for multithreaded processing can ripple through the entire system and force changes everywhere, creating new levels of complexity.

Thread integrity and safety become issues. New threads can get out of sync with existing threads, and the developer can have significant difficulty diagnosing the problem. Timing can be very tight – a millisecond of inconsistent behavior can spawn multiple problems that seem to take on a life of their own.
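To make the thread safety concern concrete, here is a minimal Python sketch of a shared counter guarded by a lock. Without the lock, two threads can read the same old value and each write back the same new one, silently losing an update. The names SafeCounter and count_in_parallel are hypothetical and exist only for this illustration.

```python
import threading


class SafeCounter:
    """A counter whose read-modify-write step is protected by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute this block.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def count_in_parallel(num_threads, increments_per_thread):
    """Have several threads increment one shared counter, then return the total."""
    counter = SafeCounter()

    def work():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

The lock trades a little speed for a result that no longer depends on how the threads happen to interleave, which is exactly the kind of timing-dependent behavior that is so hard to diagnose after the fact.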

Intel, Rogue Wave Software Collaboration
Debugging is the most difficult task in this environment, setting new challenges for developers and prompting them to start using tools that greatly simplify and abstract those complexities. These tools can deal with different languages such as C, C++ and that old workhorse, FORTRAN. VTune from Intel is one of those tools, as is TotalView for HPC, a leading debugger from Rogue Wave. The most recent collaboration between the two companies produced the TotalView Xeon Phi Debugger.

According to Rogue Wave, TotalView supports faster fault isolation, improved memory optimization, and dynamic visualization for high scale parallel and multicore applications that often involve hundreds or thousands of cores. TotalView includes a set of tools that provide scientific and academic developers with control over processes and thread execution, along with deep visibility into program states and data.

“TotalView allows simultaneous debugging of many processes and threads in a single window,” says Marty Bakal, principal product manager at Rogue Wave. “Developers can exercise complete control over program execution.” This allows developers to run, step, and halt their application in a single thread or within arbitrary groups of processes or threads. They can also work backwards from a failure through reverse debugging, which isolates the root cause faster by eliminating the need to repeatedly restart the application. This helps them reproduce and troubleshoot difficult problems that can occur in concurrent programs that take advantage of threads, OpenMP, MPI, GPUs, or coprocessors.

TotalView offers full support to developers, giving them the ability to view, control, and debug applications on the Intel Xeon Phi processor. For example, TotalView provides:

Support for Native, Offload and Symmetric programming models
Support for memory debugging for both native and symmetric applications on the Xeon Phi
Full asynchronous thread control
Sharing of certain breakpoints
Support for clusters and multidevice configurations
Support for launching MPI and hybrid MPI + OpenMP natively into Intel Xeon Phi

Looking Down the Road
Today, Intel is represented in the marketplace mostly by machines with 24 cores – 6 chips, each with 4 cores. But speculative road maps predicting processor evolution over the next few years forecast machines with 100 cores. Along the way there are some major speed bumps that will have to be overcome. For example, a developer may take a piece of conventional software, do all the right things to make it run efficiently in parallel, debug the entire system and drop it on an 8-core machine where it runs beautifully, taking advantage of all 8 cores. But when ported to a 96-core machine, everything grinds to a halt. This problem will be solved, but there is never a lack of challenges, such as new bottlenecks spawned by I/O and data access complexities.
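One rough way to see why an application that shines on 8 cores can disappoint on 96 is Amdahl's law, a standard model that this article does not itself name. It says the serial portion of a program caps the overall speedup no matter how many cores are added. The sketch below is illustrative only: the function name amdahl_speedup and the 95 percent parallel fraction are assumptions chosen for the example, and real slowdowns from lock contention, I/O or data access can be worse than this model suggests.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Estimate speedup over one core when only part of the work runs in parallel.

    parallel_fraction: share of the runtime (0.0 to 1.0) that can be parallelized.
    cores: number of cores available (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of core count;
    # only the parallel part shrinks as cores are added.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for n in (8, 96):
        print(f"{n} cores: {amdahl_speedup(0.95, n):.1f}x")
```

With these assumed numbers, the estimate is about 5.9x on 8 cores and about 16.7x on 96 cores, so twelve times as many cores yields less than three times the speedup. The remaining 5 percent of serial work ends up dominating the run time.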

Here are some other scenarios that are likely to show up in the near future, plus a few tips:

Systems will be architected with high-speed backplanes shared by all the racks, so that the whole system appears to be one big box. The applications will function as if there were 128 cores available on one machine.
Future software solutions will build on existing standards such as MPI and OpenMP. OpenMP hands off the threading tasks to the compiler, easing the developer’s task.
Tip – Think about parallelism early in the development cycle in order to take full advantage of its power and manage its complexity in the later stages of design.
Investigate all the available tools early in the development phase as well, preferably at the beginning of the design stage.
And finally, be sure to get a thorough grounding in parallel programming basics, including threading and multiprocessing. There is a whole new world of software development unfolding and parallel processing is at its heart.

# # #

This article was produced as part of Intel’s HPC editorial program, with the goal of highlighting cutting-edge science, research and innovation driven by the HPC community through advanced technology. The publisher of the content has final editing rights and determines what articles are published.