Researchers have developed a technique that can dramatically speed up certain types of computer programs while ensuring that the results remain accurate.
Their system accelerates programs that run in the Unix shell, a ubiquitous programming environment that was created 50 years ago and is still widely used today. Their approach parallelizes these programs, which means that it splits program components into pieces that can be run simultaneously on multiple computer processors.
This allows programs to perform tasks such as web indexing, natural language processing, or data analysis in a fraction of their original runtime.
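As a rough illustration of what splitting a component into simultaneously running pieces means, the sketch below divides a list of input lines into chunks, processes the chunks concurrently, and stitches the outputs back together in order. It is not PaSh code: the function names (`to_upper`, `split_into_chunks`, `run_parallel`) are hypothetical, and threads are used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def to_upper(lines):
    """A stateless stage: each output line depends only on its own input line."""
    return [line.upper() for line in lines]


def split_into_chunks(lines, n_chunks):
    """Split a list into contiguous pieces, preserving the original order."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def run_parallel(stage, lines, n_workers=4):
    """Apply `stage` to each chunk concurrently, then concatenate the outputs."""
    chunks = split_into_chunks(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks
        parts = list(pool.map(stage, chunks))
    return [line for part in parts for line in part]


lines = ["alpha", "beta", "gamma", "delta", "epsilon"]
sequential = to_upper(lines)
parallel = run_parallel(to_upper, lines)
```

The point of the example is that `parallel` and `sequential` hold the same result: splitting is only safe because this particular stage treats every line independently.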
“There are many people who use such programs, such as data scientists, biologists, engineers and economists. Now they can automatically speed up their programs without fear of getting the wrong results,” said Nikos Vasilakis, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT.
The system also makes things easier for the programmers who develop the tools that data scientists, biologists, engineers and others use. They don’t need to make any special adjustments to their program commands to enable this automatic, error-free parallelization, added Vasilakis, who chairs a committee of researchers from around the world who have been working on the system for almost two years.
Vasilakis is the senior author of the group’s latest research paper, which includes MIT co-author and CSAIL graduate student Tammam Mustafa and will be presented at the USENIX Symposium on Operating Systems Design and Implementation. Co-authors include lead author Konstantinos Kallas, a graduate student at the University of Pennsylvania; Jan Bielak, a student at Warsaw Staszic High School; Dimitris Karnikis, a software engineer at Aarno Labs; Thurston H.Y. Dang, a former MIT postdoc who is now a software engineer at Google; and Michael Greenberg, assistant professor of computer science at the Stevens Institute of Technology.
Decades-old problem
This new system, known as PaSh, focuses on programs, or scripts, that run in the Unix shell. A script is a sequence of commands that instructs a computer to perform a calculation. Accurate, automatic parallelization of shell scripts is a thorny problem that researchers have grappled with for decades.
The Unix shell remains popular, in part because it is the only programming environment that lets a single script be composed of functions written in multiple programming languages. Different programming languages are better suited to specific tasks or types of data; if a developer uses the right language, a problem can be much easier to solve.
“People also enjoy developing in different programming languages, so composing all these components into a single program is something that happens very frequently,” Vasilakis added.
Although the Unix shell allows multilanguage scripts, its flexible and dynamic structure makes such scripts difficult to parallelize using traditional methods.
Parallelizing a program is usually difficult because some parts of the program depend on others. These dependencies determine the order in which the components must run; get the order wrong and the program fails.
When a program is written in a single language, developers have explicit information about its features and about the language, which helps them determine which components can be parallelized. But such tools do not exist for Unix shell scripts. Users cannot easily see what is happening inside the components or extract information that would help with parallelization.
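One way to picture this ordering constraint is as a dependency graph. The sketch below, which uses Python's standard `graphlib` module, groups a set of made-up script components into "waves": everything in a wave can run at the same time, but a wave cannot start until the previous one has finished. The component names and the `parallel_waves` helper are assumptions for illustration, not part of PaSh.

```python
from graphlib import TopologicalSorter

# Hypothetical script components, each mapped to the components it depends on.
dependencies = {
    "fetch": set(),
    "clean_a": {"fetch"},
    "clean_b": {"fetch"},
    "merge": {"clean_a", "clean_b"},
}


def parallel_waves(deps):
    """Group components into waves whose members do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # components whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = parallel_waves(dependencies)
```

Here `clean_a` and `clean_b` land in the same wave and could run side by side, while `merge` has to wait for both.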
A just-in-time solution
To overcome this problem, PaSh uses a preprocessing step that inserts simple annotations onto the program components that it thinks could be parallelizable. PaSh then attempts to parallelize those parts of the script while the program is running, at the exact moment it reaches each component.
This avoids another problem in shell programming: it is impossible to predict the behavior of a program ahead of time.
By parallelizing program components “just in time,” the system sidesteps this issue. It can effectively speed up many more components than traditional methods that try to perform parallelization in advance.
Just-in-time parallelization also ensures that the accelerated program still returns accurate results. If PaSh reaches a program component that cannot be parallelized (perhaps because it depends on a component that has not run yet), it simply runs the original version and avoids causing an error.
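The decide-at-runtime-and-fall-back idea can be sketched in a few lines. The example below is a toy model, not PaSh's implementation: the `ANNOTATIONS` table, the `"stateless"` label, and `run_just_in_time` are hypothetical names. A command annotated as safe to split is run on chunks concurrently; anything else runs in its original, unmodified form.

```python
from concurrent.futures import ThreadPoolExecutor


def upper(lines):
    """Safe to split: each line is handled independently."""
    return [line.upper() for line in lines]


def number_lines(lines):
    """Not safe to split: the output depends on each line's global position."""
    return [f"{i} {line}" for i, line in enumerate(lines, start=1)]


COMMANDS = {"upper": upper, "number_lines": number_lines}

# Hypothetical annotations: only commands listed here may be parallelized.
ANNOTATIONS = {"upper": "stateless"}


def run_just_in_time(name, lines, n_workers=2):
    """Decide at the moment of execution whether to parallelize `name`."""
    command = COMMANDS[name]
    if ANNOTATIONS.get(name) != "stateless" or len(lines) < n_workers:
        # Fallback: run the original command unchanged, so the result is correct.
        return command(lines), "original"
    size = -(-len(lines) // n_workers)  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(command, chunks))  # chunk order is preserved
    return [line for part in parts for line in part], "parallel"


data = ["a", "b", "c", "d", "e"]
upper_out, upper_mode = run_just_in_time("upper", data)
numbered_out, numbered_mode = run_just_in_time("number_lines", data)
```

In this toy version the fallback costs nothing in correctness: `number_lines` would produce wrong numbering if it were split naively, so it is simply left alone.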
“No matter the performance benefits, even if you promise to make something run in a second instead of a year, if there is any chance of getting incorrect results back, no one is going to use your method,” said Vasilakis.
Users do not need to make any changes to use PaSh; they can simply add the tool to their existing Unix shell and tell their scripts to use it.
Acceleration and accuracy
The researchers tested PaSh on hundreds of scripts, from classic to modern programs, and it did not break a single one. The system ran programs six times faster on average than the unparallelized scripts, and it achieved a maximum speedup of about 34 times.
It also sped up scripts that other approaches were unable to parallelize.
“Our system is the first to show this kind of fully correct transformation, but there is also an indirect benefit. The way our system is designed allows other researchers and industrial users to build on top of this work,” Vasilakis said.
He is keen to get additional feedback from users and to see how they improve the system. The open-source project joined the Linux Foundation last year, making it widely available to users in industry and academia.
Going forward, Vasilakis wants to use PaSh to tackle the problem of distribution: dividing a program to run across multiple computers rather than multiple processors within a single computer. He also wants to improve the annotation scheme so that it is more user-friendly and can better describe complex program components.
This work was supported in part by the Defense Advanced Research Projects Agency and the National Science Foundation.