Big data has entered the physics lab, says S. Ananthanarayanan.
Research at the frontiers of physics works with very high energy reactions, which generate quantities of data that we have not encountered before. Alexander Radovic, Mike Williams, David Rousseau, Michael Kagan, Daniele Bonacorsi, Alexander Himmel, Adam Aurisano, Kazuhiro Terao and Taritree Wongjirad, from universities in France, Italy and the USA, review, in the journal Nature, the challenges and opportunities of using machine learning to deal with this avalanche of information.
Physics, for all its success, remains an incomplete study. The standard model of particle physics, which describes the known fundamental particles, is unprecedented in the accuracy of its predictions at the very small scale. But this model does not deal with the force of gravity. Gravity has negligible effect at that fine scale, because the masses involved are so small and the electrical and nuclear forces so much stronger, but the theory is not suited to large masses and distances, where gravity dominates.
Even at this small scale, there are loose ends that need to be tied up. The properties of the neutrino, a very light, uncharged particle that is very difficult to detect, and of the Higgs boson, the particle implicated in the inertia of matter and difficult to create, have not been fully measured and documented. To complete our understanding of even the standard model, we need to see how matter behaves at distances smaller than the closest we have probed so far. As particles repel each other strongly at very short distances, we need very high energies to bring them close enough.
This is the motivation for the high-energy accelerators that have been constructed over the decades, the greatest of them all being the Large Hadron Collider (LHC) at CERN, near Geneva. The LHC, over its 27 km ring, accelerates protons to nearly the speed of light, so that their collisions have the energy required to create new, massive particles, like the Higgs. While the Higgs itself is a rare event, there are billions of other products of the collisions, which need to be detected and recorded. The authors of the Nature article write that the arrays of sensors in the LHC contain some 200 million detection elements, and the data they produce, even after drastic reduction and compression, amounts in an hour to as much as Google handles in a year.
It is not possible to process such quantities of data in the ordinary way. An early large-data challenge was the mapping of the 3.3 billion base pairs of the human genome. One of the methods used was to press huge numbers of computers, worldwide, into action, through a program in which any computer connected to the Internet could participate, contributing its idle time (between keystrokes, for instance) to the effort.
Such methods, however, would not be effective with the data generated by the LHC. Even before the data is processed, the Nature paper says, the incoming stream is filtered so that only one item in 100,000 is retained. And while machine-learning methods are used to analyse the data finally kept, even the hardware that does the initial filtering runs machine-learning routines, the paper says.
Machine learning consists of computer and statistical techniques that search for patterns or significant trends within massive data, where conventional methods of analysis are not feasible. In simple applications, a set of known data is analysed to find a mathematical formula that fits its distribution. The formula is then tested on more known examples and refined, so that it makes correct predictions on unknown data too. The technique can then be used to devise marketing strategies, weather forecasts and automated clinical diagnoses. While powerful computers could process huge data sets and perform well, without taking too long over it, it was realized that there were situations where the animal brain could do better. The animal brain does not use the linear method of the conventional computer, of fully analysing all the data; it ‘trains itself’ to pick out significant data elements and trims its responses according to how good its predictions turn out to be.
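As a simple illustration of the fit-and-test idea described above, the short Python sketch below invents some noisy measurements, fits a formula (here, a quadratic) to most of them, and then checks the formula's predictions on examples held back from the fit. The numbers and the choice of a quadratic are purely illustrative, not drawn from any real data set.

import numpy as np

rng = np.random.default_rng(0)

# "Known" examples: measurements y that roughly follow a quadratic trend in x.
x = np.linspace(0.0, 10.0, 200)
y = 2.0 * x**2 - 3.0 * x + 5.0 + rng.normal(scale=4.0, size=x.size)

# Split the known examples into a set used to find the formula
# and a set held back to test it.
indices = rng.permutation(x.size)
train, test = indices[:150], indices[150:]

# Find a mathematical formula (a quadratic) that fits the training data.
coefficients = np.polyfit(x[train], y[train], deg=2)
formula = np.poly1d(coefficients)

# Check how well the fitted formula predicts the examples it has not seen.
test_error = np.mean((formula(x[test]) - y[test]) ** 2)
print("fitted coefficients:", coefficients)
print("mean squared error on unseen examples:", test_error)

If the error on the unseen examples is large, the formula is refined (for instance by changing its form) until it predicts well on data it was not fitted to, which is the essence of the training-and-testing cycle described above.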
Computer programmes were hence written to simulate the animal brain, in the form of ‘neural networks’, or computations that behave like nerve cells. In a simple instance of recognising just one feature, the feature could be presented to a single virtual neuron. The neuron responds at random, choosing from a set of possible responses. If the answer is correct, feedback raises the probability of that response; if the answer is wrong, the feedback lowers it. We can see that this device would soon learn, through a random process, to make the correct response consistently. A set of artificial neurons that pass their responses to another set, and so on, can deal with several inputs and greater complexity. A network like this could learn to identify an image as being that of a car or a pedestrian, for instance, and if a pedestrian, whether a man or a woman!
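The learning rule just described, of a single virtual neuron nudging its response probabilities according to feedback, can be sketched in a few lines of Python. The two responses, the step size of 0.05 and the number of trials are all made-up illustrative choices, not taken from the Nature paper.

import numpy as np

rng = np.random.default_rng(1)

responses = ["car", "pedestrian"]        # the set of possible responses
probabilities = np.array([0.5, 0.5])     # start with no preference
correct_response = "pedestrian"          # what the presented feature actually is

for step in range(200):
    # The neuron responds at random, according to its current probabilities.
    choice = rng.choice(len(responses), p=probabilities)

    # Feedback: a correct answer raises the probability of that response,
    # a wrong answer lowers it.
    if responses[choice] == correct_response:
        probabilities[choice] += 0.05
    else:
        probabilities[choice] -= 0.05

    # Keep the numbers valid as probabilities.
    probabilities = np.clip(probabilities, 0.01, None)
    probabilities = probabilities / probabilities.sum()

print("learned probabilities:", dict(zip(responses, probabilities)))

After a couple of hundred trials the probability of the correct response climbs towards one, which is the random-but-self-correcting learning the passage above describes.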
This architecture has now been adopted at the LHC to decide which data to keep for further analysis and which to discard. The paper says that suitable algorithms, and neural networks, have been developed to satisfy “the stringent robustness requirements of a system that makes irreversible decisions (i.e., discard of data).”
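To illustrate the principle, and only as a hypothetical sketch rather than the actual LHC trigger logic, the Python fragment below uses a tiny network with invented weights to score made-up ‘events’ and irreversibly discards those scoring below a threshold.

import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented weights of a tiny, already-'trained' network:
# 3 input features -> 4 hidden units -> 1 score between 0 and 1.
w1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
w2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def score(event_features):
    hidden = np.tanh(event_features @ w1 + b1)
    return sigmoid((hidden @ w2 + b2)[0])

# A batch of made-up "events", each summarised by three numbers.
events = rng.normal(size=(10, 3))

# The irreversible decision: events scoring below the threshold are dropped.
threshold = 0.5
kept = [event for event in events if score(event) >= threshold]
print(f"kept {len(kept)} of {len(events)} events")

In the real trigger system the network is trained beforehand on examples of interesting and uninteresting collisions, and the decision has to be both extremely fast and extremely reliable, since discarded data cannot be recovered.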
The paper says that machine-learning methods have enabled rapid processing of data that would otherwise have taken many years. Apart from the need for speed, the algorithms have to be adapted to the specific signatures being looked for: for instance, the decay products that could signal the very rare creation of a Higgs boson, events that would test the standard model, or neutrino interactions. The data-handling demands will only increase, the paper says, as the data from the LHC will grow by an order of magnitude within a decade, “resulting in much higher data rates and even more complex events to disentangle.” The machine-learning community is hence at work to discover new and more powerful techniques, the paper says.
Discoveries in physics have traditionally been based on data. A turning point in the history of science came when it was firmly established that the Earth and the planets go around the sun, and not that the sun and planets go around the Earth. Copernicus proposed this epochal idea, and Johannes Kepler confirmed and refined it, working entirely from the massive astronomical data collected by Tycho Brahe. The data, gathered with rudimentary instruments, was voluminous and the analysis painstaking.
In contrast, we now use massive computing power and far more sophisticated, and vastly more expensive, methods of acquiring data. While the puzzle worked out by Brahe and Kepler was basic and changed the course of science, the problem now being looked at is of a complexity and character that could not have been imagined even a century ago.
Do respond to: response@simplescience.in