2016 Yunqi Conference | Alibaba Cloud's Tang Hong: Feitian, giving the world an answer

You know that to the left lies a life of revelry and to the right lies the Bohai Sea Castle, but you cannot know which way will ultimately give you a better future.

You know you should go down on one knee and propose to the one you love, but you don't know how to get through every day and night ahead so that you can finally grow old with her.

Each road leads to more roads, and each choice brings more choices. Standing where we are and relying on the "many truths" we have heard to try to "live this life well" may simply make God laugh.

This is the predicament each of us faces. Our sophisticated brains can make the most favorable judgment for ourselves when the logic is simple, but in the face of countless compounding choices they are as powerless as dust in the wind.

Because the choices in this world, and the answers they lead to, outnumber the stars in the universe.

Suppose I give you thousands of small stones of different shapes to put in your backpack. How would you know which combination of stones fills the backpack exactly?

Tang Hong, chief architect of Alibaba Cloud's system, uses this seemingly simple problem to describe the ultimate purpose of Alibaba Cloud.

Chief architect of Alibaba Cloud System Tang Hong

In theory, the answer can be found by exhaustively trying every combination of stones. In practice, under the Turing machine model of computation, the amount of computation grows exponentially with the number of stones to choose from: every stone added doubles the work.
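To make the doubling concrete, here is a minimal sketch of the exhaustive approach in Python. It simplifies the problem by treating each stone as an integer size and the backpack as an integer capacity. The function name `fill_exactly` and the sample sizes are illustrative assumptions, not part of Tang Hong's description.

```python
from itertools import combinations

def fill_exactly(stones, capacity):
    """Try every subset of stones until one sums exactly to capacity.

    Returns (subset, subsets_checked); subset is None if nothing fits.
    Assumption: a stone is modeled as a positive integer size.
    """
    checked = 0
    for size in range(len(stones) + 1):
        for subset in combinations(stones, size):
            checked += 1
            if sum(subset) == capacity:
                return subset, checked
    return None, checked

stones = [3, 5, 7, 11, 13]
print(fill_exactly(stones, 21))          # a subset that fills the backpack
print(fill_exactly(stones, 2))           # no fit: all 2**5 subsets are tried
print(fill_exactly(stones + [17], 2))    # one more stone: twice as many subsets
```

With n stones there are 2**n subsets to try in the worst case, so thousands of stones put exhaustive search far out of reach.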

Finding this single answer for thousands of stones would require more computing power than all of humanity's computers combined. Philosophically, it is just like the choices of a lifetime: unless you try every possibility you cannot know the correct answer, and God will not give you the chance to try them all.

This fact is awe-inspiring.

Yet in Tang Hong's eyes, Feitian (literally "Flying Sky", known in English as Apsara), the core operating system of Alibaba Cloud, exists precisely to pursue these "final answers".

Find "that door"

In the classic science fiction film "2001: A Space Odyssey", a prehistoric ape throws a bone into the blue sky, and it instantly becomes a spacecraft in orbit.

This scene is hailed as a classic shot in science fiction cinema and has moved countless viewers to tears. Yet when you think about it, is launching a spaceship into space not the same in principle as throwing a huge bone into a higher sky?

An action that any primitive human could perform becomes, once its scale is expanded enormously, a feat achievable only with advanced technology that took many thousands of years to arrive.

At the "atomic" level, the work Feitian does is sorting and counting. Take Taobao as an example: it means working out how many of each product were sold and when. Once this kind of work reaches large scale and the various data become interrelated, it is hard to get the correct answer in a short time.

Computing over the information of hundreds of millions of sellers and buyers yields far more than simple inventory and sales figures.

Through data analysis, users can find the products they want as quickly as possible.

Through data association, a user's gender and preferences can be inferred, and products can be recommended accurately according to context.

Through data integration, it is possible to judge whether a person has a blemished credit record, so that financial products can set an applicant's loan limit and keep the bad-debt rate under control.

The features we enjoy on Taobao, Tmall, and Ant Financial are miracles achieved by "large-scale computing". The child's play of "1+1" has, without our noticing, become the "big data" that we hold in awe.

Continuously scheduling this "big data" can produce a "hand of God" effect.

Large-scale task scheduling is very complicated. Route planning for taxis, flight scheduling at airports, and even Walmart's inventory management cannot be handled by the human brain, because every scenario has almost unlimited possibilities and choices.

Tang Hong said.

At this point, we run into the "stones in the backpack" problem again.

Well-known asymmetric encryption and today's popular blockchain technology are both built on the assumption of computational intractability. Feitian, of course, cannot break through the limits of computation either, so we can only give an approximate solution, not the theoretical optimum.

There is only one optimal solution, but there are countless approximate ones. Finding the most suitable approximate solution in the vast sea of solutions is where Feitian's cutting edge lies.

"Ever since Feitian was born, this is what we have been doing," he said.

Because the business and data of each industry have their own characteristics, Feitian's computing clusters can put their best steel on the blade's edge, concentrating resources where they matter most. Models and algorithms are used to rule out some of the impossible "subsets", which shrinks the space of candidate solutions to some extent. Even so, the computation required still exceeds imagination. At that point, another set of rules and models is needed to make choices based on "value judgment".

It is like being in a huge maze of rooms. Each room has many doors; open one and you enter a new room, which also has many doors. To find the final exit, you need to rule out the doors that certainly do not lead there, and then pick the most promising door among those that remain.
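The maze picture can be sketched as a pruned search over the backpack problem. This is only an illustration under stated assumptions, not Feitian's actual algorithm: stones are positive integers, the "value judgment" is simply to try large stones first, and a hypothetical `budget` parameter caps how many rooms may be visited, so the result may be approximate.

```python
def best_fill(stones, capacity, budget=10_000):
    """Search for the largest total <= capacity, visiting at most `budget` rooms.

    Returns (best_total, rooms_visited). Assumes positive integer stone sizes.
    """
    stones = sorted(stones, reverse=True)        # value judgment: big stones first
    # suffix_sum[i] = total size of stones[i:], an optimistic bound on what is left
    suffix_sum = [0] * (len(stones) + 1)
    for i in range(len(stones) - 1, -1, -1):
        suffix_sum[i] = suffix_sum[i + 1] + stones[i]

    best = 0
    visited = 0

    def search(i, total):
        nonlocal best, visited
        if visited >= budget or best == capacity:
            return                               # out of budget, or already perfect
        visited += 1
        best = max(best, total)
        # Closed door: even taking every remaining stone cannot beat the best so far.
        if i == len(stones) or total + suffix_sum[i] <= best:
            return
        if total + stones[i] <= capacity:        # most promising door: take the stone
            search(i + 1, total + stones[i])
        search(i + 1, total)                     # other door: skip the stone

    search(0, 0)
    return best, visited

print(best_fill([4, 6, 9], 14))             # full search with pruning
print(best_fill([4, 6, 9], 14, budget=2))   # tight budget: approximate answer
```

With an ample budget the search returns the true optimum while skipping doors that cannot help; with a tight budget it returns the best total found so far, which is the kind of trade-off between effort and answer quality described above.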

With no perfect answer, the optimization of these models and judgments is endless. This is somewhat analogous to "evolution."

However, this is far from the only problem large-scale computing faces. Tang Hong described another important one to Lei Feng Net: the difficulty of automation.

Computation beyond the limits of the human brain has to be carried out by automated programs. But once the volume of data grows large, what used to be corner cases become serious matters, and the question is how to keep the automation running normally. Low-probability events such as computation errors, commands failing to be delivered, and hung systems become all but inevitable because of the sheer size of the data and the computing clusters.

At that point you need to observe and monitor every tiny step of the computation in order to detect errors. But implementing monitoring for every step on every path of a cloud computation is not easy.
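A rough calculation shows why rare events stop being rare at scale. The sketch below assumes every step fails independently with the same probability, which real clusters only approximate; the function name and the numbers are illustrative.

```python
import math

def prob_any_failure(p, n):
    """Probability that at least one of n independent steps fails,
    when each step fails with probability p (0 <= p < 1)."""
    # Equivalent to 1 - (1 - p)**n, computed with log1p/expm1
    # so that it stays accurate when p is tiny.
    return -math.expm1(n * math.log1p(-p))

for n in (1, 1_000_000, 10_000_000):
    print(n, prob_any_failure(1e-6, n))
```

Under these assumptions, a one-in-a-million failure chance per step means a job of ten million steps will almost certainly hit at least one failure.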

To trace a computation, each step must be "stained", that is, tagged with a marker that the system can use to correlate events. This requires embedding the tag during processing. But every step in the system's operation is different, and the best point at which to embed it has to be found case by case. Harder still, some steps offer no way to embed information at all. In that case, we have to infer what happened from the surrounding information.


For example, when a process opens a file, the operating system represents that file with a descriptor. By correlating descriptors, events that occur in the operating system kernel can be tied back to user processes.


All of this tracing is a specific solution to a specific problem, and you can imagine how much effort it takes.

He said.
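The file descriptor example can be sketched in a few lines of Python. This is a toy illustration, not Feitian's tracing system: the `FdTracer` class, the trace id `job-42`, and the event dictionary are all hypothetical, and the "kernel event" is simulated in user space. Only the standard `os.open`, `os.read`, and `os.close` calls are real.

```python
import os
import tempfile

class FdTracer:
    """Hypothetical tracer: remembers which task ("stain") opened each descriptor."""

    def __init__(self):
        self._fd_to_trace = {}

    def open(self, path, flags, trace_id):
        fd = os.open(path, flags)
        self._fd_to_trace[fd] = trace_id     # embed the stain at open time
        return fd

    def attribute(self, fd):
        """Tie a low-level event on fd back to the task that owns it."""
        return self._fd_to_trace.get(fd, "unknown")

    def close(self, fd):
        os.close(fd)
        self._fd_to_trace.pop(fd, None)

def demo():
    tracer = FdTracer()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
        path = tmp.name
    fd = tracer.open(path, os.O_RDONLY, trace_id="job-42")
    try:
        data = os.read(fd, 5)
        # Simulated low-level event: it carries only the descriptor, not the task.
        event = {"fd": fd, "op": "read", "bytes": len(data)}
        owner = tracer.attribute(event["fd"])
    finally:
        tracer.close(fd)
        os.remove(path)
    return owner, event["op"], event["bytes"]

print(demo())
```

The event itself knows nothing about the task; the association is recovered purely from the descriptor, which is the inference step the passage describes.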

Tang Hong and Feitian spent seven years stepping into these "pits" and climbing out of them one by one. Yet all of this small, accumulated evolution rests on Feitian's "power".

Tang Hong said that Feitian is like one huge computer, except that it is a large computing cluster made up of hundreds of thousands of servers in data centers around the world, connected by dedicated lines.

Reading this far, you may suddenly realize that scheduling hundreds of thousands of servers worldwide is itself a huge "stones in the backpack" challenge.

Tang Hong told Lei Feng Net the story of Feitian's birth.

At that time, Alibaba was in a sense standing at the edge of a dangerous cliff. And Feitian, still in its infancy, had a beautiful dream called "5K".

5K, a statue

Wang Jian always wears a smile and looks lean and angular. Yet it was this man who withstood all the doubts and pressure and became the hero who guarded Alibaba Cloud as it grew from a seedling into a towering tree.

Everyone, including Tang Hong, calls Wang Jian "the Doctor". Back then, it was a transoceanic phone call from the Doctor that brought Tang Hong back from the United States.

In 2012, Feitian was still in the early stages of R&D. It could only support clusters of up to 1,500 machines, and its bugs often drew customer complaints. At that time, many of Alibaba's data processing tasks ran on the open source Hadoop software, in clusters of about 3,000 to 4,000 machines.

Wang Jian, Chairman of Alibaba Group Technical Committee

However, Hadoop was not designed for public cloud computing, in terms of either security or operating logic.

Hadoop couples storage and computing: if you need ten machines' worth of storage but not ten machines' worth of computing power, you still need a ten-machine cluster, which is wasteful. Such problems can be worked around by other means, but operating efficiency drops sharply.

Hadoop's account system is not designed for Internet tenants; it is oriented more toward local administrators. This means it cannot be sold to Internet-scale users the way Alibaba Cloud is today.

In addition, Hadoop's high degree of flexibility allows users at the application layer to access the underlying files directly. This poses a huge security risk and makes it impossible to run directly on a public cloud in a shared, multi-tenant fashion.

Tang Hong told Lei Feng Net that all of the above drawbacks are hard to get around by patching Hadoop. The more critical issue was that the growth of Alibaba's business was approaching the limit of what the Hadoop clusters could compute. Once that bottleneck was reached, the company would be forced to degrade its services or actively limit the scale of its business.

Before this, Alibaba's attempts to replace Hadoop with Feitian had all failed. Wang Jian, then Alibaba's CTO, told all of his Alibaba colleagues that Feitian should still shoulder this task, because he had a plan in mind: to grow Feitian into a cluster of 5,000 servers. This crazy idea, which made many colleagues laugh at the time, was "5K".

Going from 1,500 to 5,000 is not just a matter of buying 3,500 more machines. Tang Hong said:

Moore's Law means that the performance of computer hardware grows exponentially over time. Going from 1,500 old machines to 5,000 new ones increases computing power 8 to 10 times. Achieving such a large upgrade in software capability in less than six months is, to anyone familiar with the laws of software engineering, an impossible challenge.


Yet Feitian really did it. Plenty of companies in the world can afford 5,000 servers, but very few have truly developed their own technology to schedule a cluster of that size.

Why does Alibaba Cloud have to build a cluster of 5,000 machines?

Tang Hong told Lei Feng Net that for Alibaba, many computing tasks cannot be split up and must be completed within a single computing cluster. Data this heavily interrelated requires that many machines to work together in one cluster.

At the presidents' meeting at the end of 2012, some people still wanted to fall back on Hadoop. In the end, the presidents decided to pursue both paths, Hadoop and Feitian. In my view, that decision split our technical resources. In fact, based on the results of simulation experiments within the team, I was already quite sure that reaching 5K was not a question of architecture design but a matter of polishing product details. Colleagues in other departments may not have seen it that way. Walking both roads at once greatly dilutes technical strength, so the sooner 5K was delivered, the sooner the debate could be settled.

He said.

August 15, 2013 is a date Tang Hong can recite without thinking. On that day, the facts proved that the judgment of Wang Jian, Tang Hong, and Feitian was correct. The 5K cluster went online; in fact, two 5K clusters went online at the same time.

In Alibaba's Yunqi Town there is a 5K statue. It is smaller than one might expect, but that does not stop it from radiating a sense of idealism to its surroundings. Many passers-by stop to look.

The Feitian 5K statue in Yunqi Town

Tang Hong told Lei Feng Net that events proved a Hadoop cluster runs into an insurmountable bottleneck once it exceeds 4,000 machines. Had Alibaba chosen to rely on Hadoop at the time, its rapid business growth would have hit a computing performance "cliff", with consequences that are hard to imagine.

Like the Boeing 747, Intel's x86 chips, and SpaceX's rockets, all great creations are born of desperation.

Tang Hong reflected.

"Eat your own dog food"

I remember that when we were recruiting in 2012 and 2013, very few students from Tsinghua and Peking University came to interview, and even among those who got our offer, many made Baidu or Tencent their first choice.

Tang Hong told Lei Feng Net this sad fact as if it were a joke. In this year's Ali Star interviews, Tang Hong asked students why they had chosen to come to Alibaba Cloud. The answer was: "Alibaba Cloud's cloud computing technology is leading."

He briefly described Feitian as it stands today:

A single cluster reaches 10,000 machines, there are hundreds of thousands of servers worldwide in total, and the data platform processes more than 1 PB of data per day.

Yet he felt that none of this fully conveys Feitian's strength. What is truly convincing is that Alibaba "eats its own dog food": for some of its important businesses, including Taobao and Tmall, and including "Double 11" with its staggering volume of instantaneous concurrent computation, Feitian handles the cloud computing scheduling. In other words, Alibaba itself uses the same cloud services it provides to other customers, with no privileges and no distinction.

"I believe some domestic competitors certainly do not run their own services on the cloud they offer," he said.

One figure Alibaba Cloud is proud of: 37% of websites in China are built on Alibaba Cloud.

Our expectation is that in the future there may be only a few "computers" in the world, each made up of very large computing clusters. When that time comes, computing power will be completely liberated.

If Tang Hong's prediction comes true, the "stones in the backpack" computation will yield solutions closer to the truth than today's. We may never find the right answer, but there is no doubt that every small step a machine takes toward the correct answer is a glory for humankind.

Give the world an answer

Every year, Alibaba Cloud holds the Yunqi Conference in its hometown of Hangzhou to announce the latest developments in Alibaba Cloud and Feitian. At this year's conference, Ma Yun delivered an impassioned speech.

What used to be machine manufacturing will become artificial intelligence. Machines used to consume electricity; the machines of the future will consume data.

Ma Yun said.

Clearly, machine learning and artificial intelligence are the next move.

For human beings, the roads beyond each road and the choices beyond each choice make us feel small and afraid. To counter the impermanence of fate, we invented cloud computing and we invented artificial intelligence. Whether it is Tang Hong or Feitian, the countless chips at work and the countless currents flowing through the network have only one purpose:

Give the world an answer.