Introduction to Graphics Optimization
VGO Has Changed
Submitted by epreisz on Sat, 11/17/2007 - 03:11.
A disruptive technology is one that changes the way we live and perform our daily tasks. The TV and the phone are two examples of disruptive technologies that affect the way we live every day. Many companies claim that their products are the most disruptive, but it’s hard to argue that any technology of the past 20 years has been nearly as disruptive as the personal computer. Our reliance on personal computers drives R&D budgets that fund the development of new, more complex, and, more importantly, faster processing hardware.

The evolution of microprocessors resulted from the demand for more processing power, and microprocessor developers have continually shifted their development practices to meet it. Chipmakers, focused on raw processing horsepower in the past, shifted their goals to instruction level parallelism. Currently, their focus is on multiple cores. Moore’s law, which describes the growth rate of transistor counts, is no longer the measure by which we predict performance.

Memory architectures, often overlooked by those who wish to optimize, evolved to accommodate the blazing power of modern processors. Maintaining cache efficiency is extremely important to application performance.

The tools that we use to optimize and compile our programs are not immune to change either. In fact, most of the tools we use today to optimize a game weren’t available to the public ten years ago. Advancements in compiler technology make it unlikely that anyone other than an expert assembly programmer can produce more efficient code than a modern compiler generates.

As hardware evolves, so must our techniques for optimizing. The single biggest change in video game optimization is the use of hardware graphics cards. Adding a graphics card to a computer essentially adds another core: the GPU. Optimizing for multiple cores is very different from optimizing for one.
Also, the GPU is a processor that uses a different architecture from that of a CPU.

Video Game Optimization and Assembly

When you mention optimization, some people immediately think of assembly. Although there is a place for assembly in optimization, we rely on it far less than we did only ten years ago. There are several reasons for this, and they are likely to hold in the future. The days of hand coding assembly for performance aren’t gone, but they are waning.

3rd party code

First, many of our bottlenecks and hotspots reside in code we didn’t write. Years ago it wasn’t uncommon to write your game with little help from outside APIs. In fact, for a long time, there were no other options and no novel hardware that required an API to access. Now, however, it is common to use many different APIs and many different pieces of hardware. For many APIs, there is no publicly available source code. This is especially true of graphics APIs. A large majority of our existing optimization opportunities lie in code we didn’t write. If you don’t have access to the code, how is it possible to optimize it with assembly?

Also, firmware, the software embedded in our hardware, is read-only, and we have no means of changing it (nor should we). Over time, code that we once wrote in software migrated to specialized hardware that can calculate the results much faster than we can in software. Excluding the programmable pipeline, we have lost direct control of a lot of the rendering code that drives our games. There are pros and cons to this fact, but the main pro, dramatically faster execution, far outweighs the loss of the control we once had.

Small programs, called shaders, are the exception to the control that we have lost. Shaders expose sections of the hardware that are crucial to creative rendering techniques and give us an opportunity to adjust the way our graphics pipeline operates.
To recap, in the past, we could optimize our games using assembly because we had access to the entire pipeline. Now, we benefit from and depend on the APIs and hardware that keep us from optimizing with assembly. The software and hardware we use, not the software and hardware we write, control a large percentage of our games’ run time. If we didn’t write the code, or don’t have the source, then optimizing it with hand coded assembly is not possible.

Assembly and Cross-Platform

Our increased dependency on APIs is coupled with the increased availability and use of 3rd party game engines. The development costs of creating and maintaining a game engine are staggering. Although there are many who jump head first into this bottomless pit of work, there are just as many who are consolidating their efforts by sharing and purchasing game engines.

A robust engine requires a robust, elegant design. Of the engines that are available, many are cross-platform. In fact, the design required to support a cross-platform engine may be enough, by itself, to suggest that the engine follows a well developed design. This design will designate platform specific areas of the engine where it is safe to use single platform code.

When developing for cross-platform engines, we lose our ability to use many of the instructions available in assembly. Assembly code is not cross-platform code. The platform dependent sections of a cross-platform engine are usually not areas that are critical for performance, so even though there are sections where we can optimize with assembly, it is usually not required to do so.

Game engine prices, on the whole, are dropping. As they do, more and more people are likely to develop using a 3rd party engine. If those engines are cross-platform, and the developer wishes to keep the game cross-platform, then optimizing with assembly is less of an option.

Optimizing Compilers

The term “optimizing compiler” isn’t spoken very often anymore. Why?
Because most compilers optimize. Since there is no need to differentiate between a compiler and an optimizing compiler, we don’t. It is very difficult to hand write assembly code that is more efficient than what the compiler generates.

A relatively recent feature of Microsoft’s compiler, available in Visual Studio 2005, is whole program optimization. Not only can this compiler optimize code inside an object file, it can also optimize the interactions between object files. This optimization, which occurs at link time, can improve the performance of the code you write by as much as 20% simply by choosing a flag in the project’s settings.

Rising Tides

When optimizing, especially for the PC, we want to take advantage of optimizations that will “raise the tide for all ships”. In other words, we want to make optimizations that will increase performance for everyone. When we optimize, especially when optimizing line by line, we need to be careful that we aren’t just optimizing for one individual computer. An optimization that helps on one machine will occasionally run slower on another. It can be difficult to ensure that line-by-line optimizations will benefit all of the configurations of software, such as operating systems, and hardware found among our gaming audience.

Video Game Optimization and Parallelism

A major shift in programming is occurring due to the constraints imposed on us by single-threaded code. The rise of parallelism is the result of two physical limits: the heat caused by chip density and the latency between transistors. These limits cannot be overcome by cost efficient methods. Chipmakers have done a lot to delay the impending, fundamental change in the way we program by hiding elements of parallelism in our hardware; unfortunately, the need for parallel programming now looms over our heads like the sword of Damocles.

One major change that is only vaguely visible to programmers is instruction level parallelism.
In order to delay the need for coarser grained parallelism, chipmakers introduced longer pipelines and out of order execution, which enable our chips to perform many of our instructions in parallel. Without your realizing it, your code is executing with some level of parallelism, even when you are writing single threaded code.

More visible than the instruction level parallelism that occurs in a chip’s execution units are single instruction multiple data, or SIMD, instructions. These instructions, such as the ones included in SSE, SSE2, and SSE3, operate on 16 bytes of data in parallel. That means that the same operation on four 4 byte values can execute as if it were one instruction. Operating on four 4 byte values at once is useful to game programmers, since four dimensional vectors are the foundation of many math operations common to games.

As chipmakers continue to shift their focus from increasing transistor counts to increasing core counts, we as programmers must develop multi-threaded applications whose performance scales with the number of CPU cores. Writing multi-threaded code that doesn’t crash and is accurate is difficult. Writing a multi-threaded algorithm that scales with the number of cores is even more difficult.

Memory vs. Instruction Processing

The relationship between the performance of memory and the performance of instruction processing has also changed. Years ago, memory systems were able to satisfy the processor’s ability to consume data. This is no longer the case. Processor performance dictates the necessity for a multi-tiered cache system. If we use the cache correctly, which means understanding the assumptions of the chipmakers, we are able to keep our blazing processors fed. If we don’t, the result can be a very significant loss of performance. Understanding the nuances of memory, instruction processing, and parallel processing will be crucial as we move forward.
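
To make the point about using the cache correctly more concrete, here is a minimal sketch in C++. It is an illustration, not a benchmark: the function names and the idea of storing a grid in a single row-major `std::vector` are assumptions made for this example. Both functions compute the same sum, but the first reads memory in the order it is laid out, while the second jumps `cols` elements between reads. On large grids the second pattern tends to miss the cache far more often; how much that costs depends on the hardware, so measure before drawing conclusions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a rows x cols grid stored in row-major order, reading elements in the
// same order they sit in memory, so consecutive reads share cache lines.
long long sumRowMajor(const std::vector<int>& grid, std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += grid[r * cols + c];
        }
    }
    return total;
}

// Computes the same sum, but walks down each column first. Every read is
// `cols` elements away from the previous one, which is the access pattern
// the cache is least able to help with.
long long sumColumnMajor(const std::vector<int>& grid, std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += grid[r * cols + c];
        }
    }
    return total;
}
```

The two functions are interchangeable from the caller’s point of view, which is what makes this kind of problem easy to miss: the code is correct either way, and only a profiler or a hardware counter will show the difference.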
If you have ever introduced multi-threading to your application and seen no performance increase, it could be because this relationship was not taken into account.

Tools

Modern optimization tools are essential to our process, and their capabilities shape it. Some tools, such as Intel’s VTune and NVIDIA’s NVPerfHUD, give us direct access to counters on the hardware. Without this access, our ability to find bottlenecks and hotspots declines. Our modern tools also have useful visualizations. These visualizations give us multiple views into the source code, and although no tool points directly to an optimization solution, having multiple views gives us enough information to make accurate, educated guesses.

Graphics Cards

Lastly, and arguably most important, are graphics cards. These vertex, triangle, and pixel pushing wonders drive our ability to process incredible amounts of data. It’s likely that the impact of the graphics card will extend far beyond triangles and pixels. Graphics cards, usually clad and marketed with images that entice the gamer, also capture the attention of those who focus on parallel processors.

Using a GPU correctly requires tactics and knowledge that are different from those of a traditional rendering engine that relies on a CPU. The differences are complex; in simplified terms, however, a CPU is a low latency, low throughput device, and a GPU is a high latency, high throughput device. This fundamental difference makes processing small batches of data less efficient than processing large batches.

The GPU, with an architecture that is different from a CPU’s, requires us to arrange our data in a format that is suitable for parallel processing. This concept sometimes competes with the architecture of the CPU. There is an inverse relationship between the CPU’s goal of maximum culling and the GPU’s goal of large batches of data. For the most accurate culling, the CPU requires small groups of triangles.
For the best parallelism and throughput, the graphics card requires large groups of triangles. Where your game lies on this spectrum is a difficult question. The answer differs from game to game and depends on the specific details of your application.

The GPU’s role in processing, not just in gaming, is growing, and is likely to grow more in the future. The GPU’s capability extends beyond that of a video game processor. By decoupling the GPU from games, we can begin to see its potential as a parallel vector processor. Its most common application is to process floating point vectors in parallel. By understanding the strengths of the GPU, we are able to accelerate algorithms that can benefit from its architecture. Not only do we perform the algorithm faster, we also free the CPU so that it can do other processing.

The leader of a programming team should assign the right person to each job. The same is true for us as programmers. By understanding our hardware, we can use the best resource for the task. Many engines still use code developed before the invention of T&L (transform and lighting) graphics cards. It is likely that games using those engines will not reach the performance that modern graphics cards are capable of.

Cores and GPU Kernels

Although retail level multi-core CPUs are less than a year old, game programmers have been developing on multiple cores for almost ten years. The reason is our second core, the GPU. The GPU contains many kernels, which also execute in parallel. Optimizing for multiple cores and parallel stages is much different from optimizing for single core, serial operation.

Consider the following analogy: A dealer hands a deck of cards to one person and asks them to sort the cards into red and black. Let’s assume the task takes two minutes. If we were to optimize the process by 25%, the task would take one minute and thirty seconds. Now let’s look at the same task from the perspective of a multi-core processor.
A dealer hands a deck of cards to two people and asks them to sort the cards into red and black. Let’s assume person “A” sorts their share of the deck in one minute and person “B” sorts their share in one minute and thirty seconds. The duration of the task is equal to the time taken by the slowest person, not the sum of the two. If we were to optimize person “A” by 25%, the total task would still take one minute and thirty seconds.

What’s the point of this scenario? Optimizing a core that is not the slowest can yield no increase in performance. In the scenario above, optimizing person “A” didn’t increase performance, but optimizing person “B” would. The gains from optimizing person “B” would end as soon as “B” is no longer the slowest person.

This scenario can happen in our games as well. Consider a CPU optimization that increases the performance of collision detection. If the game’s slowest core is the GPU, then you may see no frame rate increase.

Sometimes, we can see an increase in performance by optimizing the faster core. This happens when the optimization affects a dependency of the slower core. For example, let’s assume that the slowest core is the GPU, and that the cause is the processing of too many vertices. By optimizing your culling algorithm to cull more objects, you would also reduce the number of vertices the GPU must process every frame.

GPUs are inherently parallel. One level of parallelism in the GPU uses a paradigm that is similar to an assembly line. Each kernel can operate independently, except that it depends on data from the kernel before it. Apart from that data dependency, the relationship between multiple cores and optimization described above also holds within the GPU. Therefore, when we optimize the GPU, we must also determine which kernel is causing our performance bottleneck.
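
The card sorting analogy reduces to a small piece of arithmetic: when workers run in parallel, the task takes as long as its slowest worker. The following C++ sketch models that and nothing more. The function names are hypothetical, each "worker" may stand for a person, a CPU core, the GPU, or a GPU kernel, and the model deliberately ignores the data dependencies between workers mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Each entry is the time (in seconds) one worker needs to finish its share.
// Workers run in parallel, so the task ends when the slowest one finishes.
double parallelTaskTime(const std::vector<double>& workerTimes) {
    assert(!workerTimes.empty());
    return *std::max_element(workerTimes.begin(), workerTimes.end());
}

// Returns the index of the slowest worker, which is the bottleneck and the
// only worker whose optimization can shorten the task in this model.
std::size_t bottleneckIndex(const std::vector<double>& workerTimes) {
    assert(!workerTimes.empty());
    return static_cast<std::size_t>(
        std::max_element(workerTimes.begin(), workerTimes.end()) - workerTimes.begin());
}
```

With person “A” at 60 seconds and person “B” at 90 seconds, `parallelTaskTime({60.0, 90.0})` is 90. Speeding “A” up by 25% gives `parallelTaskTime({45.0, 90.0})`, which is still 90. Speeding “B” up by 25% instead gives `parallelTaskTime({60.0, 67.5})`, which is 67.5, and further work on “B” stops paying off once “B” drops below 60 seconds and “A” becomes the bottleneck.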