An In-Depth Understanding of Direct3D 9
A deep understanding of D3D9 matters a great deal to graphics programmers. I have summarized some of my earlier study notes here in the hope that they will help some friends. Because these are scattered notes, the train of thought is rather tangled, so please bear with me.
In fact, once you fully understand the concepts of D3DLOCK, D3DUSAGE, D3DPOOL, lost devices, queries, Present(), BeginScene() and EndScene(), you can be said to understand D3D9. I wonder whether you feel the same. Here are a few questions; if you can answer them all, you have passed :).
1. What is the essential difference between D3DPOOL_DEFAULT, D3DPOOL_MANAGED, D3DPOOL_SYSTEMMEM and D3DPOOL_SCRATCH?
2. How exactly should D3DUSAGE be used?
3. What is an adapter? What is a D3D device? What is the difference between a HAL device and a REF device? How are the device type and the vertex processing type related?
4. How do the application (CPU), the runtime, the driver and the GPU work together? Are D3D API functions synchronous or asynchronous?
5. What actually happens when a device is lost? Why do D3DPOOL_DEFAULT resources have to be re-created after the device is lost?
There are three objects in D3D: the D3D object, the D3D adapter and the D3D device. The D3D object is very simple: it is a COM object that exposes the D3D functions for creating devices and enumerating adapters. An adapter is an abstraction of the computer's graphics hardware and software, and it contains the device. The device is the core of D3D. It wraps the stages of the entire graphics pipeline, including transformation, lighting and rasterization (shading). The pipeline differs between D3D versions; for example, D3D10 adds the geometry shader (GS) stage. All features of the graphics pipeline are provided by the driver, and there are two kinds of driver: the GPU hardware driver and the software driver. That is why D3D has two device types, REF and HAL. With a REF device, the graphics pipeline, including rasterization, is simulated on the CPU by the software driver. As the name suggests, the REF device is meant as a reference for hardware manufacturers, so it should be a pure software implementation with all the standard DX features. With a HAL device, the runtime uses the HAL (hardware abstraction layer) to make the GPU perform transformation, lighting and rasterization. Only the HAL device offers both hardware vertex processing and software vertex processing (a REF device generally cannot use hardware vertex processing unless you play tricks with the driver, as PerfHUD does). In addition, there is a less commonly used software device: users can write their own software graphics driver against the DDI, register it with the system, and then use it in their programs.
Checking the system's software and hardware capabilities
At the beginning of the program we determine the capabilities of the target machine. The main steps are:
Decide which buffer format you want
GetAdapterCount()
GetAdapterDisplayMode()
GetAdapterIdentifier() // get the adapter description
CheckDeviceType() // check whether the specified adapter supports a hardware-accelerated device
GetDeviceCaps() // get the capabilities of the device, mainly to check for hardware vertex processing (T&L)
GetAdapterModeCount() // get the number of display modes the adapter supports for the specified buffer format
EnumAdapterModes() // enumerate all display modes
CheckDeviceFormat()
CheckDeviceMultiSampleType()
For details, please refer to the DX documentation.
The Windows graphics system is divided into four layers:
Graphics application
D3D runtime
Software driver
GPU
The four layers are divided by function, and the boundaries between them are not really that clear; for example, the runtime actually also contains a user-mode part of the software driver. I will not go into the detailed structure here. The runtime contains a very important structure called the command buffer. When the application calls a D3D API, the runtime converts the call into a device-independent command and appends it to the command buffer. The size of this buffer changes dynamically with the workload. When the buffer is full, the runtime flushes all of its commands to the kernel-mode driver. The driver has a buffer of its own, which stores commands that have already been converted into hardware commands; D3D generally allows this buffer to hold at most 3 frames of graphics commands. Both the runtime and the driver optimize the commands in their buffers. For example, if we set the same render state several times in a row, the debug output shows the message "Ignoring redundant SetRenderState - X": the runtime has automatically discarded the useless state-setting command.
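The command buffer behaviour described above can be sketched as a toy model. This is only an illustration of the idea, not how the D3D runtime is implemented: the class CommandBuffer, its fixed capacity and all member names are hypothetical, and the real buffer changes size dynamically.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical, simplified stand-in for one device-independent command.
struct Command { int state; int value; };

// Toy model of a runtime command buffer (all names are assumptions).
class CommandBuffer {
public:
    explicit CommandBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Queues a state change. Returns false when the call is dropped because
    // the state already has this value (the "redundant SetRenderState" case).
    bool set_render_state(int state, int value) {
        auto it = current_.find(state);
        if (it != current_.end() && it->second == value) return false;
        current_[state] = value;
        pending_.push_back(Command{state, value});
        if (pending_.size() >= capacity_) flush();  // buffer full: hand over to the driver
        return true;
    }

    // Hands every pending command to the (imaginary) driver.
    void flush() {
        if (pending_.empty()) return;
        submitted_ += pending_.size();
        ++flushes_;
        pending_.clear();
    }

    std::size_t pending() const { return pending_.size(); }
    std::size_t submitted() const { return submitted_; }
    std::size_t flushes() const { return flushes_; }

private:
    std::size_t capacity_;
    std::map<int, int> current_;    // last value queued for each state
    std::vector<Command> pending_;  // commands not yet flushed
    std::size_t submitted_ = 0;
    std::size_t flushes_ = 0;
};
```

In this model a flush is the expensive step, since in the real system it involves a switch to kernel mode, which is why dropping redundant commands before they reach the buffer is worthwhile.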
D3D9 can use the query mechanism to work asynchronously with the GPU. A query is a command used to ask the runtime, the driver or the GPU about its state. A D3D9 query object has three states: signaled, building and issued. Signaled is the idle state. Issuing the beginning of a query moves the object into the building state, in which it starts recording the data the application asked for. Issuing the end of the query moves it into the issued state. Once the result has been produced and retrieved, the query object returns to the signaled state.
GetData is used to retrieve the query result; if it returns D3D_OK, the result is available. If the D3DGETDATA_FLUSH flag is used, the commands in the command buffer are also sent to the driver. We now know that most D3D API calls return as soon as the runtime has appended the command to the command buffer, so the real work happens asynchronously. You may then wonder how to measure the frame rate, and how to analyze GPU time. For the first question we need to know when a frame is finished, in other words whether the Present() call blocks. The answer is that it may or may not block, depending on how many commands the runtime allows in the buffer; if that limit is exceeded, Present blocks. If Present never blocked, the CPU would run far ahead of the GPU whenever the GPU had heavy drawing work, and the game logic would advance faster than the display, which is obviously unacceptable. Measuring GPU time is much more troublesome. First we have to solve the synchronization problem: because the CPU and the GPU work asynchronously, we need a way to know when the GPU has finished. In D3D9 the query mechanism does this. Let us look at the example from "Accurately Profiling Direct3D API Calls":
IDirect3DQuery9* pQueryEvent;
// Create a query of type event
m_pD3DDevice->CreateQuery(D3DQUERYTYPE_EVENT, &pQueryEvent);
// Add an end-of-query marker to the command buffer; the beginning of this query defaults to CreateDevice
pQueryEvent->Issue(D3DISSUE_END);
// Flush all commands in the command buffer to the driver, then poll until the query
// becomes signaled, which happens once the GPU has finished every command in the CB
while (S_FALSE == pQueryEvent->GetData(NULL, 0, D3DGETDATA_FLUSH));
LARGE_INTEGER start, stop;
QueryPerformanceCounter(&start);
// The calls being profiled (arguments omitted)
SetTexture();
DrawPrimitive();
pQueryEvent->Issue(D3DISSUE_END);
while (S_FALSE == pQueryEvent->GetData(NULL, 0, D3DGETDATA_FLUSH));
QueryPerformanceCounter(&stop);
1. The first call to GetData uses the D3DGETDATA_FLUSH flag, so the drawing commands in the command buffer are flushed to the driver. Once the GPU has processed all of them, it sets the query object to the signaled state.
2. The device-independent SetTexture command is added to the runtime command buffer.
3. The device-independent DrawPrimitive command is added to the runtime command buffer.
4. The Issue command is added to the runtime command buffer.
5. GetData flushes all commands in the buffer to the driver. Note that this flush does not wait for the GPU to finish executing the commands. It causes a switch from user mode to kernel mode.
6. The call waits for the driver to convert all commands into hardware-specific instructions and fill the driver buffer, then returns from kernel mode to user mode.
7. The GetData loop polls the state of the query object. When the GPU has finished all instructions in the driver buffer, it changes the state of the query object.
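The flush-then-poll pattern in the steps above can be mimicked with a small standalone model. This is a sketch under strong assumptions rather than real D3D behaviour: ToyPipeline and poll_until_signaled are hypothetical names, and the model lets the "GPU" execute exactly one command per poll so that the loop terminates deterministically.

```cpp
#include <cstddef>
#include <deque>

enum class Cmd { Work, EventMarker };

// Toy model of the application -> runtime -> driver -> GPU path (hypothetical).
class ToyPipeline {
public:
    // A draw or state call: only appended to the runtime command buffer.
    void draw() { runtime_.push_back(Cmd::Work); }

    // Models Issue(D3DISSUE_END): the query leaves the signaled state and
    // an end marker is queued behind all earlier commands.
    void issue_end() {
        signaled_ = false;
        runtime_.push_back(Cmd::EventMarker);
    }

    // Models GetData(..., D3DGETDATA_FLUSH): flush the runtime buffer to the
    // driver, let the GPU execute ONE command (a modelling shortcut), then
    // report whether the GPU has reached the marker.
    bool get_data_flush() {
        while (!runtime_.empty()) {
            driver_.push_back(runtime_.front());
            runtime_.pop_front();
        }
        if (!driver_.empty()) {
            Cmd c = driver_.front();
            driver_.pop_front();
            if (c == Cmd::EventMarker) signaled_ = true;
        }
        return signaled_;
    }

    std::size_t queued_in_runtime() const { return runtime_.size(); }

private:
    std::deque<Cmd> runtime_;  // device-independent commands
    std::deque<Cmd> driver_;   // commands already handed to the driver
    bool signaled_ = true;     // an idle query is in the signaled state
};

// Spins like the while (S_FALSE == GetData(...)) loop and returns the number of polls.
int poll_until_signaled(ToyPipeline& p) {
    int polls = 1;
    while (!p.get_data_flush()) ++polls;
    return polls;
}
```

The point of the model is that the marker only becomes visible after everything queued before it has been executed, so the time spent polling covers the GPU work and not just the cost of appending commands.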
The following situations may empty the runtime command buffer and cause a mode switch:
1. Calling a Lock method (under certain conditions and with certain lock flags)
2. Creating a device, vertex buffer, index buffer or texture
3. Completely releasing a device, vertex buffer, index buffer or texture resource
4. Calling ValidateDevice
5. Calling Present
6. The command buffer becoming full
7. Calling GetData with D3DGETDATA_FLUSH
I do not fully understand the documentation's explanation of D3DQUERYTYPE_EVENT ("Query for any and all asynchronous events that have been issued from API calls"); if any friend understands it, please tell me. All I know is that when the GPU has processed the D3DISSUE_END marker that a D3DQUERYTYPE_EVENT query added to the CB, it sets the query object to the signaled state, so the CPU's wait on the query is necessarily asynchronous. For efficiency, use as few BeginScene/EndScene pairs as possible before each Present. Why would they affect efficiency? I can only guess: EndScene may cause a command buffer flush, which means a mode switch, and it may also trigger some runtime operations on managed resources. Also, EndScene is not a synchronous method; it does not wait for the driver to execute all commands before it returns.
The D3D runtime distinguishes three kinds of memory: video memory (VM), AGP memory (AM) and system memory (SM). Every D3D resource is created in one of these three. When creating a resource we can specify one of the following pool flags: D3DPOOL_DEFAULT, D3DPOOL_MANAGED, D3DPOOL_SYSTEMMEM and D3DPOOL_SCRATCH. VM is the memory on the graphics card. The CPU can only reach it through the AGP or PCI-E bus, so reads and writes are very slow. Sequential CPU writes to VM are slightly faster than reads, because the CPU allocates a 32- or 64-byte write buffer in its cache (depending on the cache line length) and writes to VM in one go when that buffer is full. SM is ordinary system memory. CPU reads and writes are very fast because SM is cached in the L2 cache, but the GPU cannot access it directly, so resources created in SM cannot be used by the GPU directly. AM is the most troublesome type. It physically lives in system memory, but this region is not cached by the CPU, so every CPU read or write of AM misses the cache and then goes over the memory bus. CPU access to AM is therefore slower than access to SM, although sequential writes are somewhat faster than reads because the CPU uses write combining when writing to AM. The GPU can access AM directly over the AGP or PCI-E bus.
If we create a resource with D3DPOOL_DEFAULT, the D3D runtime picks the memory type, VM or AM, automatically from the usage we specify. No backup copy is kept anywhere else, so when the device is lost the contents of the resource are lost too. The system will not fall back to D3DPOOL_SYSTEMMEM or D3DPOOL_MANAGED in its place; note that these are completely different pool types. A texture created in D3DPOOL_DEFAULT cannot be locked by the CPU unless it is a dynamic texture. Vertex buffers, index buffers, render targets and back buffers created in D3DPOOL_DEFAULT can be locked (render targets and back buffers only when they were created as lockable). When you create a resource with D3DPOOL_DEFAULT and video memory is exhausted, managed resources are evicted from video memory to free enough space. D3DPOOL_SYSTEMMEM and D3DPOOL_SCRATCH both live in SM. The difference is that with D3DPOOL_SYSTEMMEM the resource format is limited by the device capabilities, because the resource may be uploaded to AM or VM for use by the graphics system, whereas SCRATCH is limited only by the runtime, so such resources cannot be used by the graphics system. The D3D runtime optimizes D3DUSAGE_DYNAMIC resources and generally places them in AM, although this is not guaranteed. Why static textures cannot be locked while dynamic textures can has to do with the design of the D3D runtime, and is discussed in the D3DLOCK section below.
D3DPOOL_MANAGED lets the D3D runtime manage the resource. Two copies of the resource are kept, one in SM and one in VM/AM. On creation the resource is placed in SM; when the GPU needs it, the runtime automatically copies the data to VM. If the resource is modified by the GPU, the runtime updates the SM copy when necessary, and modifications made in SM are likewise propagated to VM. Data that the CPU or the GPU modifies frequently must never use the managed pool, because the synchronization cost is very high. When the device is reset after being lost, the runtime automatically restores the VM data from the SM copies. The backup data in SM is not all committed to VM, so the total amount of backup data can far exceed the VM capacity. As the number of resources grows, the backup data may well be swapped out to disk, which can make the reset process unusually slow. The runtime keeps a timestamp for every managed resource. When it needs to copy backup data to VM it allocates space in VM; if the allocation fails because VM has no free space, the runtime releases resources according to an LRU algorithm based on these timestamps. SetPriority sets the priority of a resource: resources with lower priority are evicted first, and among resources of equal priority the more recently used one is kept. This lets the runtime release resources sensibly, so that a resource is unlikely to be needed again immediately after it has been released. The application can also call EvictManagedResources to force all managed resources out of VM. If the next frame uses managed resources, the runtime then has to reload them, which hurts performance badly, so this call is not normally used. When switching levels, however, it is very useful, because it eliminates fragmentation of VM.
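The eviction rule described above (lower priority first, then least recently used) can be written down as a short selection function. This is a minimal sketch, not the runtime's actual code: the ManagedResource struct, its fields and pick_victim are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bookkeeping record for one managed resource resident in VM.
struct ManagedResource {
    std::string name;
    std::uint32_t priority;   // as set through SetPriority; lower goes first
    std::uint64_t last_used;  // timestamp of the most recent use
};

// Returns the index of the resource to evict, or -1 if nothing is resident.
// Lowest priority wins; ties are broken by the oldest timestamp (LRU).
int pick_victim(const std::vector<ManagedResource>& resident) {
    int victim = -1;
    for (std::size_t i = 0; i < resident.size(); ++i) {
        if (victim < 0) {
            victim = static_cast<int>(i);
            continue;
        }
        const ManagedResource& a = resident[i];
        const ManagedResource& b = resident[static_cast<std::size_t>(victim)];
        if (a.priority < b.priority ||
            (a.priority == b.priority && a.last_used < b.last_used)) {
            victim = static_cast<int>(i);
        }
    }
    return victim;
}
```

With the resident set {terrain, priority 1, used at 10}, {hud, priority 0, used at 30}, {skybox, priority 0, used at 20}, the function picks skybox: it shares the lowest priority with hud but was used less recently.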
The LRU algorithm performs poorly in some cases. For example, when the resources needed to draw one frame do not all fit into VM (managed), LRU causes severe performance fluctuations, as in the following example:
BeginScene();
Draw(Box0);
Draw(Box1);
Draw(Box2);
Draw(Box3);
Draw(Circle0);
Draw(Circle1);
EndScene();
Present();
Assume that VM can hold only five of these geometries. Under LRU, part of the data has to be evicted before Box3 can be drawn, and the victim is Circle0, which is about to be needed. Evicting Box2 is clearly the more reasonable choice, so in this situation the runtime switches to an MRU algorithm for the subsequent draw calls, which solves the performance fluctuation nicely. However, whether a resource is in use is tracked per frame, not per draw call, and a frame is delimited by a BeginScene/EndScene pair. In this situation, therefore, sensible use of BeginScene/EndScene can improve performance considerably when VM is insufficient. The DX documentation also points out that, with the debug runtime, the query mechanism can provide more information about the resources managed by the runtime, though this seems to be useful only in debug mode. Understanding how the runtime manages resources is very important, but do not expose these details in your code, because they are liable to change. Finally, remember that the runtime is not the only layer that manages resources; the driver may implement this as well. The D3DCAPS2_CANMANAGERESOURCE flag tells us whether the driver has resource management, and specifying D3DCREATE_DISABLE_DRIVER_MANAGEMENT in CreateDevice turns the driver's resource management off.
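The difference between the two policies in the Box/Circle example can be checked with a small simulation. It is only a sketch under simplifying assumptions: every geometry has the same size, the MRU policy is applied from the very first frame (the real runtime only switches to it when it detects the overflow), and count_uploads and Policy are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Policy { LRU, MRU };

// Counts how many uploads to video memory are needed to draw the same frame
// `frames` times when only `capacity` resources fit into video memory at once.
std::size_t count_uploads(const std::vector<std::string>& frame,
                          std::size_t capacity, int frames, Policy policy) {
    assert(capacity > 0);
    // Ordered from least recently used (front) to most recently used (back).
    std::vector<std::string> resident;
    std::size_t uploads = 0;
    for (int f = 0; f < frames; ++f) {
        for (const std::string& name : frame) {
            auto it = std::find(resident.begin(), resident.end(), name);
            if (it != resident.end()) {
                resident.erase(it);  // hit: re-appended below as most recent
            } else {
                ++uploads;           // miss: the resource must be uploaded
                if (resident.size() == capacity) {
                    if (policy == Policy::LRU) {
                        resident.erase(resident.begin());  // evict the oldest
                    } else {
                        resident.pop_back();               // evict the newest
                    }
                }
            }
            resident.push_back(name);
        }
    }
    return uploads;
}
```

For the six draws above with room for five geometries, three frames cost 18 uploads under LRU (every draw misses) but only 8 under MRU (six in the first frame, then one per frame), which is the fluctuation the text describes.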
D3DLOCK: exploring the D3D runtime
What happens when we lock a DEFAULT resource? A DEFAULT resource may be in VM or in AM. If it is in VM, the runtime has to allocate a temporary buffer in system memory and return that to the application. The application fills the temporary buffer, and on Unlock the runtime transfers its contents back to VM. If the resource's D3DUSAGE does not include WRITEONLY, the system also has to copy the original data from VM into the temporary buffer first, which is why omitting WRITEONLY reduces performance. CPU writes to AM need care as well, because they are generally write-combined: writes are collected in a cache-line-sized buffer, which is flushed to AM when the line is full. First, the data written must tolerate weak ordering (graphics data generally does). It is said that the D3D runtime and the NV driver have a small bug in which the GPU starts drawing before the CPU has flushed the data to AM, so the resource is drawn incorrectly; if this happens, use an instruction such as SFENCE to flush the cache line. Second, try to fill a whole cache line at a time, otherwise there is extra latency: the CPU always flushes an entire cache line to the target, so if we write only some bytes of a line, the CPU must first read the whole line from AM, combine it with the new data and then flush it. Third, write sequentially as far as possible, because random writes defeat write combining. If a resource is written randomly, do not create it with D3DUSAGE_DYNAMIC; use D3DPOOL_MANAGED instead, so that the writes are completed entirely in SM.
An ordinary texture (D3DPOOL_DEFAULT) cannot be locked because it lives in VM; it can only be accessed through UpdateSurface and UpdateTexture. Why does D3D not let us lock static textures when it does let us lock static VBs and IBs? I can think of two possible reasons. First, textures are generally very large, and they are stored inside the GPU in a two-dimensional layout. Second, textures are stored in the GPU's native format, not as plain RGBA. Marking a texture as dynamic tells D3D that it will change frequently, so D3D treats its storage specially. Textures that are modified at very high frequency are nevertheless not well suited to the dynamic attribute. There are two cases: a render target written by the GPU, and a video texture written by the CPU. We know that dynamic resources are generally placed in AM. The GPU has to go over the AGP/PCI-E bus to reach AM, which is much slower than VM, and CPU access to AM is much slower than access to SM. A dynamic resource therefore means constant latency for both GPU and CPU accesses. For such resources it is best to create one copy in D3DPOOL_DEFAULT and one in D3DPOOL_SYSTEMMEM and to update between the two manually. Do not try to create a render target in D3DPOOL_MANAGED (D3D9 requires D3DUSAGE_RENDERTARGET resources to be in D3DPOOL_DEFAULT); it would be very inefficient in any case, for reasons left to the reader to analyze. For resources that change infrequently, I recommend creating them in DEFAULT and updating them manually, because the cost of an occasional update is smaller than the cost of the GPU continually accessing AM.
A careless Lock can seriously hurt performance. A Lock generally has to wait until all earlier drawing commands in the command buffer have finished before it returns, because otherwise it could modify a resource that is still in use. From the moment Lock returns until the modification is complete and Unlock is called, the GPU sits idle, so the parallelism between GPU and CPU is wasted. DX8.0 introduced a new lock flag, D3DLOCK_DISCARD, which means that the resource will not be read and will be overwritten completely. The driver and the runtime then play a trick: they immediately return a pointer to a different block of VM to the application, and the original block is discarded and never used again once the GPU is finished with it. The CPU's Lock therefore does not have to wait for the GPU to finish using the resource, and the application can go on working with the graphics resource (vertex buffer or index buffer). This technique is called VB/IB renaming.
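The renaming trick can be illustrated with a small standalone model. This is a sketch of the idea only, not how any driver implements it: RenamedBuffer, queue_draw and lock_discard are hypothetical names, and a shared_ptr reference held by a queued draw stands in for "the GPU is still using this block".

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy model of a vertex buffer whose storage is renamed on a discard lock.
class RenamedBuffer {
public:
    using Storage = std::vector<unsigned char>;

    explicit RenamedBuffer(std::size_t bytes)
        : bytes_(bytes), current_(std::make_shared<Storage>(bytes)) {}

    // A queued draw keeps a reference to the storage it will read from.
    std::shared_ptr<const Storage> queue_draw() const { return current_; }

    // Models Lock(D3DLOCK_DISCARD): if a queued draw still references the
    // current block, hand out a fresh block instead of waiting for the GPU.
    unsigned char* lock_discard() {
        if (current_.use_count() > 1) {
            current_ = std::make_shared<Storage>(bytes_);
            ++renames_;
        }
        return current_->data();
    }

    std::size_t renames() const { return renames_; }

private:
    std::size_t bytes_;
    std::shared_ptr<Storage> current_;
    std::size_t renames_ = 0;
};
```

In this model the old block stays valid for the draw that references it and is freed automatically when that draw completes, so the CPU can fill the new block at once without stalling.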
A lot of this confusion comes from the lack of low-level information. I believe that if MS opened the D3D source and the driver interface specification, and NV/ATI opened their display drivers and hardware architectures, these things would be easy to figure out.