Date: Tue, 22 Oct 2002 02:39:35 -0700 (Pacific Daylight Time) From: Casey Muratori Subject: Normal casting report [long] Well, I don't really have any more items on my to-do list, which means it's time for the report. We decided to put normal mapping support into Granny about two months ago. Since then I've been working on it now and then, intermixed with other stuff. All told I think I've spent somewhere between two and three weeks worth of time on the actual code, and about one week's worth of time on integration with art tools. The other time was spent on the usual tech support, and making a modern toy renderer, since I needed that for testing the normal maps (shadows, shaders, etc., with a DX8/OpenGL/software switchable layer). My basic design was the one from the Silouette Clipping paper by Hoppes from a few year's back. I don't know if they came up with this or not, but the paper makes it sound like they did but that it's similar to something from the 1998 proceedings of "Visualization". Which I have not tried to find. It is my understanding that this is [CENSORED]-style, since I remember him saying that he shoots rays for the normal map generation, which is how this technique works. From my understanding of [CENSORED]', his is UV-parameterization based, so there's not much in common between ours. The basic process is simple, and I'm sure you all know it: you walk through the triangles of your low-res mesh, you "rasterize" them into the texture using their UV coordinates, and at each pixel, you cast rays out to the high-res mesh and see what face you hit. You barycentrically interpolate a normal from that, and that's what you write into the normal map. So I'll skip to the parts that I have not seen written anywhere. #1: Voxels I decided that the best way to speed up a scheme like this would be with voxels. The reason for this is that you are casting a ray from a piece of geometry which is, by definition, going to be very close to the piece of geometry it hits. If this weren't the case, you shouldn't be using normal maps to represent the detail, because it's not "detail" now is it. 150 seems to be about the magic speed point for whatever reason, so by make a 150x150x150 occupancy map of the high-res model at the outset. Now, much like you would THINK there would be hundreds, nay, thousands of excellent, readily accesible papers on directly voxelizing triangles. You would be wrong. In fact, there's almost none. And the one canonical one that everyone references is actually bullshit, because it does not do a conservative voxelization - ie., it does not guarantably mark all voxels the triangle passes through, much like a conventional texture mapper doesn't light all pixels that the triangle passes through, it uses a rounding rule (more on that later). So, I found I had to just invent an algorithm for directly voxelizing triangles conservatively, and it's somewhat clumsy but works quite well. Basically it first slices the triangle in half along one axis at the middle point (sorted along that axis), then it slices it along each voxel boundary along that axis. This produces a set of quadrilaterals, which I treat as two triangles. I then slice those triangles in half and then along voxel boundaries for the next axis, producing spans in the third axis. I run through the spans adding the triangles to the linked list of triangles for those voxels. The implementation of this is very much like a regular rasterizer, it's just very careful about all the edge conditions to make sure that every single voxel gets lit by every single triangle that so much as osculates with it. When adding the triangles to the list, I precompute some stuff: struct voxel_triangle { quad Plane4; int32x I1, I2; real32 P0, P1; real32 InvU1, U2; real32 CU, CV; }; that is what I actually store. This is the plane equation, the indices of the two most major axes of the triangle (I1, I2), the coordinates of triangle vertex 0 on those two axes (P0, P1), and then some terms: real32 U1 = Edge01[I1]; real32 V1 = Edge01[I2]; real32 U2 = Edge02[I1]; real32 V2 = Edge02[I2]; real32 InvCrossTerm = 1.0f / (U1*V2 - V1*U2); VoxTri.InvU1 = 1.0f / U1; VoxTri.U2 = U2; VoxTri.CU = V1 * InvCrossTerm; VoxTri.CV = U1 * InvCrossTerm; where Edge01 is vertex 1 minus vertex 0, and Edge02 is vertex 2 minus vertex 0. Basically, these terms are just the "most factored form" you can make out of the plane-equation-based triangle intersection routine, which I optimized as much as I could without giving it to Jeff or the Mike^2 for assembly sorts of things. I switched to this intersection routine from the Thomas Moeller one (I think that's his name?) because when you want to handle large numbers of rays and your triangles are static, then do a bunch of precomputation is a good thing, and the plane-equation-style intersector is much more susceptible to precomputation as far as I could tell. When it comes to actually casting the ray, it gets a bit tricky, because you have to walk the voxel grid in both directions at once, because you want the closest intersection (it's a line cast, not really a ray cast, right). The way you do this is that you have basically the "minimum t" along the ray that you will need to go to before you can change to the next voxel in each of the three axes (a Bresenham error kind of thing). So what you do is you keep two of these, and go in both directions until the t intersection you find is less than the previous minimum t you transitioned - ie., until you've moved to a voxel who's entirety is further away than an intersection you've already found - in both directions. For speed purposes, I shut down each direction as soon as this is true, since you don't want to waste one extra voxel test by making them shut down symmetrically. #2 Rasterization I think there are two schools of thought here. One is the "let me do something relatively sloppy and massively oversample", and the other is "let me do something very precise and sample one-to-one". I chose the latter, because I figured I could always do the former as a post pass over my code :) In order to have everything be 100% precise, I wanted to be able to know for each pixel in the texture map, every triangle that passed through that pixel, and the precise (possibly degenerate) trapezoid in the texture map and in world-space that resulted. Again, I thought SURELY there would be hundreds of papers on this, because hey don't you need this for antialiasing and so on, but nay. I think I must have just been looking in the wrong place or something because I didn't find jack squat. But luckily I didn't have to find jack squat, because I'd already been through this with the voxelizer, so I just did the same thing. I wrote a dicing algorithm that chops things up nicely. It's done as an abstract thing where you call it with a triangle, and it calls you back with every trapezoid fragment produced: struct fragmentor_point { triple Point; triple Position; triple Normal; }; typedef void handle_fragment(void *CallerData, int32x TextureX, int32x TextureY, fragmentor_point &TopStart, fragmentor_point &TopEnd, fragmentor_point &BottomStart, fragmentor_point &BottomEnd); "Point" is in the texture, "Position" is in world-space. To verify that everything was 100% correct, it optionally keeps a debug buffer around that you can use to draw the fragments and verify that they line up perfectly with texel / triangle boundaries. Another nice side-effect of having all this information when you go to rasterize is that if you _do_ want to oversample, you have the extra trapezoid for the pixel, so you can go ahead and cast as many rays as you want from different interpolated positions on that trapezoid, apply whatever filter you want, and so on - you are not stuck with rectilinear oversampling. Now, since I don't know jack shit about sampling, this doesn't help ME out much, but if I go learn some sampling theory then it might. #3 Map accumulation I store everything into a floating point image which is 4 + 3*n deep, where n is the number of other maps I'm sampling along with the normal map (Theo usually makes diffuse, specular, opacity, and bump - so n is usually 3), and the constat 4 is for the 3 normal components and 1 weight accumulator. All the ops are defined on this texture. Pixel values are added in modulated by the area in texels of the trapezoid that generated them (which is guarantably less than 1), and the total weight is accumulated. On float->32-bit conversion, the normals channel is normalized and the color channel is divided by the weight accumulator. #4 Tangent space computation I would like to say that I have no idea where the "standard", and by that I mean "nVidia", method of computing tangent spaces comes from. Is this what people normally use??? It's crazy! It's like several pages of code. I worked out the math on paper myself from first principles (otherwise known as the Barrett method) and got simply: (E_a is the world-space edge from vertex 0 to vertex 1, and a_u and a_v are the UV-space components of it. E_b, b_u, and b_v are the same thing but for the edge from vertex 0 to vertex 2) c = a_v b_u - a_u b_v -b_v a_v dP_dU = ----- E_a + --- E_b c c -b_u a_u dP_dV = ----- E_a + --- E_b c c Like, that's it. It's like the simplest fucking thing to figure out, and it makes perfect intuitive sense. You just start with: m a_u + n b_u = 1 m a_v + n b_v = 0 and solve from m and n. End of story. It takes like ten seconds. You can then just do it again with the opposite: o a_u + p b_u = 0 o a_v + p b_v = 1 to get the other vector, and of course you don't actually have to do the math because you already know the answer by changing the variables. Is this written up somewhere? Because somebody really should do that since the nVidia crap is just that (crap). I don't know how anybody would read it and understand what they were doing, because it's completely undocumented unless I missed one of their whitepapers on it or something. If you don't happen to know the cross-product differentiation trick, you'd be dead in the water (and I don't even remember how I knew it, but I imagine [CENSORED] must've told me at some point). OK now I'm tired and I'm going to bed, but I want to talk about edge filters so you probably haven't heard the last of me on this subject. - Casey Date: Tue, 22 Oct 2002 02:50:54 -0700 (Pacific Daylight Time) From: Casey Muratori Subject: Re: Normal casting report [long] Oh, and I forgot to include example timings. On my Athlon 1900: ------------------------------------------------ Texture: 512 x 512 MeshLow: 894 tris / 611 verts MeshHigh: 31280 tris / 16288 verts VoxelGrid: 150 x 150 x 150 (250738 records) Avg Tris/Voxel: 0.074293 (Max: 52) Avg Tris/Occ: 3.151836 (79553 voxels) 16% ( 1) VoxelizeMesh: 0.189697 [0.191041] 0% ( 2) IncludeInBounds: 0.001344 [0.001344] 12% ( 1) GenerateNormalMap: 0.140062 [1.025122] 0% ( 894) FragmentTriangle: 0.003150 [0.641243] 28% ( 17952) TSpanX: 0.344544 [0.638093] 6% ( 248206) RayCastVoxelGrid: 0.074641 [0.293549] 7% ( 965698) RayCastVoxelGrid::FChain: 0.090547 [0.090547] 11% ( 935620) RayCastVoxelGrid::RChain: 0.128361 [0.128361] 6% ( 1) FilterUncastPixels: 0.071534 [0.071534] 5% ( 2) ConvertTo32BitNormalized: 0.058837 [0.058837] 9% ( 4) ConvertTo32Bit: 0.113446 [0.113446] Total: 1.216163 (measured: 1.216172 difference: 0.000009) ------------------------------------------------ Texture: 512 x 512 MeshLow: 894 tris / 611 verts MeshHigh: 31280 tris / 16288 verts VoxelGrid: 150 x 150 x 150 (250738 records) Avg Tris/Voxel: 0.074293 (Max: 52) Avg Tris/Occ: 3.151836 (79553 voxels) 2% ( 1) VoxelizeMesh: 0.191086 [0.192569] 0% ( 2) IncludeInBounds: 0.001483 [0.001483] 2% ( 1) GenerateNormalMap: 0.124785 [7.866767] 0% ( 894) FragmentTriangle: 0.003524 [7.498649] 47% ( 17952) TSpanX: 3.791077 [7.495125] 13% ( 3972131) RayCastVoxelGrid: 1.061534 [3.704048] 14% ( 15506097) RayCastVoxelGrid::FChain: 1.125388 [1.125388] 19% ( 15239014) RayCastVoxelGrid::RChain: 1.517126 [1.517126] 1% ( 1) FilterUncastPixels: 0.071726 [0.071726] 1% ( 2) ConvertTo32BitNormalized: 0.058779 [0.058779] 1% ( 4) ConvertTo32Bit: 0.112829 [0.112829] Total: 8.059336 (measured: 8.059347 difference: 0.000011) The top one, which is 1.2 seconds, is casting a 512x512 map. As you can see, it's a 30k -> 900 tri cast. The bottom one is the same cast with 16x oversampling turned on, so you can see how the profile changes when you do tons of per-trapezoid work without increasing the rest of the work. In general, I only use 1-cast per trapezoid (which equates to one cast per pixel per triangle that touches the pixel), because those normal maps look much crisper than any oversampled version as I mentioned in previous mails. The profile is generally a "good" thing, because TSpanX (or "t spanks" as I call it informally) is where a lot of currently unoptimized tangent frame interpolation and normalization ridiculousness is going on, so if I optimized that, I think I could boost the speed quite a bit. For low-end performance, I could also speed up VoxelizeMesh quite a bit, because right now it is completely unoptimized. An average of 1 or 2 seconds per cast is not really a big deal, so I don't know how much time I'll spend optimizing things, but if I do an optimization pass over the final version, I'll post more extensive timing results. - Casey