In the 'flat plane' case, it's simply Z = Z0 (the orthorectified image is a kind of flat cloud). This is clearly quite a strong approximation, as complicated 3D structures in the image won't be properly re-projected in the orthophoto. Sadly, without additional information such as a proper underlying 3D model, we can't generate a 'clean' orthophoto (and we don't have enough keypoints to generate such a model). Fortunately, it works very well for terrains and convex shapes (which, luckily, most buildings are).
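To make the flat-plane idea concrete, here is a minimal sketch of how a pixel could be sent onto the plane Z = Z0 by intersecting its viewing ray with that plane. It is only an illustration, not necessarily how the actual code does it. The function name `pixel_to_plane` and all of its parameters are hypothetical, and it assumes an ideal pinhole camera (no lens distortion) that looks along +Z in its own frame, with a world-to-camera rotation matrix.

```python
def pixel_to_plane(pixel, rotation, camera_center, focal_px, principal_point, z0):
    """Intersect the viewing ray of a pixel with the horizontal plane Z = z0.

    Hypothetical helper, assuming an ideal pinhole camera:
      rotation        -- 3x3 world-to-camera rotation matrix (list of rows)
      camera_center   -- camera position (X, Y, Z) in world coordinates
      focal_px        -- focal length expressed in pixels
      principal_point -- (cx, cy) in pixels
    Returns the (X, Y, Z) world point, or None if the ray never hits the plane.
    """
    u, v = pixel
    cx, cy = principal_point
    # Ray direction in the camera frame (camera looks along +Z).
    ray_cam = ((u - cx) / focal_px, (v - cy) / focal_px, 1.0)
    # Rotate the ray into the world frame with the transpose of 'rotation'.
    ray_world = [sum(rotation[i][j] * ray_cam[i] for i in range(3)) for j in range(3)]
    if ray_world[2] == 0:
        return None  # ray is parallel to the plane
    t = (z0 - camera_center[2]) / ray_world[2]
    if t <= 0:
        return None  # the plane is behind the camera
    return tuple(c + t * d for c, d in zip(camera_center, ray_world))
```

For a camera 100 units above the plane and looking straight down, a pixel 100 px to the right of the principal point with a 1000 px focal length lands 10 units away from the nadir point, which is what the sketch returns.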
And as I already said before, I also think that we could use something other than the collinearity equation:
I just remember that I didn't find the proper way to do this at the time I developed this (maybe I was simply tired and missed something simple!). What we want is only a way to project a 3D point into the 2D image. If you find a way to do this without the collinearity equation, I'd be happy to update the code.

Oh, when I said to use the transformation explicitly, I wasn't necessarily thinking of the collinearity equation. What we want is only to deduce the 2D pixel associated with a 3D point. So I guess that projecting the 3D point into the camera reference frame and then using the focal length and other intrinsic parameters could do. It's just a guess though ;)
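For what it's worth, the guess above can be sketched as a plain pinhole projection: move the 3D point into the camera frame, then scale by the focal length and shift by the principal point. This is only a sketch under assumptions. The name `project_point` and its parameters are hypothetical, the rotation is taken to be world-to-camera, the camera is assumed to look along +Z in its own frame, and lens distortion is ignored. Note that this is essentially the same geometry as the collinearity equation, just written in matrix form.

```python
def project_point(point_world, rotation, camera_center, focal_px, principal_point):
    """Project a 3D world point to 2D pixel coordinates (ideal pinhole model).

    Hypothetical helper:
      rotation        -- 3x3 world-to-camera rotation matrix (list of rows)
      camera_center   -- camera position (X, Y, Z) in world coordinates
      focal_px        -- focal length expressed in pixels
      principal_point -- (cx, cy) in pixels
    Returns (u, v), or None if the point is not in front of the camera.
    """
    # Express the point relative to the camera center.
    d = [p - c for p, c in zip(point_world, camera_center)]
    # Rotate into the camera frame.
    xc, yc, zc = (sum(r * v for r, v in zip(row, d)) for row in rotation)
    if zc <= 0:
        return None  # behind (or exactly at) the camera plane
    cx, cy = principal_point
    # Perspective division, then apply the intrinsic parameters.
    return (focal_px * xc / zc + cx, focal_px * yc / zc + cy)


# Example: camera at the origin with no rotation.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
uv = project_point((1.0, 2.0, 10.0), identity, (0.0, 0.0, 0.0), 1000.0, (500.0, 400.0))
```

With real data, the distortion terms of the camera model would still have to be applied on top of this.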