From: pottier@clipper.ens.fr (Francois Pottier) Subject: csmp-digest-v3-036 Date: Thu, 16 Jun 1994 13:35:08 +0200 (MET DST) C.S.M.P. Digest Thu, 16 Jun 94 Volume 3 : Issue 36 Today's Topics: Apple Event example Faster Square Root Algorithm How to save alpha in PICT files? Open Transport & ASLM Sending the Mac Screen Image What are Universal Headers? The Comp.Sys.Mac.Programmer Digest is moderated by Francois Pottier (pottier@clipper.ens.fr). The digest is a collection of article threads from the internet newsgroup comp.sys.mac.programmer. It is designed for people who read c.s.m.p. semi- regularly and want an archive of the discussions. If you don't know what a newsgroup is, you probably don't have access to it. Ask your systems administrator(s) for details. If you don't have access to news, you may still be able to post messages to the group by using a mail server like anon.penet.fi (mail help@anon.penet.fi for more information). Each issue of the digest contains one or more sets of articles (called threads), with each set corresponding to a 'discussion' of a particular subject. The articles are not edited; all articles included in this digest are in their original posted form (as received by our news server at nef.ens.fr). Article threads are not added to the digest until the last article added to the thread is at least two weeks old (this is to ensure that the thread is dead before adding it to the digest). Article threads that consist of only one message are generally not included in the digest. The digest is officially distributed by two means, by email and ftp. If you want to receive the digest by mail, send email to listserv@ens.fr with no subject and one of the following commands as body: help Sends you a summary of commands subscribe csmp-digest Your Name Adds you to the mailing list signoff csmp-digest Removes you from the list Once you have subscribed, you will automatically receive each new issue as it is created. The official ftp info is //ftp.dartmouth.edu/pub/csmp-digest. Questions related to the ftp site should be directed to scott.silver@dartmouth.edu. Currently no previous volumes of the CSMP digest are available there. Also, the digests are available to WAIS users. To search back issues with WAIS, use comp.sys.mac.programmer.src. With Mosaic, use http://www.wais.com/wais-dbs/comp.sys.mac.programmer.html. ------------------------------------------------------- >From farago@onyx.ill.fr (Bela Farago) Subject: Apple Event example Date: Thu, 2 Jun 94 13:06:18 GMT Organization: ILL - Grenoble FRANCE Hi, please forgive me if it is trivial but I am programming just by hobby (well not really but this is not my main job) I need to write a rather trivial code resource which sends the equivalent of the following AppleScript tell application anApplication save document 1 end tell An Application is an application :) It is hard to digest the volume VI of IM. and I could not find simple example on ftp.apple.com. (I did find some but were not very clear to me) I am bit lost in descriptors and direct objects etc. Could anybody give me pointers to simple examples? Bela Farago (farago@ill.fr) ps: No flames please, you can always just ignore this message. +++++++++++++++++++++++++++ >From greer@utdallas.edu (Dale M. Greer) Date: 2 Jun 1994 16:54:39 GMT Organization: The University of Texas at Dallas Bela Farago (farago@onyx.ill.fr) wrote: > Hi, > please forgive me if it is trivial but I am programming just by hobby (well > not really but this is not my main job) > I need to write a rather trivial code resource which sends the equivalent of > the following AppleScript > tell application anApplication > save document 1 > end tell > An Application is an application :) > It is hard to digest the volume VI of IM. and I could not find simple > example on ftp.apple.com. (I did find some but were not very clear to me) I > am bit lost in descriptors and direct objects etc. Could anybody give me > pointers to simple examples? > Bela Farago (farago@ill.fr) > ps: No flames please, you can always just ignore this message. Here's how I did it. /**************************************************** MyCreateDocContainer */ // 07-APR-94 : Dale M. Greer #pragma segment AEvents void MyCreateDocContainer(AEDesc *myDocContainer, char *docName) { AEDesc myDocDescRec, nullDescRec; OSErr err; // Prepare to make something from nothing err = AECreateDesc(typeNull, nil, 0, &nullDescRec); // This document container just contains the document name err = AECreateDesc(typeChar, docName, strlen(docName), &myDocDescRec); // Encapsulate the object err = CreateObjSpecifier((DescType)cDocument, &nullDescRec, (DescType)formName, &myDocDescRec, true, myDocContainer); AEDisposeDesc(&nullDescRec); } /********************************************************** SaveDocumentAs */ // 07-APR-94 : Dale M. Greer #pragma segment AEvents void SaveDocumentAs(AEAddressDesc *theAddress, char *docName, char *saveName) { OSErr err; AppleEvent appleEvent, reply; AEDesc myDocDescRec, theName; err = AECreateAppleEvent(kAECoreSuite, kAESave, theAddress, kAutoGenerateReturnID, 1L, &appleEvent); // Make the doc container MyCreateDocContainer(&myDocDescRec, docName); // Append it to the appleEvent AEPutParamDesc(&appleEvent, keyDirectObject, &myDocDescRec); // Append the new filename descriptor to the appleEvent AECreateDesc(typeChar, saveName, strlen(saveName), &theName); AEPutParamDesc(&appleEvent, keyAEFile, &theName); // Send it to the Application err = AESend(&appleEvent, &reply, kAEWaitReply + kAENeverInteract, kAENormalPriority, kAEDefaultTimeout, (IdleProcPtr) MyIdleFunction, nil); AEDisposeDesc(&myDocDescRec); AEDisposeDesc(&appleEvent); AEDisposeDesc(&reply); } You get the address of the target application when you launch it, or if it is already launched, use the following routine, where "creator" is the creator type of your application. For example, this creator type is 'XCEL' for Excel, 'ttxt' for TeachText, etc. // This function originated from James "im" Beninghaus of Apple DTS /*********************************************************** AppLaunched */ // Find the process serial number of your application #pragma segment Launch Boolean AppLaunched(AEAddressDesc *theAddress, OSType creator, ProcessSerialNumber *PSN) { OSErr osErr; ProcessSerialNumber process; ProcessInfoRec procInfo; Str255 procName; FSSpec procSpec; osErr = noErr; process.highLongOfPSN = 0; process.lowLongOfPSN = 0; procInfo.processInfoLength = sizeof(procInfo); procInfo.processName = procName; procInfo.processAppSpec = &procSpec; while (procNotFound != (osErr = GetNextProcess(&process))) { if (noErr == (osErr = GetProcessInformation(&process, &procInfo))) { if (procInfo.processSignature == creator) { AECreateDesc(typeProcessSerialNumber, (Ptr)&process, sizeof(ProcessSerialNumber), theAddress); *PSN = process; return(true); } } } return(false); } In summary, the program snippet accomplishing what you desire would look like this. if (AppLaunched(&theAddress, theCreator, &PSN) { SaveDocumentAs(&theAddress, &docName, &saveName); } If the application is not already launched, there are a number of ways to launch it, but I will leave that as an exercise for the reader, or for a later tutorial if you're still having problems. -- Dale Greer, greer@utdallas.edu "They know that it is human nature to take up causes whereby a man may oppress his neighbor, no matter how unjustly. ... Hence they have had no trouble in finding men who would preach the damnability and heresy of the new doctrine from the very pulpit..." - Galileo Galilei, 1615 --------------------------- >From busfield@hurricane.seas.ucla.edu (John D. Busfield) Subject: Faster Square Root Algorithm Date: Tue, 3 May 1994 17:23:52 GMT Organization: School of Engineering & Applied Science, UCLA. Does anyone have an algorithm or code for computing square roots. The emphasis is on speed rather than accuracy in that I only need the result rounded to the nearest integer (ie sqrt(1000) = 31 or 32). Thanks in advance. John Busfield busfield@hurricane.seas.ucla.edu +++++++++++++++++++++++++++ >From rmah@panix.com (Robert S. Mah) Date: Tue, 03 May 1994 17:41:55 -0500 Organization: One Step Beyond busfield@hurricane.seas.ucla.edu (John D. Busfield) wrote: > Does anyone have an algorithm or code for computing square roots. > The emphasis is on speed rather than accuracy in that I only need the > result rounded to the nearest integer (ie sqrt(1000) = 31 or 32). There is an article on generating very fast square root aproximations in the book "Graphics Gems", Ed. Glassner. It generates a lookup table which is indexed by munging the exponent of the argument and then uses the mantissa to do an (linear?) aproximation to the final result. It's fairly accurate to within a few decimal points. I think the source code is also available somewhere on the net, but I'm afraid I don't know where. Another option, since you only want integer aproximations, is to create a table of squares and do a binary search over it. A 1000 element table will give you a range of 1->1,000,000 with an average of log2(1000) = 10 lookups using a simple binary search. A table with 32K or so elements will have a range of 1->1,000,000,000 and will be searchable in log2(32K) = 30 lookups. If you can spare a few dozen K, then this may be fast enough. Cheers, Rob ___________________________________________________________________________ Robert S. Mah -=- One Step Beyond -=- 212-947-6507 -=- rmah@panix.com +++++++++++++++++++++++++++ >From gewekean@studentg.msu.edu (Andrew Geweke) Date: Tue, 3 May 1994 20:56:35 -0400 Organization: Michigan State University In article <rmah-030594174155@rmah.dialup.access.net>, rmah@panix.com (Robert S. Mah) writes: > Another option, since you only want integer aproximations, is to create > a table of squares and do a binary search over it. A 1000 element > table will give you a range of 1->1,000,000 with an average of log2 > (1000) = 10 lookups using a simple binary search. A table with 32K or > so elements will > > have a range of 1->1,000,000,000 and will be searchable in log2(32K) = 30 > lookups. If you can spare a few dozen K, then this may be fast enough. Actually, I once found this algorithm: int intSqrt (int num) { for (i = j = 1, k = 3; num > j; j += (k+=2), ++i) ; return i; } Note that this algorithm rounds up, not down. I'm not sure exactly how correct this is (I just coded it off the top of my head) but you get the basic idea in any case. Adding the odd integers gives you the squares; i.e. 1 = 1; 1 + 3 = 4; 1 + 3 + 5 = 9; 1 + 3 + 5 + 7 = 16; 1 + 3 + 5 + 7 + 9 = 25; and so on. I don't know how fast this is in your specific instance, but it should be quite fast. +++++++++++++++++++++++++++ >From rmah@panix.com (Robert S. Mah) Date: Tue, 03 May 1994 22:04:22 -0500 Organization: One Step Beyond gewekean@studentg.msu.edu (Andrew Geweke) wrote: > rmah@panix.com (Robert S. Mah) writes: > > Another option, since you only want integer aproximations, is to create > > a table of squares and do a binary search over it. A 1000 element > > [...] > > have a range of 1->1,000,000,000 and will be searchable in log2(32K) = 30 > > lookups. If you can spare a few dozen K, then this may be fast enough. > > Actually, I once found this algorithm: > > int intSqrt (int num) { > for (i = j = 1, k = 3; num > j; j += (k+=2), ++i) > ; > > return i; > } > > Note that this algorithm rounds up, not down. I'm not sure exactly how > correct this is (I just coded it off the top of my head) but you get the > basic idea in any case. Adding the odd integers gives you the squares; > i.e. 1 = 1; 1 + 3 = 4; 1 + 3 + 5 = 9; 1 + 3 + 5 + 7 = 16; 1 + 3 + 5 + > 7 + 9 = 25; and so on. > > I don't know how fast this is in your specific instance, but it should > be quite fast. That's a great little algorithm! That sequence looks _so_ familiar, but for the life of me, I can't recall the name... Anyway, it looks like it would be order O(sqrt(n)) where n is the argument to the function. Only a problem if the domain range for the application is large, that is Sqrt(10000) would take 10 times as long as Sqrt(100). However, unlike the table method, the upper bounds isn't fixed at compile time. Cheers, Rob ___________________________________________________________________________ Robert S. Mah -=- One Step Beyond -=- 212-947-6507 -=- rmah@panix.com +++++++++++++++++++++++++++ >From pottier@corvette.ens.fr (Francois Pottier) Date: 4 May 1994 09:42:29 GMT Organization: Ecole Normale Superieure, PARIS, France In article <busfield.767985832@hurricane>, John D. Busfield <busfield@hurricane.seas.ucla.edu> wrote: > Does anyone have an algorithm or code for computing square roots. There was a thread on this subject about a year ago. Here's what I kept: - ----------------------------------------------------------------------- >Does somebody have code or an algorithm for extracting a long integer >square root and returning an integer? There was a long series of letters in Dr. Dobbs' Journal a few years back that I was a part of. Here are two 'competing' subroutines for the 68000 that I wrote. One is a Newton-Raphson iterator, the other a hybrid of three different subroutines using two different methods, heavily optimized for speed. The non-Newton one is faster in some cases and slower in others. Choosing one as 'best' depends on what kind of input ranges you expect to see most often. Newton's time doesn't vary much on the average based on the input argument. The non-Newton one ranges from a lot better (small arguments) to a little worse (large arguments), but is always better than Newton's worst case. Don't neglect the fact that on a FPU-equipped machine it may be faster to use floating point. I've done no research into this possibility, nor have I timed this stuff on any 68020+ system. The speed balance is no doubt different then. (For that matter, the 68000 testbed had no wait states nor interrupt system overhead, which doesn't necessarily apply to a 68000 PowerBook, and certainly doesn't apply to a Mac+ or earlier.) I'm personally fond of the non-Newton version, because the algorithm only uses shifts and adds, so it could be implemented in microcode with about same speed as a divide. ************************************************************************* * * * Integer Square Root (32 to 16 bit). * * * * (Newton-Raphson method). * * * * Call with: * * D0.L = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.L) * * * * Notes: Result fits in D0.W, but is valid in longword. * * Takes from 338 cycles (1 shift, 1 division) to * * 1580 cycles (16 shifts, 4 divisions) (including rts). * * Averages 854 cycles measured over first 65535 roots. * * Averages 992 cycles measured over first 500000 roots. * * * ************************************************************************* .globl lsqrt * Cycles lsqrt movem.l d1-d2,-(sp) (24) move.l d0,d1 (4) ; Set up for guessing algorithm. beq.s return (10/8) ; Don't process zero. moveq #1,d2 (4) guess cmp.l d2,d1 (6) ; Get a guess that is guaranteed to be bls.s newton (10/8) ; too high, but not by much, by dividing the add.l d2,d2 (8) ; argument by two and multiplying a 1 by 2 lsr.l #1,d1 (10) ; until the power of two passes the modified bra.s guess (10) ; argument, then average these two numbers. newton add.l d1,d2 (8) ; Average the two guesses. lsr.l #1,d2 (10) move.l d0,d1 (4) ; Generate the next approximation(s) divu d2,d1 (140) ; via the Newton-Raphson method. bvs.s done (10/8) ; Handle out-of-range input (cheats!) cmp.w d1,d2 (4) ; Have we converged? bls.s done (10/8) swap d1 (4) ; No, kill the remainder so the clr.w d1 (4) ; next average comes out right. swap d1 (4) bra.s newton (10) done clr.w d0 (4) ; Return a word answer in a longword. swap d0 (4) move.w d2,d0 (4) return movem.l (sp)+,d1-d2 (28) rts (16) end ************************************************************************* * * * Integer Square Root (32 to 16 bit). * * * * (Exact method, not approximate). * * * * Call with: * * D0.L = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.L) * * * * Notes: Result fits in D0.W, but is valid in longword. * * Takes from 122 to 1272 cycles (including rts). * * Averages 610 cycles measured over first 65535 roots. * * Averages 1104 cycles measured over first 500000 roots. * * * ************************************************************************* .globl lsqrt * Cycles lsqrt tst.l d0 (4) ; Skip doing zero. beq.s done (10/8) cmp.l #$10000,d0 (14) ; If is a longword, use the long routine. bhs.s glsqrt (10/8) cmp.w #625,d0 (8) ; Would the short word routine be quicker? bhi.s gsqrt (10/8) ; No, use general purpose word routine. * ; Otherwise fall into special routine. * * For speed, we use three exit points. * This is cheesy, but this is a speed-optimized subroutine! page ************************************************************************* * * * Faster Integer Square Root (16 to 8 bit). For small arguments. * * * * (Exact method, not approximate). * * * * Call with: * * D0.W = Unsigned number. * * * * Returns: * * D0.W = SQRT(D0.W) * * * * Notes: Result fits in D0.B, but is valid in word. * * Takes from 72 (d0=1) to 504 (d0=625) cycles * * (including rts). * * * * Algorithm supplied by Motorola. * * * ************************************************************************* * Use the theorem that a perfect square is the sum of the first * sqrt(arg) number of odd integers. * Cycles move.w d1,-(sp) (8) move.w #-1,d1 (8) qsqrt1 addq.w #2,d1 (4) sub.w d1,d0 (4) bpl qsqrt1 (10/8) asr.w #1,d1 (8) move.w d1,d0 (4) move.w (sp)+,d1 (12) done rts (16) page ************************************************************************* * * * Integer Square Root (16 to 8 bit). * * * * (Exact method, not approximate). * * * * Call with: * * D0.W = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.W) * * * * Uses: D1-D4 as temporaries -- * * D1 = Error term; * * D2 = Running estimate; * * D3 = High bracket; * * D4 = Loop counter. * * * * Notes: Result fits in D0.B, but is valid in word. * * * * Takes from 544 to 624 cycles (including rts). * * * * Instruction times for branch-type instructions * * listed as (X/Y) are for (taken/not taken). * * * ************************************************************************* * Cycles gsqrt movem.l d1-d4,-(sp) (40) move.w #7,d4 (8) ; Loop count (bits-1 of result). clr.w d1 (4) ; Error term in D1. clr.w d2 (4) sqrt1 add.w d0,d0 (4) ; Get 2 leading bits a time and add addx.w d1,d1 (4) ; into Error term for interpolation. add.w d0,d0 (4) ; (Classical method, easy in binary). addx.w d1,d1 (4) add.w d2,d2 (4) ; Running estimate * 2. move.w d2,d3 (4) add.w d3,d3 (4) cmp.w d3,d1 (4) bls.s sqrt2 (10/8) ; New Error term > 2* Running estimate? addq.w #1,d2 (4) ; Yes, we want a '1' bit then. addq.w #1,d3 (4) ; Fix up new Error term. sub.w d3,d1 (4) sqrt2 dbra d4,sqrt1 (10/14) ; Do all 8 bit-pairs. move.w d2,d0 (4) movem.l (sp)+,d1-d4 (44) rts (16) page ************************************************************************* * * * Integer Square Root (32 to 16 bit). * * * * (Exact method, not approximate). * * * * Call with: * * D0.L = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.L) * * * * Uses: D1-D4 as temporaries -- * * D1 = Error term; * * D2 = Running estimate; * * D3 = High bracket; * * D4 = Loop counter. * * * * Notes: Result fits in D0.W, but is valid in longword. * * * * Takes from 1080 to 1236 cycles (including rts.) * * * * Two of the 16 passes are unrolled from the loop so that * * quicker instructions may be used where there is no * * danger of overflow (in the early passes). * * * * Instruction times for branch-type instructions * * listed as (X/Y) are for (taken/not taken). * * * ************************************************************************* * Cycles glsqrt movem.l d1-d4,-(sp) (40) moveq #13,d4 (4) ; Loop count (bits-1 of result). moveq #0,d1 (4) ; Error term in D1. moveq #0,d2 (4) lsqrt1 add.l d0,d0 (8) ; Get 2 leading bits a time and add addx.w d1,d1 (4) ; into Error term for interpolation. add.l d0,d0 (8) ; (Classical method, easy in binary). addx.w d1,d1 (4) add.w d2,d2 (4) ; Running estimate * 2. move.w d2,d3 (4) add.w d3,d3 (4) cmp.w d3,d1 (4) bls.s lsqrt2 (10/8) ; New Error term > 2* Running estimate? addq.w #1,d2 (4) ; Yes, we want a '1' bit then. addq.w #1,d3 (4) ; Fix up new Error term. sub.w d3,d1 (4) lsqrt2 dbra d4,lsqrt1 (10/14) ; Do first 14 bit-pairs. add.l d0,d0 (8) ; Do 15-th bit-pair. addx.w d1,d1 (4) add.l d0,d0 (8) addx.l d1,d1 (8) add.w d2,d2 (4) move.l d2,d3 (4) add.w d3,d3 (4) cmp.l d3,d1 (6) bls.s lsqrt3 (10/8) addq.w #1,d2 (4) addq.w #1,d3 (4) sub.l d3,d1 (8) lsqrt3 add.l d0,d0 (8) ; Do 16-th bit-pair. addx.l d1,d1 (8) add.l d0,d0 (8) addx.l d1,d1 (8) add.w d2,d2 (4) move.l d2,d3 (4) add.l d3,d3 (8) cmp.l d3,d1 (6) bls.s lsqrt4 (10/8) addq.w #1,d2 (4) lsqrt4 move.w d2,d0 (4) movem.l (sp)+,d1-d4 (44) rts (16) end -- +----------------+ ! II CCCCCC ! Jim Cathey ! II SSSSCC ! ISC-Bunker Ramo ! II CC ! TAF-C8; Spokane, WA 99220 ! IISSSS CC ! UUCP: uunet!isc-br!jimc (jimc@isc-br.isc-br.com) ! II CCCCCC ! (509) 927-5757 +----------------+ - ----------------------------------------------------------------------------- /* * ISqrt-- * Find square root of n. This is a Newton's method approximation, * converging in 1 iteration per digit or so. Finds floor(sqrt(n) + 0.5). */ int ISqrt(register int n) { register int i; register long k0, k1, nn; for (nn = i = n, k0 = 2; i > 0; i >>= 2, k0 <<= 1) ; nn <<= 2; for (;;) { k1 = (nn / k0 + k0) >> 1; if (((k0 ^ k1) & ~1) == 0) break; k0 = k1; } return (int) ((k1 + 1) >> 1); } -- Joseph Nathan Hall | Joseph's Rule of Thumb: Most folks aren't interested Software Architect | in your rules of thumb. Gorca Systems Inc. | ----- joseph@joebloe.maple-shade.nj.us (home) ----- (on assignment) | (602) 732-2549 jnhall@sat.mot.com (work) - ----------------------------------------------------------------------------- Here's an article I saved from c.s.m.p four months ago. Strangely, it was only distributed in North America. I don't know how fast it works in practice, but there are no multiplications or divisions in its inner loop, which is promising. #include <math.h> /* * It requires more space to describe this implementation of the manual * square root algorithm than it did to code it. The basic idea is that * the square root is computed one bit at a time from the high end. Because * the original number is 32 bits (unsigned), the root cannot exceed 16 bits * in length, so we start with the 0x8000 bit. * * Let "x" be the value whose root we desire, "t" be the square root * that we desire, and "s" be a bitmask. A simple way to compute * the root is to set "s" to 0x8000, and loop doing the following: * * t = 0; * s = 0x8000; * do { * if ((t + s) * (t + s) <= x) * t += s; * s >>= 1; * while (s != 0); * * The primary disadvantage to this approach is the multiplication. To * eliminate this, we begin simplying. First, we observe that * * (t + s) * (t + s) == (t * t) + (2 * t * s) + (s * s) * * Therefore, if we redefine "x" to be the original argument minus the * current value of (t * t), we can determine if we should add "s" to * the root if * * (2 * t * s) + (s * s) <= x * * If we define a new temporary "nr", we can express this as * * t = 0; * s = 0x8000; * do { * nr = (2 * t * s) + (s * s); * if (nr <= x) { * x -= nr; * t += s; * } * s >>= 1; * while (s != 0); * * We can improve the performance of this by noting that "s" is always a * power of two, so multiplication by "s" is just a shift. Also, because * "s" changes in a predictable manner (shifted right after each iteration) * we can precompute (0x8000 * t) and (0x8000 * 0x8000) and then adjust * them by shifting after each step. First, we let "m" hold the value * (s * s) and adjust it after each step by shifting right twice. We * also introduce "r" to hold (2 * t * s) and adjust it after each step * by shifting right once. When we update "t" we must also update "r", * and we do so by noting that (2 * (old_t + s) * s) is the same as * (2 * old_t * s) + (2 * s * s). Noting that (s * s) is "m" and that * (r + 2 * m) == ((r + m) + m) == (nr + m): * * t = 0; * s = 0x8000; * m = 0x40000000; * r = 0; * do { * nr = r + m; * if (nr <= x) { * x -= nr; * t += s; * r = nr + m; * } * s >>= 1; * r >>= 1; * m >>= 2; * } while (s != 0); * * Finally, we note that, if we were using fractional arithmetic, after * 16 iterations "s" would be a binary 0.5, so the value of "r" when * the loop terminates is (2 * t * 0.5) or "t". Because the values in * "t" and "r" are identical after the loop terminates, and because we * do not otherwise use "t" explicitly within the loop, we can omit it. * When we do so, there is no need for "s" except to terminate the loop, * but we observe that "m" will become zero at the same time as "s", * so we can use it instead. * * The result we have at this point is the floor of the square root. If * we want to round to the nearest integer, we need to consider whether * the remainder in "x" is greater than or equal to the difference * between ((r + 0.5) * (r + 0.5)) and (r * r). Noting that the former * quantity is (r * r + r + 0.25), we want to check if the remainder is * greater than or equal to (r + 0.25). Because we are dealing with * integers, we can't have equality, so we round up if "x" is strictly * greater than "r": * * if (x > r) * r++; */ unsigned int isqrt(unsigned long x) { unsigned long r, nr, m; r = 0; m = 0x40000000; do { nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; } while (m != 0); if (x > r) r++; return(r); } -- (Dr.) John Bruner, Deputy Director bruner@csrd.uiuc.edu Center for Supercomputing Research & Development (217) 244-4476 (voice) University of Illinois at Urbana-Champaign (217) 244-1351 (FAX) 465 CSRL, MC-264; 1308 West Main St.; Urbana, IL 61801-2307 -- Jamie McCarthy Internet: k044477@kzoo.edu AppleLink: j.mccarthy - ----------------------------------------------------------------------------- Then there is the easy way - use the toolbox: #define LongSqrt(x) ((short)(FracSqrt((Fract)(x)) >> 15)) FractSqrt is in FixMath.h. This works because given two fixed point number of the form x.y, where x is the number of bits before the decimal point and y is the number of bits after the decimal point, twos complement multiplication will yield a result of: x1.y1 * x2.y2 = x1+x2.y1+y2. If the numbers are the same format, this becomes x.y * x.y = 2x.2y. This holds for squaring a number also: x.y^2 = 2x.2y. The sqrt is then sqrt(x.y) = x/2.y/2. FracSqrt is of the form FracSqrt(2.30) = 2.30. Since we know how sqrt works this can also be expressed as FracSqrt(x.y) = x/2.y/2 << 15 (This is a "virtual" shift since the calculation is compleated with more precision the 15 bits shifted in are "real" and not 0). So if you want to use FracSqrt for a long you get FracSqrt(32.0) = 16.0 << 15. So you shift the result down be 15 to get a short (you can add 1 << 14 prior to shifting down by 15 to round instead of truncing your result). If you wanted a long sqrt with a 16.16 (Fixed) result you would use: #define LongFixedSqrt(x) ((Fixed)(FracSqrt((Fract)(x)) << 1)) If you want a Fixed (16.16) sqrt with a Fixed result (16.16) you would use: #define FixedSqrt(x) ((Fixed)FracSqrt((Fract)(x)) >> 7)) You can do this kind of work with all of the fixed point routines and standard operators. Sean Parent -- Francois Pottier pottier@dmi.ens.fr - ---------------------------------------------------------------------------- This area dedicated to the preservation of endangered species. "Moof!" +++++++++++++++++++++++++++ >From christer@cs.umu.se (Christer Ericson) Date: Wed, 4 May 1994 12:16:40 GMT Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden In <busfield.767985832@hurricane> busfield@hurricane.seas.ucla.edu (John D. Busfield) writes: > Does anyone have an algorithm or code for computing square roots. >The emphasis is on speed rather than accuracy in that I only need the >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32). >Thanks in advance. I posted this to comp.sys.m68k some time ago, here it is again: unsigned short ce_quick_sqrt(n) register unsigned long n; { asm { ;------------------------- ; ; Routine : ISQRT; integer square root ; I/O parameters: d0.w = sqrt(d0.l) ; Registers used: d0-d2/a0 ; ; This is a highly optimized integer square root routine, based ; on the iterative Newton/Babylonian method ; ; r(n + 1) = (r(n) + A / R(n)) / 2 ; ; Verified over the full interval [0x0L,0xFFFFFFFFL] ; ; Some minor compromises have been made to make it perform well ; on the 68000 as well as 68020/030/040. This routine outperforms ; the best of all other algorithms I've seen (except for a table ; lookup). ; ; Copyright (c) Christer Ericson, 1993. All rights reserved. ; ; Christer Ericson, Dept. of Computer Science, University of Umea, ; S-90187 UMEA, SWEDEN. Internet: christer@cs.umu.se ; ;------------------------- ; ; Macintosh preamble since THINK C passes first register param ; in d7. Remove this section on any other machine ; move.l d7,d0 ;------------------------- ; ; Actual integer square root routine starts here ; move.l d0,a0 ; save original input value for the iteration beq.s @exit ; unfortunately we must special case zero moveq #2,d1 ; init square root guess ;------------------------- ; ; Speed up the process of finding a power of two approximation ; to the actual square root by shifting an appropriate amount ; when the input value is large enough ; ; If input values are word only, this section can be removed ; move.l d0,d2 swap d2 ; do [and.l 0xFFFF0000,d2] this way to... tst.w d2 ; go faster on 68000 and to avoid having to... beq.s @skip8 ; reload d2 for the next test below move.w #0x200,d1 ; faster than lsl.w #8,d1 (68000) lsr.l #8,d0 @skip8 and.w #0xFE00,d2 ; this value and shift by 5 are magic beq.s @skip4 lsl.w #5,d1 lsr.l #5,d0 @skip4 ;------------------------- ; ; This finds the power of two approximation to the actual square root ; by doing the integer equivalent to sqrt(x) = 2 ^ (log2(x) / 2). This ; is done by shifting the input value down while shifting the root ; approximation value up until they meet in the middle. Better precision ; (in the step described below) is gained by starting the approximation ; at 2 instead of 1 (this means that the approximation will be a power ; of two too large so remember to shift it down). ; ; Finally (and here's the clever thing) since, by previously shifting the ; input value down, it has actually been divided by the root approximation ; value already so the first iteration is "for free". Not bad, eh? ; @loop add.l d1,d1 lsr.l #1,d0 cmp.l d0,d1 bcs.s @loop @skip lsr.l #1,d1 ; adjust the approximation add.l d0,d1 ; here we just add and shift to... lsr.l #1,d1 ; get the first iteration "for free"! ;------------------------- ; ; The iterative root value is guaranteed to be larger than (or equal to) ; the actual square root, except for the initial guess. But since the first ; iteration was done above, the loop test can be simplified below ; (Oh, without the bvs.s the routine will fail on most large numbers, like ; for instance, 0xFFFF0000. Could there be a better way of handling these, ; so the bvs.s can be removed? Nah...) ; @loop2 move.l a0,d2 ; get original input value move.w d1,d0 ; save current guess divu.w d1,d2 ; do the Newton method thing bvs.s @exit ; if div overflows, exit with current guess add.w d2,d1 roxr.w #1,d1 ; roxr ensures shifting back carry overflow cmp.w d0,d1 bcs.s @loop2 ; exit with result in d0.w @exit } } Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794 Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN +++++++++++++++++++++++++++ >From usenet@lowry.eche.ualberta.ca (Brian Lowry) Date: 4 May 1994 16:38:23 GMT Organization: Dept. of Chemical Eng., University of Alberta Well, if people are going to start posting square root routines that use division, here's about the simplest possible, coded in C: #define precLim 1e-8 //for good single precision performance float sqrt(float x); float sqrt(float x) { register float d = 0.5*(1.0 - x); register float b = 1.0 - d; while ((d > precLim)||(d < precLim)) { d *= 0.5*d/b; b -= d;} return b; } This takes about three to five iterations, but on the PowerPC that amounts to a whopping 100 to 170 clock cycles. Since that's enough cycles for about 80 multiplies and/or flops, division is definitely not the best route here (though it's certainly compact...) Brian Lowry +++++++++++++++++++++++++++ >From parkb@bigbang.Stanford.EDU (Brian Park) Date: 4 May 1994 20:43:37 GMT Organization: Stanford University I am assuming that the original poster wanted integer in/ integer out since he only wanted accurary within 1. The following is a description of code which follows at the end of this article. lsqrt2() - traditional Newton's method - ------ The traditional Newton's method of getting y = sqrt(n), coded as lsqrt2() below) n = y^2 y(0) = n y(i+1) = (y(i) + n/y(i))/2 is second-order convergent, meaning that the relative error in the i+1 iteration is proportional to the square of the previous error (remember, at sufficiently large iteration, relative error << 1). [Do a Taylor series exansion to verify that the first order term vanishes]. Hence, for a fixed tolerance, in this case, within unity (ie. error of 1/n), the algorithm performs a bisection on the error at every iteration. My point is that Newton's method is O(log_2(n)) and you are not going to do much better than that. Binary lookup tables are O(log_2(n)) efficient. The only thing you might save is an integer division compared to the Newton's method (see n/y(i) above). But it also globbles up memory and has fixed size. A hash table provides the fastest lookup table, but it takes up even more memory than the binary lookup table, and coding it can be more tricky. Do you really want to use up that much memory for a simple square root function? I should add that Newton's method is guaranteed to converge for all n. lsqrt1() - bit-shift method - ----- There is yet another method (see lsqrt1() function below) that I dreamed up last night, and coded it quickly this morning. It uses bit-shifts to perform an integer square root by taking log_2(n), dividing it by 2 (hence doing a square root) and shifting it back. This routine is also O(log_2(n)) but has the peculiar property that it is _faster_ for larger numbers. You can see why below. It has the disadvantage that it is quite inaccurate for small numbers, but surprisingly accurate for large numbers. lsqrt3() - combination - ------ If you really need accurary within unity, then you can plug the result of the bit-shift method, lsqrt1(), use it as the initial guess for the Newton's method and converge the answer down to its limiting accurary. This combination is really really fast, since Newton's method converges within an iteration or two when the initial guess is very close to the actual answer. This has been coded as lsqrt3() below. For comparision, the straight Newton's method takes longer because its initial guess is simply the original answer, n. But let me empasize that even with straight Newton's method is O(log_2(n)), Sqrt(10^9) takes only 19 iterations, as you can verify. That's only 19 integer divisions. How fast do you want your square root to be? I have even more ideas on how to speed things up, but I think the code below should get you going. I'm sorry it's not coded in Macintosh, I'm on the Unix box right now, and my mac's at home. It hasn't been tested extensively on the boundary points (ie. 0,1, 2^32-1), but you can do that. My final comment concerns the O(sqrt(n)) algorithm that was posted a short while ago. Although it is simple and cute, I would not recommend it when you have so many O(log_2(n)) routines to choose from. Regards, Brian Park - - cut for code below --- #include <stdio.h> #define BITSIZE (sizeof(unsigned long)*8) unsigned long lsqrtng(); unsigned long lsqrt1(); unsigned long lsqrt2(); unsigned long lsqrt3(); /*** Usage: lsqrt n Example: lsqrt 100000 l = 1000000 bit shift: 976 iter = 13 newton: 1000 iter = 2 bit shift + newton: 1000 ***/ main(int argc, char **argv) { unsigned long l; l = atol(argv[1]); printf("l = %lu\n", l); printf("bit shift: %lu\n", lsqrt1(l)); printf("newton: %lu\n", lsqrt2(l)); printf("bit shift + newton: %lu\n", lsqrt3(l)); } /*** Take square root by performing bit-shifts, getting the log_2(a), dividing by 2, and shifting back. ***/ unsigned long lsqrt1(unsigned long a) { register unsigned int i; register unsigned long l = a; if (a==0) return 0; for (i=0; l<(unsigned long)(1L<<(BITSIZE-1)) && i<BITSIZE; ++i) l<<=1; i = (32-i)>>1; a >>= i; return a; } /*** Take square root by using Newton's method with an initial guess of itself. ***/ unsigned long lsqrt2(unsigned long a) { return lsqrtng(a, a); } /*** Newton's method with guess. Newton's method is second order convergence, hence it should be very fast. ***/ unsigned long lsqrtng(register unsigned long a, unsigned long guess) { register unsigned long b, b0; int i = 0; if (a == 0) return 0; b = guess; b0 = a/b; while (labs(b-b0)>1) { /* printf("guess = %lu\n", b);*/ ++i; b0 = b; b=(a/b+b)>>1; } printf("iter = %d\n", i); return b; } /*** Combination Newton's method and bit-shifts. ***/ unsigned long lsqrt3(unsigned long a) { return lsqrtng(a, lsqrt1(a)); } +++++++++++++++++++++++++++ >From sparent@mv.us.adobe.com (Sean Parent) Date: Wed, 4 May 1994 20:46:30 GMT Organization: Adobe Systems Incorporated In article <9405032056.AA35937@geweke.ppp.msu.edu>, gewekean@studentg.msu.edu (Andrew Geweke) wrote: > Actually, I once found this algorithm: > > int intSqrt (int num) { > for (i = j = 1, k = 3; num > j; j += (k+=2), ++i) > ; > > return i; > } This is a difference algorithim and similar algorithms can be used to solve most any polynomial expression. This is the method used by Charles Babage (sp?) in the late 1800's in his difference engine. For example: y = x^3 can be calculated over a range with a given step in x of n by: y = y` + a Where a = (x + n)^3 - x^3. Or a = 3nx^2 + 3xn^2 + n^3. The x^2 term (i) can be similary reduce: i = i` + b Where b = (x + n)^2 - x^2. Or b = 2xn + n^2. Finaly the x term is reduced as: x = x` + n This can be coded up as a loop consisting of only additions. As to the original question. To calculate a square root of an integer on a Mac, I would code it as: inline long LongSqrt(long x) { return FracSqrt(x) >> 15; } FracSqrt is in FixMath.h. This works because squaring a fixed point number doubles the number of fractional bits (so a number of type Fract, which is 2d30 multiplied by a Fract would yield a 4d60 number). Taking the square root of a number then halves the number of bits (to the original precision). So a strait square root of a fract would yield a 1d15 value. Apple's FracSqrt return a Fract so in effect it is calculating sqrt(x) << 15 [with the bits shifted in actually being calculated]. So if you want a number of type Fixed as a result for the sqrt of an integer you could use. inline Fixed FixedSqrtLong(long x) { return FracSqrt(x) << 1; // 0/2 + 15 + 1 = 16 } Or a the sqrt of a Fixed yielding a fixed is: inline Fixed FixedSqrt(Fixed x) { return FracSqrt(x) >> 7; // 16/2 + 15 - 7 = 16 } This is a difficult thing to explain, I don't no of any good terminology for talking about fixed point numbers. Play with it some and it will make sense. -- Sean Parent +++++++++++++++++++++++++++ >From jimc@tau-ceti.isc-br.com (Jim Cathey) Date: Wed, 4 May 1994 18:04:33 GMT Organization: Olivetti North America, Spokane, WA In article <busfield.767985832@hurricane> busfield@hurricane.seas.ucla.edu (John D. Busfield) writes: > Does anyone have an algorithm or code for computing square roots. >The emphasis is on speed rather than accuracy in that I only need the >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32). Here's two: A Newton's Method, and a cool bit-banger that's a little faster under certain circumstances. (Timings were all on a 68000 as these are kinda old. Re-test to see what works best now.) ************************************************************************* * * * Integer Square Root (32 to 16 bit). * * * * (Newton-Raphson method). * * * * Call with: * * D0.L = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.L) * * * * Notes: Result fits in D0.W, but is valid in longword. * * Takes from 338 cycles (1 shift, 1 division) to * * 1580 cycles (16 shifts, 4 divisions) (including rts). * * Averages 854 cycles measured over first 65535 roots. * * Averages 992 cycles measured over first 500000 roots. * * * ************************************************************************* .globl lsqrt * Cycles lsqrt movem.l d1-d2,-(sp) (24) move.l d0,d1 (4) ; Set up for guessing algorithm. beq.s return (10/8) ; Don't process zero. moveq #1,d2 (4) guess cmp.l d2,d1 (6) ; Get a guess that is guaranteed to be bls.s newton (10/8) ; too high, but not by much, by dividing the add.l d2,d2 (8) ; argument by two and multiplying a 1 by 2 lsr.l #1,d1 (10) ; until the power of two passes the modified bra.s guess (10) ; argument, then average these two numbers. newton add.l d1,d2 (8) ; Average the two guesses. lsr.l #1,d2 (10) move.l d0,d1 (4) ; Generate the next approximation(s) divu d2,d1 (140) ; via the Newton-Raphson method. bvs.s done (10/8) ; Handle out-of-range input (cheats!) cmp.w d1,d2 (4) ; Have we converged? bls.s done (10/8) swap d1 (4) ; No, kill the remainder so the clr.w d1 (4) ; next average comes out right. swap d1 (4) bra.s newton (10) done clr.w d0 (4) ; Return a word answer in a longword. swap d0 (4) move.w d2,d0 (4) return movem.l (sp)+,d1-d2 (28) rts (16) end ************************************************************************* * * * Integer Square Root (32 to 16 bit). * * * * (Exact method, not approximate). * * * * Call with: * * D0.L = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.L) * * * * Notes: Result fits in D0.W, but is valid in longword. * * Takes from 122 to 1272 cycles (including rts). * * Averages 610 cycles measured over first 65535 roots. * * Averages 1104 cycles measured over first 500000 roots. * * * ************************************************************************* .globl lsqrt * Cycles lsqrt tst.l d0 (4) ; Skip doing zero. beq.s done (10/8) cmp.l #$10000,d0 (14) ; If is a longword, use the long routine. bhs.s glsqrt (10/8) cmp.w #625,d0 (8) ; Would the short word routine be quicker? bhi.s gsqrt (10/8) ; No, use general purpose word routine. * ; Otherwise fall into special routine. * * For speed, we use three exit points. * This is cheesy, but this is a speed-optimized subroutine! page ************************************************************************* * * * Faster Integer Square Root (16 to 8 bit). For small arguments. * * * * (Exact method, not approximate). * * * * Call with: * * D0.W = Unsigned number. * * * * Returns: * * D0.W = SQRT(D0.W) * * * * Notes: Result fits in D0.B, but is valid in word. * * Takes from 72 (d0=1) to 504 (d0=625) cycles * * (including rts). * * * * Algorithm supplied by Motorola. * * * ************************************************************************* * Use the theorem that a perfect square is the sum of the first * sqrt(arg) number of odd integers. * Cycles move.w d1,-(sp) (8) move.w #-1,d1 (8) qsqrt1 addq.w #2,d1 (4) sub.w d1,d0 (4) bpl qsqrt1 (10/8) asr.w #1,d1 (8) move.w d1,d0 (4) move.w (sp)+,d1 (12) done rts (16) page ************************************************************************* * * * Integer Square Root (16 to 8 bit). * * * * (Exact method, not approximate). * * * * Call with: * * D0.W = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.W) * * * * Uses: D1-D4 as temporaries -- * * D1 = Error term; * * D2 = Running estimate; * * D3 = High bracket; * * D4 = Loop counter. * * * * Notes: Result fits in D0.B, but is valid in word. * * * * Takes from 544 to 624 cycles (including rts). * * * * Instruction times for branch-type instructions * * listed as (X/Y) are for (taken/not taken). * * * ************************************************************************* * Cycles gsqrt movem.l d1-d4,-(sp) (40) move.w #7,d4 (8) ; Loop count (bits-1 of result). clr.w d1 (4) ; Error term in D1. clr.w d2 (4) sqrt1 add.w d0,d0 (4) ; Get 2 leading bits a time and add addx.w d1,d1 (4) ; into Error term for interpolation. add.w d0,d0 (4) ; (Classical method, easy in binary). addx.w d1,d1 (4) add.w d2,d2 (4) ; Running estimate * 2. move.w d2,d3 (4) add.w d3,d3 (4) cmp.w d3,d1 (4) bls.s sqrt2 (10/8) ; New Error term > 2* Running estimate? addq.w #1,d2 (4) ; Yes, we want a '1' bit then. addq.w #1,d3 (4) ; Fix up new Error term. sub.w d3,d1 (4) sqrt2 dbra d4,sqrt1 (10/14) ; Do all 8 bit-pairs. move.w d2,d0 (4) movem.l (sp)+,d1-d4 (44) rts (16) page ************************************************************************* * * * Integer Square Root (32 to 16 bit). * * * * (Exact method, not approximate). * * * * Call with: * * D0.L = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.L) * * * * Uses: D1-D4 as temporaries -- * * D1 = Error term; * * D2 = Running estimate; * * D3 = High bracket; * * D4 = Loop counter. * * * * Notes: Result fits in D0.W, but is valid in longword. * * * * Takes from 1080 to 1236 cycles (including rts.) * * * * Two of the 16 passes are unrolled from the loop so that * * quicker instructions may be used where there is no * * danger of overflow (in the early passes). * * * * Instruction times for branch-type instructions * * listed as (X/Y) are for (taken/not taken). * * * ************************************************************************* * Cycles glsqrt movem.l d1-d4,-(sp) (40) moveq #13,d4 (4) ; Loop count (bits-1 of result). moveq #0,d1 (4) ; Error term in D1. moveq #0,d2 (4) lsqrt1 add.l d0,d0 (8) ; Get 2 leading bits a time and add addx.w d1,d1 (4) ; into Error term for interpolation. add.l d0,d0 (8) ; (Classical method, easy in binary). addx.w d1,d1 (4) add.w d2,d2 (4) ; Running estimate * 2. move.w d2,d3 (4) add.w d3,d3 (4) cmp.w d3,d1 (4) bls.s lsqrt2 (10/8) ; New Error term > 2* Running estimate? addq.w #1,d2 (4) ; Yes, we want a '1' bit then. addq.w #1,d3 (4) ; Fix up new Error term. sub.w d3,d1 (4) lsqrt2 dbra d4,lsqrt1 (10/14) ; Do first 14 bit-pairs. add.l d0,d0 (8) ; Do 15-th bit-pair. addx.w d1,d1 (4) add.l d0,d0 (8) addx.l d1,d1 (8) add.w d2,d2 (4) move.l d2,d3 (4) add.w d3,d3 (4) cmp.l d3,d1 (6) bls.s lsqrt3 (10/8) addq.w #1,d2 (4) addq.w #1,d3 (4) sub.l d3,d1 (8) lsqrt3 add.l d0,d0 (8) ; Do 16-th bit-pair. addx.l d1,d1 (8) add.l d0,d0 (8) addx.l d1,d1 (8) add.w d2,d2 (4) move.l d2,d3 (4) add.l d3,d3 (8) cmp.l d3,d1 (6) bls.s lsqrt4 (10/8) addq.w #1,d2 (4) lsqrt4 move.w d2,d0 (4) movem.l (sp)+,d1-d4 (44) rts (16) end -- +----------------+ ! II CCCCCC ! Jim Cathey ! II SSSSCC ! ISC-Bunker Ramo ! II CC ! TAF-C8; Spokane, WA 99220 ! IISSSS CC ! UUCP: uunet!isc-br!jimc (jimc@isc-br.isc-br.com) ! II CCCCCC ! (509) 927-5757 +++++++++++++++++++++++++++ >From lasley@umdsp.umd.edu (Scott E. Lasley) Date: Wed, 4 May 1994 20:25:02 -0400 Organization: Space Physics Group, University of Maryland In article <rmah-030594174155@rmah.dialup.access.net>, rmah@panix.com (Robert S. Mah) writes: > There is an article on generating very fast square root aproximations > in the book "Graphics Gems", Ed. Glassner. > > It generates a lookup table which is indexed by munging the exponent > of the argument and then uses the mantissa to do an (linear?) > aproximation to the final result. It's fairly accurate to within a > few decimal points. I think the source code is also available > somewhere on the net, but I'm afraid I don't know where. some (perhaps all) of the code in the Graphics Gems series of books can be found at ftp.princeton.edu. here is a URL ftp://ftp.princeton.edu//pub/Graphics/GraphicsGems there are directories for books I through IV. the ReadMe file in the GemsIII directory contains the following line: sqrt.c II.1 Steve Hill, "IEEE Fast Square Root" hope this helps, lasley@umdsp.umd.edu "... and there were bumper stickers saying "Free Love" like it was some kind of political prisoner..." David Wilcox +++++++++++++++++++++++++++ >From nagle@netcom.com (John Nagle) Date: Thu, 5 May 1994 02:23:34 GMT Organization: NETCOM On-line Communication Services (408 241-9760 guest) busfield@hurricane.seas.ucla.edu (John D. Busfield) writes: > Does anyone have an algorithm or code for computing square roots. >The emphasis is on speed rather than accuracy in that I only need the >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32). >Thanks in advance. The classic for floating point is to check the exponent and if odd, shift the mantissa right one bit. Then halve the exponent. Use the high 4 bits of the mantissa to select a starting mantissa from a table of 16 values. Then execute Newton's Method for two or three iterations, written out in line. On a machine with an FPU, this beats most iterative methods. Try it on a PPC. This may actually be faster than an iterative integer method. John Nagle +++++++++++++++++++++++++++ >From Ron_Hunsinger@bmug.org (Ron Hunsinger) Date: Thu, 5 May 94 05:35:48 PST Organization: Berkeley Macintosh Users Group parkb@bigbang.Stanford.EDU (Brian Park) writes: >lsqrt2() - traditional Newton's method >-------- >The traditional Newton's method of getting y = sqrt(n), coded as lsqrt2() >below) > >n = y^2 >y(0) = n >y(i+1) = (y(i) + n/y(i))/2 > >is second-order convergent, You can do square root much faster than this. Square root is about as complex, and takes about as much time, as a divide. A square root is like a division in which the quotient and divisor are equal. The usual division algorithm knows the divisor from the beginning, and discovers the quotient a bit at a time. You just have to tweak the algorithm a little so that it discovers the divisor and quotient in parallel, a bit at a time. Think about how you would calculate square root by hand. Then try to do it in binary. Notice how easy it becomes. >My point is that Newton's method is O(log_2(n)) and you are not going >to do much better than that. Newton's method is O(log n) ITERATIONS, but you do a divide in each iteration. If division takes O(log n) shift-subtracts, then you have a run time of O((log n)^2) shift-subtracts, compared to only O(log n) shift-subtracts using a modified divide algorithm. [Not that it really matters. You aren't writing an arbitrary-precision square root routine. You have a fixed upper bound on the size of the input, and ANY operation with only a finite number of valid inputs can be implemented in O(1) time.] [And if you have a very fast divide, as fast as a shift-subtract, as might happen if you use a hardware divide instruction on a RISC computer, then both methods might have nearly the same speed. But if shifting is faster than division, Newton's method is going to lose.] -Ron Hunsinger +++++++++++++++++++++++++++ >From afcjlloyd@aol.com (AFC JLloyd) Date: 5 May 1994 12:37:02 -0400 Organization: America Online, Inc. (1-800-827-6364) In article <2q91dp$rf@nntp2.Stanford.EDU>, parkb@bigbang.Stanford.EDU (Brian Park) writes: >>> But let me empasize that even with straight Newton's method is O(log_2(n)), Sqrt(10^9) takes only 19 iterations, as you can verify. That's only 19 integer divisions. How fast do you want your square root to be? <<< If you simply provide a better starting estimate you can reduce your 19 iterations down to a maximum of 5. A very cheap way to provide a better estimate is to find the most significant bit and use it to estimate log2(N) and thus sqrt(N). I don't know about you, but when I want fast code I'll take a factor of four anyway I can get. However, if you're really out for speed, it turns out that Newton's method is not the best. Yesterday, pottier@corvette.ens.fr (Francois Pottier) posted an algorithm by (Dr.) John Bruner, bruner@csrd.uiuc.edu: >>>> unsigned int isqrt(unsigned long x) { unsigned long r, nr, m; r = 0; m = 0x40000000; do { nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; } while (m != 0); return(r); } <<<< (I've removed two lines of code that round the result up if rounding to nearest integer is desired). This is an incredibly great algorithm, which uses exactly 16 iterations but no integer multiply or divides. However, if you really want the fastest possible code, you should rewrite it like this: unsigned short isqrt(register unsigned long x) { register unsigned long r; register unsigned long nr; register unsigned long m; r = 0; m = 0x40000000; nr = m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; return(r); } This code is 40% faster (on my 840av) than code that I have been using that used 4 iterations of Newton-Raphson after seeding along the lines I mentioned above. It may be possible to speed this up even further by using the fast estimate of Log2(N) to reduce the number of iterations from 16 to the absolute minimum (ceil(Log2(N)/2), though I expect the improvement to be minimal since each iteration of this algorithm is so cheap. Jim Lloyd afcjlloyd@aol.com +++++++++++++++++++++++++++ >From trussler@panix.com (Tom Russler) Date: Fri, 06 May 1994 00:38:31 -0400 Organization: PANIX Public Access Internet and UNIX (NYC) busfield@hurricane.seas.ucla.edu (John D. Busfield) writes: > Does anyone have an algorithm or code for computing square roots. >The emphasis is on speed rather than accuracy in that I only need the >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32). >Thanks in advance. In 68000 assembly in THINK C: move.l number, d0 move.l #1, d1 move.l d0, d2 @1 cmp.l d1, d2 ble @2 lsl.l #1, d1 lsr.l #1, d2 bra @1 @2 add.l d2, d1 lsr.l #1, d1 divu d1, d0 add.w d1, d0 add.w #1, d0 lsr.w #1, d0 -- Tom Russler trussler@panix.com +++++++++++++++++++++++++++ >From zstern@adobe.com (Zalman Stern) Date: Fri, 6 May 1994 04:35:37 GMT Organization: Adobe Systems Incorporated Ron_Hunsinger@bmug.org (Ron Hunsinger) writes > [And if you have a very fast divide, as fast as a shift-subtract, as > might happen if you use a hardware divide instruction on a RISC computer, This is not even close to true for any RISC implementation I know of. If the divisor is loop-invariant, one can multiply by the reciprocal which may be a single cycle operation, however this is not the case in the example being discussed. Most RISC architecture now include a floating-point square root instruction. This is partially due to this operations similarity to division (i.e. can use the same hardware) and partially due to its common appearance in electrical CAD codes :-) I believe the integer square root routine I use is based on the algorithm Ron referred to. It has the following comment at the top (thanks to Mark Hamburg): /* * A square-root routine. See Dijkstra, "A Discipline of Programming", * page 65. */ I threw in an optimization to use the PowerPC count-leading-zeros instruction to efficiently reduce the number of iterations. (There is a similar instruction on the 68K.) The result is competitive with a lookup table based method for computing sqrt(x^2 + y^2) that uses a divide. (I'd post the code, but I didn't write it and I'm sure Adobe would frown on my doing so. Besides, reading Dijkstra is good for you :-)) -- Zalman Stern zalman@adobe.com (415) 962 3824 Adobe Systems, 1585 Charleston Rd., POB 7900, Mountain View, CA 94039-7900 There is no lust like the present. +++++++++++++++++++++++++++ >From hall_j@sat.mot.com (Joseph Hall) Date: Thu, 5 May 1994 21:55:49 GMT Organization: Motorola Inc., Satellite Communications Seems it was nagle@netcom.com (John Nagle) who said: >busfield@hurricane.seas.ucla.edu (John D. Busfield) writes: >> Does anyone have an algorithm or code for computing square roots. >>The emphasis is on speed rather than accuracy in that I only need the >>result rounded to the nearest integer (ie sqrt(1000) = 31 or 32). >>Thanks in advance. > > The classic for floating point is to check the exponent and if odd, >shift the mantissa right one bit. Then halve the exponent. [...] You can consult Plaugher's "The Standard C Library" for a portable floating point sqrt. For a positive real number, the algorithm is: 1) separate exponent and mantissa -- basically bit-masking except in cases of gradual underflow 2) compute a quadratic least-squares fit to sqrt for the mantissa (y = (-.1984742*x+.8804894)*x + .3176687) 3) follow with 3 iterations of Newton's method (y = .5*(y+x/y)) 4) if exponent is odd, a) multiply mantissa by sqrt(2) and b) subtract 1 from exponent 5) divide exponent by 2 6) reassemble exponent and mantissa Right-shifting the mantissa for odd exponents before computing the root, to eliminate the multiplication by sqrt(2), is harder to express portably, but is certainly possible. Plaugher doesn't make any particular efforts to replace things like ".5 * x" with shifts. Handy little book. ISBN 0-13-131509-9. (begin digression) Plaugher uses conventional floating-point techniques to compute transcendental functions. This is pretty efficient, and these days may be as good as any technique. However, there are also algorithms that allow you to compute trigonometric functions, as well as exponent and log, by using only shifts and adds/subtracts--as if you were doing a divide. In fact, the computational effort is just twice that of a divide. I have seen relatively few references to these techniques, but they are easy to implement. For example, to compute an exponent, you need a table of ln(101(2)), ln(11(2)), ln(1(2)), ln(1.1(2)), ln(1.01(2)), ln(1.001(2)), etc. Then you can code a fixed-point version like this (please excuse typos since I can't paste this directly in here at the moment): unsigned long logTbl[] = { /* binary ln((2^i + 65536)/65536), for i == 32..1), e.g. */ 0xa65b1, 0x9b441, 0x902d3 ... /* ln 2147549184/65536, ln 1073807360/65536, ln 536936448/65536 */ }; unsigned long fexp(unsigned long x) { long i = 1L, j = 0L, k = 0x10000L, l = 0L, *lt = logTbl - 1; while (i <= 32) { j = l + lt[i]; if (j > x) { i++; } else { l = j; k += k >> i; } } return k; } This computes the exponent iteratively in k. The log of the approximation is maintained in l. If l + the current entry in the table is <= than the value you are computing the exponent of, then you add the log in the table to l, and multiply k by a number that is conveniently in the form 1 + some power of 2. Repeat until the table is exhausted. These days this may not be any faster than using the coprocessor. However, this technique was occasionally used to write very high-performance libraries for 8- and early 16-bit processors. I doubt it was ever used in any commercial BASIC, though, because it is not well known, and BASIC would have been a hell of a lot faster for numerical work if it had been. I first heard this technique described in the mid 70s, but the literature goes back to at least the 60s. I wonder if this technique has ever been used to implement transcendental functions in microcode. At one time I was thinking of trying to get it working on a TI bit slice CPU, that is, build my own coprocessor. :-) -- Joseph Nathan Hall | Joseph's Law of Interface Design: Never give your users Software Architect | a choice between the easy way and the right way. Gorca Systems Inc. | joseph@joebloe.maple-shade.nj.us (home) (on assignment) | (602) 732-2549 (work) Joseph_Hall-SC052C@email.mot.com +++++++++++++++++++++++++++ >From Ron_Hunsinger@bmug.org (Ron Hunsinger) Date: Sat, 7 May 94 13:41:39 PST Organization: Berkeley Macintosh Users Group parkb@bigbang.Stanford.EDU (Brian Park) writes: >But let me empasize that even with straight Newton's method is O(log_2(n)), >Sqrt(10^9) takes only 19 iterations, as you can verify. That's only 19 >integer divisions. How fast do you want your square root to be? I want it to be as fast as a single integer division. Why take 19 times as long as I have to? zstern@adobe.com (Zalman Stern) writes: >Ron_Hunsinger@bmug.org (Ron Hunsinger) writes >> [And if you have a very fast divide, as fast as a shift-subtract, as >> might happen if you use a hardware divide instruction on a RISC computer, > >This is not even close to true for any RISC implementation I know of. Glad to hear it. I only included the comment because I did not have instruction timings handy, and I was afraid someone was going to come along and say I needn't worry about avoiding divides because on the PowerPC (or some other RISC computer) EVERYTHING, including divide, took only a single cycle, and so divide was just as fast as shift. I knew shift couldn't be slower than divide, and if it really is faster (which is reasonable), then finding a square root algorithm that does not do any divides is a worthwhile objective. >I believe the integer square root routine I use is based on the algorithm >Ron referred to. It has the following comment at the top (thanks to Mark >Hamburg): > >/* > * A square-root routine. See Dijkstra, "A Discipline of Programming", > * page 65. > */ A distinct possibility, since I read all the Dijkstra books I could get my hands on, way back when. However, I knew that square root could be done as quickly as division at least as far back as high school, before I had ever heard of Dijkstra (or seen a computer for that matter). As I said, the method suggests itself as soon as you assail to calculate square root using the traditional pencil and paper method in binary. >I threw in an optimization to use the PowerPC count-leading-zeros >instruction to efficiently reduce the number of iterations. (There is a >similar instruction on the 68K.) There is? Is this a 68040 instruction? I can't find any such instruction at least through the 68030. I thought the 68040 was just a 68030 with a bigger cache and a built-in FPU. Sure would be nice, though. The Unisys A-series computers have a similar instruction, accessible through the built-in FIRSTONE function in Burroughs Extended Algol, which I found many interesting (and not always obvious) uses for. >The result is competitive with a lookup >table based method for computing sqrt(x^2 + y^2) that uses a divide. (I'd >post the code, but I didn't write it and I'm sure Adobe would frown on my >doing so. Besides, reading Dijkstra is good for you :-)) No problem. I'll post mine, which I wrote from scratch before checking your Dijkstra reference. My routine and his, neither of which uses divides (except by 2 or 4), bear only a vague similarity. (But reading Dijkstra is still good for you.) The whole thing is as fast as a single divide done by subroutine (as, for example, would be required to divide 32-bit integers on a 68000). Normally, a division subroutine will keep the divisor fixed and shift the dividend past it, doing subtractions (and incrementing the quotient) as needed. My first thought was to do the same thing, but that's messy to do in C. (In assembler, you would use LSL, ROXL to shift bits out of one register into another; but that doesn't map well to C.) So instead, I keep the dividend fixed, and shift the divisor right. Just as in the pencil and paper square root method, you take the digits (bits) of the argument two at a time. But since the quotient (a.k.a. the square root) is accumulating bits one at a time in the opposite direction, it ends up only shifting one bit at a time. Without further comment, my version (which is meant to be fast, not necessarily readable) follows: #define BITS 32 // BITS is the smallest even integer such that 2^BITS > ULONG_MAX. // If you port this routine to a computer where 2^(BITS-1) > ULONG_MAX // (Unisys A-series, for example, where integers would generally be // stored in the 39-bit mantissa of a 48-bit word), unwind the first // iteration of the loop, and special-case references to 'half' in that // first iteration so as to avoid overflow. unsigned long lsqrt (unsigned long n) { // Compute the integer square root of the integer argument n // Method is to divide n by x computing the quotient x and remainder r // Notice that the divisor x is changing as the quotient x changes // Instead of shifting the dividend/remainder left, we shift the // quotient/divisor right. The binary point starts at the extreme // left, and shifts two bits at a time to the extreme right. // The residue contains n-x^2. (Within these comments, the ^ operator // signifies exponentiation rather than exclusive or. Also, the / // operator returns fractions, rather than truncating, so 1/4 means // one fourth, not zero.) // Since (x + 1/2)^2 == x^2 + x + 1/4, // n - (x + 1/2)^2 == (n - x^2) - (x + 1/4) // Thus, we can increase x by 1/2 if we decrease (n-x^2) by (x+1/4) unsigned long residue; // n - x^2 unsigned long root; // x + 1/4 unsigned long half; // 1/2 residue = n; // n - (x = 0)^2, with suitable alignment root = 1 << (BITS - 2); // x + 1/4, shifted left BITS places half = root + root; // 1/2, shifted left BITS places do { if (root <= residue) { // Whenever we can, residue -= root; // decrease (n-x^2) by (x+1/4) root += half; } // increase x by 1/2 half >>= 2; // Shift binary point 2 places right root -= half; // x{+1/2}+1/4 - 1/8 == x{+1/2}+1/8 root >>= 1; // 2x{+1}+1/4, shifted right 2 places } while (half); // When 1/2 == 0, binary point is at far right // Omit the following line to truncate instead of rounding if (root < residue) ++root; return root; // Guaranteed to be correctly rounded (or truncated) } Enjoy, -Ron Hunsinger +++++++++++++++++++++++++++ >From Ron_Hunsinger@bmug.org (Ron Hunsinger) Date: Sun, 8 May 94 04:45:40 PST Organization: Berkeley Macintosh Users Group trussler@panix.com (Tom Russler) writes: > > busfield@hurricane.seas.ucla.edu (John D. Busfield) writes: > > Does anyone have an algorithm or code for computing square roots. > >The emphasis is on speed rather than accuracy in that I only need the > >result rounded to the nearest integer (ie sqrt(1000) = 31 or 32). > >Thanks in advance. > >In 68000 assembly in THINK C: > > move.l number, d0 > move.l #1, d1 > move.l d0, d2 > @1 cmp.l d1, d2 > ble @2 > lsl.l #1, d1 > lsr.l #1, d2 > bra @1 > @2 add.l d2, d1 > lsr.l #1, d1 > divu d1, d0 > add.w d1, d0 > add.w #1, d0 > lsr.w #1, d0 Sorry. Valiant effort, but no cigar. You don't handle boundary conditions correctly. Examine what happens when 'number' is 0xFFFFaaaa, where aaaa is any arbitrary bit pattern. > move.l number, d0 ; d0 = 0xFFFFaaaa > move.l #1, d1 ; d1 = 0x00000001 > move.l d0, d2 ; d2 = 0xFFFFaaaa > @1 cmp.l d1, d2 > ble @2 > lsl.l #1, d1 > lsr.l #1, d2 > bra @1 ; after 16 iterations, you break out of the above loop, with: ; d1 = 0x00010000 = 65536 ; d2 = 0x0000FFFF = 65535 > @2 add.l d2, d1 ; d1 = 0x0001FFFF > lsr.l #1, d1 ; d1 = 0x0000FFFF = 65535 > divu d1, d0 <-----OVERFLOW: d0/d1 >= 65536, to big for a short In fact, you should have seen that overflow was a problem. DIVU will always overflow, no matter what the divisor, if the dividend is greater than 65535*65535 = 0xFFFE0001. You may have been counting on the fact that, if overflow is detected, the operands are unaffected, but that doesn't help. The contents of d0.w after the divide is aaaa, which was a completely arbitrary value. Continuing: > add.w d1, d0 <-----OVERFLOW AGAIN: Since d1=65535, the > add.w #1, d0 ; effect of these two instruction is to ; overflow at some point, but otherwise ; do nothing. d0 still contains aaaa > lsr.w #1, d0 ; divide the garbage by 2. The final result is in the range 0..32767. The correct result should be 65535 (if aaaa is 0000) or 65536 (for any other value of aaaa). Again, you should have seen this coming. Your final instruction can never return a value >32767, meaning that you cannot possibly return the correct result for any number > 32767*32768 = 0x3FFF8000. You also do not achieve the stated accuracy goal, "I [only] need the result rounded to the nearest integer", even when you don't overflow. Suppose 'number' = 0x10004000 = 268451840. Your code computes a value of 16794. The correctly rounded value is 16384. Off by 410 is not "the nearest integer". -Ron "who gets very picky sometimes :)" Hunsinger +++++++++++++++++++++++++++ >From fixer@faxcsl.dcrt.nih.gov (Chris Gonna' Find Ray Charles Tate) Date: Mon, 9 May 1994 14:04:10 GMT Organization: DCRT, NIH, Bethesda, MD In article <0014A722.fc@bmug.org>, Ron_Hunsinger@bmug.org (Ron Hunsinger) writes: > >zstern@adobe.com (Zalman Stern) writes: >> >>I threw in an optimization to use the PowerPC count-leading-zeros >>instruction to efficiently reduce the number of iterations. (There is a >>similar instruction on the 68K.) > >There is? Is this a 68040 instruction? I can't find any such instruction >at least through the 68030. I thought the 68040 was just a 68030 with >a bigger cache and a built-in FPU. I think you're looking for the BFFFO instruction, which my 680x0 family reference describes as "Find First One in Bit Field," and says it's a 60820 instruction (and later, of course). The book I'm looking in is Motorola's "M68000 Family Programmer's Reference Manual," which doesn't seem to have any Library of Congress registration (!), but which is Motorola part number M68000PM/AD. - ------------------------------------------------------------- Christopher Tate | "So he dropped the heart - MSD, Inc. | the floor's clean." fixer@faxcsl.dcrt.nih.gov | - Sidney Harris +++++++++++++++++++++++++++ >From christer@cs.umu.se (Christer Ericson) Date: Tue, 10 May 1994 10:24:38 GMT Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden In <00148DC5.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes: > >parkb@bigbang.Stanford.EDU (Brian Park) writes: >>[...suggests using Newton's method...] > >You can do square root much faster than this. I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000- based implementation of Newton's method (with quirks) which is something like 20-30% faster than the equivalent hand-coded optimized assembly version of the routine you posted in another article. It all depends on how good your initial guess for Newton's method is. So there! :-) Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794 Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN +++++++++++++++++++++++++++ >From afcjlloyd@aol.com (AFC JLloyd) Date: 10 May 1994 18:01:03 -0400 Organization: America Online, Inc. (1-800-827-6364) In article <CpL0x4.HKJ@cs.umu.se>, christer@cs.umu.se (Christer Ericson) writes: >In <00148DC5.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes: >> >>parkb@bigbang.Stanford.EDU (Brian Park) writes: >>>[...suggests using Newton's method...] >> >>You can do square root much faster than this. > >I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000- >based implementation of Newton's method (with quirks) which is something >like 20-30% faster than the equivalent hand-coded optimized assembly >version of the routine you posted in another article. It all depends on >how good your initial guess for Newton's method is. I agree with Christer. Until this thread I had been using Newton's method with 5 iterations for obtaining the square root of unsigned longs, where the initial estimate came from finding the most signficant bit, using that to estimate Log2(n), and using that to estimate sqrt(n). When I saw this thread I tried the code attributed to Dr. John Bruner (similar to Ron's code) and saw a 40% speedup over my Newton's method code. Convinced that this was a precious find, I then tried to apply the same technique to taking the square root of Fixed point values and discovered that it couldn't beat my Newton's method code. The reason was primarily that 24 (instead of 16) iterations are required for fixed point square root (the Fract case would require 31 iterations). This caused me to do more thinking about Newton's method. The great thing about Newton's method is that it _doubles_ the precision with each iteration. Of course, double of nothing is nothing, so Newton's method is worthless if you give it a poor estimate, but given a good estimate Newton's method is fantastic. So, I decided to use a table lookup to provide the initial estimate for Newton's method. I found that I could reduce the number of iterations down from five to _one_ for the integer case, and just _two_ iterations for both the Fixed and Fract cases. The resulting code is significantly faster than any of my previous attempts. Jim Lloyd afcjlloyd@aol.com +++++++++++++++++++++++++++ >From Ron_Hunsinger@bmug.org (Ron Hunsinger) Date: Sat, 14 May 94 03:01:56 PST Organization: Berkeley Macintosh Users Group christer@cs.umu.se (Christer Ericson) writes: >In <00148DC5.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes: >> >>parkb@bigbang.Stanford.EDU (Brian Park) writes: >>>[...suggests using Newton's method...] >> >>You can do square root much faster than this. > >I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000- >based implementation of Newton's method (with quirks) which is something >like 20-30% faster than the equivalent hand-coded optimized assembly >version of the routine you posted in another article. It all depends on >how good your initial guess for Newton's method is. You've got me mixed up with somebody else. I did say you can do square root much faster, and explained how, but I didn't post any assembly version. I posted the C version (which therefore has the advantage that it can be compiled native for PowerPC, or ported easily to another machine that doesn't even have to be binary, although of course it is unlikely that shifting will be fast on a non-binary machine). I saw two assembly versions. One, which was quite long, very tightly optimized, and attributed to Motorola, was indeed faster than my short simple C version, but not by all that much. Also, it always truncated, where mine would produce the correctly rounded result (or the truncated result by commenting out one line if that's really what you wanted). The other, very short assembler version was hopelessly flawed. It computed an initial guess that could be off by a factor of two, and then did *ONE* iteration of Newton's method, and stopped there, producing a value that was not even close (in absolute terms), even when it did not lose all significance due to overflow. If asked to take the square root of zero, it would divide by zero. It probably was fast (when it didn't crash), but who cares how quickly you get the wrong answer? Which version was yours? I'll be charitable and assume it was the Motorola version. I no longer have the original message, but I seem to recall the max running time was something like 1262 cycles on a 68000, and (quite a bit) faster on small values. I modified my C version to special-case small values too, and unwind the iterations of the main loop up to the iteration that finds the high bit of the root. The main benefit of special-casing small values is so I don't have to worry about that boundary in the remaining code -- I am then guaranteed that the (unrounded) root is at least two bits long. And although I am not displeased by the improved efficiency, my main reason for unwinding the first iterations was to remove the restriction in the prior version that it would only work on machines in which the number of bits in an unsigned long was even. This code will now work even on machines with odd integer sizes. (Yes, Virginia, such machines do exist! An example is the Unisys A-series computers, in which ALL arithmetic is done in floating point, and integers are just the special case where the exponent is zero and the integer value is stored in the 39-bit mantissa of a 48-bit word.) My improved C version appears at the end of this message. I compiled it with MPW C with no special optimizations, tested it thoroughly, and then disassembled the code and calculated the running time on a 68000. (On later machines, the presence of the cache makes accurate timings much more difficult to do by hand.) I did not make any attempt to "improve" the C-generated code, although there are some easy and obvious optimizations available. The running time for my version (the one that correctly rounds), counting the RTS and the (quite unnecessary) LINK/MOVEM.L/.../MOVE.L <rslt>,D0/MOVEM.L/UNLK is: If N <= 12 then 204+4N, else 620+46u+32z+6r, where N = input argument u = Number of one-bits in the unrounded root z = Number of non-leading zero-bits in the unrounded root r = 1 if rounding increases the root, else 0 The actual running time for various values of N is: N Time Comments --------- ----- ------------------------- 0 204 Best time 13 718 Best time not using lookup table 625 822 This is a cut-point for the Motorola code 0..65535 935 Average* time over inputs that fit in 16 bits 65535 994 Worst time if N fits in 16 bits 0..0xFFFFFFFF 1247 Average* time over all 32-bit inputs 0xFFFFFFFF 1362 Worst case *average times computed assuming uniform distrubution of outputs If my memory serves me right, this is only about 8% slower than the hand-crafted Motorola code, which I think is a cheap price to pay for the portability. Especially considering that most of that difference is due to the fact that they split the general loop into a main loop that could get away with word-size operations for some variables, and a few unwound iterations that had to use long-sized operands. Notice that this is counterproductive on 68020 and up, where long operations are just as fast as word operations when the operands are in registers, and the only effect of unwinding the last two iterations is to guarantee that they won't be in the cache. The compiled 68K code fits in 104 bytes (decimal). (96 bytes without the debugger symbol. This really was a vanilla compile.) How big was your code? How big is the code cache on your machine? How much of your caller's code do you push out of the cache? >So there! :-) Same to you!! :-) #define lsqrt_max4pow (1UL << 30) // lsqrt_max4pow is the (machine-specific) largest power of 4 that can // be represented in an unsigned long. unsigned long lsqrt (unsigned long n) { // Compute the integer square root of the integer argument n // Method is to divide n by x computing the quotient x and remainder r // Notice that the divisor x is changing as the quotient x changes // Instead of shifting the dividend/remainder left, we shift the // quotient/divisor right. The binary point starts at the extreme // left, and shifts two bits at a time to the extreme right. // The residue contains n-x^2. (Within these comments, the ^ operator // signifies exponentiation rather than exclusive or. Also, the / // operator returns fractions, rather than truncating, so 1/4 means // one fourth, not zero.) // Since (x + 1/2)^2 == x^2 + x + 1/4, // n - (x + 1/2)^2 == (n - x^2) - (x + 1/4) // Thus, we can increase x by 1/2 if we decrease (n-x^2) by (x+1/4) unsigned long residue; // n - x^2 unsigned long root; // x + 1/4 unsigned long half; // 1/2 residue = n; // n - (x = 0)^2, with suitable alignment // if the correct answer fits in two bits, pull it out of a magic hat #ifndef lsqrt_truncate if (residue <= 12) return (0x03FFEA94 >> (residue *= 2)) & 3; #else if (residue <= 15) return (0xFFFEAA54 >> (residue *= 2)) & 3; #endif root = lsqrt_max4pow; // x + 1/4, shifted all the way left // half = root + root; // 1/2, shifted likewise // Unwind iterations corresponding to leading zero bits while (root > residue) root >>= 2; // Unwind the iteration corresponding to the first one bit // Operations have been rearranged and combined for efficiency // Initialization of half is folded into this iteration residue -= root; // Decrease (n-x^2) by (0+1/4) half = root >> 2; // 1/4, with binary point shifted right 2 root += half; // x=1. (root is now (x=1)+1/4.) half += half; // 1/2, properly aligned // Normal loop (there is at least one iteration remaining) do { if (root <= residue) { // Whenever we can, residue -= root; // decrease (n-x^2) by (x+1/4) root += half; } // increase x by 1/2 half >>= 2; // Shift binary point 2 places right root -= half; // x{+1/2}+1/4 - 1/8 == x{+1/2}+1/8 root >>= 1; // 2x{+1}+1/4, shifted right 2 places } while (half); // When 1/2 == 0, bin. point is at far right #ifndef lsqrt_truncate if (root < residue) ++root; // round up if (x+1/2)^2 < n #endif return root; // Guaranteed to be correctly rounded (or truncated) } -Ron Hunsinger +++++++++++++++++++++++++++ >From christer@cs.umu.se (Christer Ericson) Date: Mon, 16 May 1994 08:44:47 GMT Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden In <0014E819.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes: > >christer@cs.umu.se (Christer Ericson) writes: >>[...] >>I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000- >>based implementation of Newton's method (with quirks) which is something >>like 20-30% faster than the equivalent hand-coded optimized assembly >>version of the routine you posted in another article. It all depends on >>how good your initial guess for Newton's method is. > >You've got me mixed up with somebody else. I did say you can do square >root much faster, and explained how, but I didn't post any assembly >version. I posted the C version (which therefore has the advantage >that it can be compiled native for PowerPC, or ported easily to another >machine that doesn't even have to be binary, although of course it is >unlikely that shifting will be fast on a non-binary machine). Gosh! What if you read what I wrote above. I didn't claim your code was written in assembly. I refer rather clearly to highly optimized assembly code _equivalent_ to the C code you posted. >I saw two assembly versions. One, which was quite long, very tightly >optimized, and attributed to Motorola, was indeed faster than my short >simple C version, but not by all that much. Also, it always truncated, >where mine would produce the correctly rounded result (or the truncated >result by commenting out one line if that's really what you wanted). > >The other, very short assembler version was hopelessly flawed. It >computed an initial guess that could be off by a factor of two, and >then did *ONE* iteration of Newton's method, and stopped there, producing >a value that was not even close (in absolute terms), even when it did >not lose all significance due to overflow. If asked to take the square >root of zero, it would divide by zero. It probably was fast (when it >didn't crash), but who cares how quickly you get the wrong answer? > >Which version was yours? Neither of these. Instead of arguing about some program you thought I wrote, wouldn't it have been better (as well as making you look better) if you had looked my article up, or in the case it had expired at your site, requested a copy from me before making yourself silly in public? Instead of commenting on your rather irrelevant comments on some code that wasn't mine, I've appended below a copy of my code and if you care to compare it to your code, I'll take the time to respond to any comments you might have (size, speed, what have you). Fair? Finally, the point I wanted to make was that an optimized version of Newton's method very well beats an optimized version of the type of algorithm your program is a version of. I'm not interested in any "...b-b-b-but C is more portable..." arguments. - - cut here --- unsigned short ce_quick_sqrt(n) register unsigned long n; { asm { ;------------------------- ; ; Routine : ISQRT; integer square root ; I/O parameters: d0.w = sqrt(d0.l) ; Registers used: d0-d2/a0 ; ; This is a highly optimized integer square root routine, based ; on the iterative Newton/Babylonian method ; ; r(n + 1) = (r(n) + A / R(n)) / 2 ; ; Verified over the full interval [0x0L,0xFFFFFFFFL] ; ; Some minor compromises have been made to make it perform well ; on the 68000 as well as 68020/030/040. This routine outperforms ; the best of all other algorithms I've seen (except for a table ; lookup). ; ; Copyright (c) Christer Ericson, 1993. All rights reserved. ; ; Christer Ericson, Dept. of Computer Science, University of Umea, ; S-90187 UMEA, SWEDEN. Internet: christer@cs.umu.se ; ;------------------------- ; ; Macintosh preamble since THINK C passes first register param ; in d7. Remove this section on any other machine ; move.l d7,d0 ;------------------------- ; ; Actual integer square root routine starts here ; move.l d0,a0 ; save original input value for the iteration beq.s @exit ; unfortunately we must special case zero moveq #2,d1 ; init square root guess ;------------------------- ; ; Speed up the process of finding a power of two approximation ; to the actual square root by shifting an appropriate amount ; when the input value is large enough ; ; If input values are word only, this section can be removed ; move.l d0,d2 swap d2 ; do [and.l 0xFFFF0000,d2] this way to... tst.w d2 ; go faster on 68000 and to avoid having to... beq.s @skip8 ; reload d2 for the next test below move.w #0x200,d1 ; faster than lsl.w #8,d1 (68000) lsr.l #8,d0 @skip8 and.w #0xFE00,d2 ; this value and shift by 5 are magic beq.s @skip4 lsl.w #5,d1 lsr.l #5,d0 @skip4 ;------------------------- ; ; This finds the power of two approximation to the actual square root ; by doing the integer equivalent to sqrt(x) = 2 ^ (log2(x) / 2). This ; is done by shifting the input value down while shifting the root ; approximation value up until they meet in the middle. Better precision ; (in the step described below) is gained by starting the approximation ; at 2 instead of 1 (this means that the approximation will be a power ; of two too large so remember to shift it down). ; ; Finally (and here's the clever thing) since, by previously shifting the ; input value down, it has actually been divided by the root approximation ; value already so the first iteration is "for free". Not bad, eh? ; @loop add.l d1,d1 lsr.l #1,d0 cmp.l d0,d1 bcs.s @loop @skip lsr.l #1,d1 ; adjust the approximation add.l d0,d1 ; here we just add and shift to... lsr.l #1,d1 ; get the first iteration "for free"! ;------------------------- ; ; The iterative root value is guaranteed to be larger than (or equal to) ; the actual square root, except for the initial guess. But since the first ; iteration was done above, the loop test can be simplified below ; (Oh, without the bvs.s the routine will fail on most large numbers, like ; for instance, 0xFFFF0000. Could there be a better way of handling these, ; so the bvs.s can be removed? Nah...) ; @loop2 move.l a0,d2 ; get original input value move.w d1,d0 ; save current guess divu.w d1,d2 ; do the Newton method thing bvs.s @exit ; if div overflows, exit with current guess add.w d2,d1 roxr.w #1,d1 ; roxr ensures shifting back carry overflow cmp.w d0,d1 bcs.s @loop2 ; exit with result in d0.w @exit } } Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794 Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN +++++++++++++++++++++++++++ >From Ron_Hunsinger@bmug.org (Ron Hunsinger) Date: Tue, 24 May 94 23:23:06 PST Organization: Berkeley Macintosh Users Group christer@cs.umu.se (Christer Ericson) writes: >In <0014E819.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes: >> >>christer@cs.umu.se (Christer Ericson) writes: >>>[...] >>>I beg to differ. In article <CpA23t.DwM@cs.umu.se> I posted an 68000- >>>based implementation of Newton's method (with quirks) which is something >>>like 20-30% faster than the equivalent hand-coded optimized assembly >>>version of the routine you posted in another article. It all depends on >>>how good your initial guess for Newton's method is. >> >>You've got me mixed up with somebody else. I did say you can do square >>root much faster, and explained how, but I didn't post any assembly >>version. I posted the C version (which therefore has the advantage >>that it can be compiled native for PowerPC, or ported easily to another >>machine that doesn't even have to be binary, although of course it is >>unlikely that shifting will be fast on a non-binary machine). > >Gosh! What if you read what I wrote above. I didn't claim your code was >written in assembly. I refer rather clearly to highly optimized assembly >code _equivalent_ to the C code you posted. I did read what you wrote. You have to admit that it isn't all that clear. One could, and I did, read it that you thought I had posted an "equivalent hand-coded optimized assembly". But as long as we're on the subject of carefully reading each other's notes, lets look again at what I said. I said that Newton/Raphson was not the fastest ALGORITHM for taking square roots, because it uses several division steps, and there is another algorithm (which I described) which is as fast as a single division. I also made it clear that in making that comparison, I was comparing only similar implementations of the two operations: my sqare root method implemented entirely in software is as fast and about the same size as division implemented entirely in software; my square root method implemented entirely in hardware is as fast and uses about the same number of logic gates as division implemented entirely in hardware. Comparing my wholly-software IMPLEMENTATION to your implementation using hardware divide is unfair and irrelevant. I stand by my claim. Square root is no harder or slower than division. And there is an algorithm for square root that will beat Newton/Raphson hands down whenever both algorithms are given similar implementations. To back up my claim, consider that ANY software-only implementation of Newton/Raphson would have to, as a minimum, test for a zero input (to avoid dividing by zero), and do at least one division. That is, it would have to be at least as complex as the following routine to compute a sort-of reciprocal: unsigned long lrecip (unsigned long n) { return (n == 0) ? 0 : 0xFFFFFFFF / n; } On a 68000, the division has to be done in software, because the 68000 does not have a DIVU.L instruction. Examining the code generated by this routine shows that its worst-case running time on a 68000 is 1312 clocks. My square-root routine takes at most 1362 clocks. Close enough? >Finally, the point I wanted to make was that an optimized version of >Newton's method very well beats an optimized version of the type of >algorithm your program is a version of. I'm not interested in any >"...b-b-b-but C is more portable..." arguments. But I DO care that C is more portable, and it is definitely relevant to the comparison. Our versions were not optimized equivalently, because you are focusing only on optimization to a particular piece of silicon that happens to implement division, but not square root, in hardware. Take our implementations unchanged, and run them on a PowerPC, and mine will leave yours in the dust. ("Unchanged" means I get to compile mine native - that's what I bought when I paid the price of being portable. "Unchanged" means you have to have yours emulated - that's what you bought when you decided to write processor-dependent code.) Or take our implementations unchanged and run them on an Intel processor. Watch mine still run at acceptable speed. Watch yours not run at all. I also care that mine meets the design specs. The original requestor said he wanted to compute square root "rounded to the nearest integer". My method rounds, and rounds correctly in every case. Yours, and all the other methods I've seen, truncate. I can truncate too, if that's what you want, but fixing yours to round takes a little work. Essentially, rounding involves comparing the unrounded root with the remainder from the last division. If you implement Newton/Raphson in C or Pascal, that remainder is not available to you, and you have to do at least another multiply and subtract to get it. If done in 68K Assembler, the remainder is available, so rounding won't slow you down much, but it will complicate your logic some, and of course there's that portability issue you don't want to talk about. Either way, remember that rounding may produce a result of 65536, so the routine has to return an unsigned long, just like the requestor asked for, instead of an unsigned short. >I've appended below a copy of my code and if you >care to compare it to your code, I'll take the time to respond to >any comments you might have (size, speed, what have you). Fair? Well OK, as long as you keep in mind we're comparing apples to oranges. Before looking at speed, there is the more important issue of correctness. As nearly as I can tell, you always return the correct value (except for not rounding). I won't quite go so far as to say the code is all correct, because you make some early mistakes that get compensated for later: o You test for a zero input by doing a BEQ.S right after moving the input to register A0 for safekeeping. Are you aware that moving to an A register does not affect the condition codes? Fortunately, the condition codes are still set correctly from the prior move of the same data from D7 to D0. However, you have a comment that says that move can be eliminated. If someone does, they might be in trouble. o Your logic to do large initial shifts is faulty. On some inputs (like 0x02000000) you overshift. On other inputs (in the range 512..65535) I think you are not shifting as far as you think you are shifting. (The logic around label @skip8 is muddled.) Fortunately, the Newton method is robust enough to compensate for a bad initial guess, and you still arrive at the correct answer, although not as efficiently as you could have. Maybe you don't care about either of these points - after all, you already said you didn't care about portability - but has no one ever pointed out to you that is much easier to write provably correct code in a high level language? Does the fact that your code does not work the way you think it does suggest anything? It's hard to estimate your average running time, because there are so many paths through the optimization and it's difficult to assign proper weights to each path, so I'll settle for comparing worst cases. o Your worst case is with the input 0x02000000, for which you have to do 4 divisions. You take a total of 994 clocks on a 68000, or 459 clocks on a 68020. o My worst case is with the input 0xFFFFFFFF. I take 1362 clocks on a 68000, 609 on a 68020. 1362 / 994 = 1.37. You are 37% faster than I am on a 68000. 609 / 459 = 1.33. You are 33% faster than I am on a 68020. Truth be told, I'm not unhappy that my portable software solution fares so well against your non-portable hardware solution, even on the hardware of your choice. Your speed advantage stems from the fact that you are coding in Assembler, not from any advantage of the algorithm. If you implemented your algorithm in C, then either your code would be calling a division subroutine (and your performance would go to hell), or you could specifically tell the compiler to generate 68020 code to get inline divide instructions. But then, besides no longer running on a 68000, you would still come out behind because the compiler would be compelled to generate a DIVU.L instruction instead of your DIVU.W, and that difference alone is enough to dissipate all your advantage. Add the extra multiply you need to do correct rounding and you're behind. -Ron Hunsinger +++++++++++++++++++++++++++ >From christer@cs.umu.se (Christer Ericson) Date: Thu, 2 Jun 1994 08:17:18 GMT Organization: Dept. of Computing Science, Umea Univ., 901 87 Umea, Sweden In <00155867.fc@bmug.org> Ron_Hunsinger@bmug.org (Ron Hunsinger) writes: >[...] I also made it clear that in >making that comparison, I was comparing only similar implementations >of the two operations: my sqare root method implemented entirely in >software is as fast and about the same size as division implemented >entirely in software; my square root method implemented entirely in >hardware is as fast and uses about the same number of logic gates as >division implemented entirely in hardware. OK, I can agree with that. But apart from that... well, see below. >Comparing my wholly-software IMPLEMENTATION to your implementation using >hardware divide is unfair and irrelevant. I stand by my claim. Square >root is no harder or slower than division. And there is an algorithm >for square root that will beat Newton/Raphson hands down whenever both >algorithms are given similar implementations. It's neither unfair nor irrelevant. You're living in a dream world. In the real world (where the rest of us live) there's no such thing as "similar implementations". The better (which equals to faster in this case) implementation wins, regardless of whether it uses all available machine instructions or not. In the real world do you always cry, "Not fair! Not fair! You used hardware division and I didn't <stomp of feet>."? Hey, good luck to you! >But I DO care that C is more portable, and it is definitely relevant to >the comparison. Our versions were not optimized equivalently, because >you are focusing only on optimization to a particular piece of silicon >that happens to implement division, but not square root, in hardware. >Take our implementations unchanged, and run them on a PowerPC, and mine >will leave yours in the dust. ("Unchanged" means I get to compile mine >native - that's what I bought when I paid the price of being portable. >"Unchanged" means you have to have yours emulated - that's what you >bought when you decided to write processor-dependent code.) Or take >our implementations unchanged and run them on an Intel processor. Watch >mine still run at acceptable speed. Watch yours not run at all. You're being silly! I use my code as posted on a 68K machine, I use another (general) piece of code on another machine. Ever heard the word #ifdef? >I also care that mine meets the design specs. The original requestor >said he wanted to compute square root "rounded to the nearest integer". >My method rounds, and rounds correctly in every case. Yours, and all >the other methods I've seen, truncate. I can truncate too, if that's >what you want, but fixing yours to round takes a little work. Essentially, >rounding involves comparing the unrounded root with the remainder from >the last division. If you implement Newton/Raphson in C or Pascal, that >remainder is not available to you, and you have to do at least another >multiply and subtract to get it. If done in 68K Assembler, the remainder >is available, so rounding won't slow you down much, but it will complicate >your logic some, and of course there's that portability issue you don't >want to talk about. Either way, remember that rounding may produce a >result of 65536, so the routine has to return an unsigned long, just >like the requestor asked for, instead of an unsigned short. I didn't post my article as a reply to the original post, but as a proof that a carefully coded N/B-approach very well outperforms (in terms of execution time) the algebraic algorithm your code is a version of. >>[I wrote:] >>I've appended below a copy of my code and if you >>care to compare it to your code, I'll take the time to respond to >>any comments you might have (size, speed, what have you). Fair? > >Well OK, as long as you keep in mind we're comparing apples to >oranges. I think it's interesting to note that when you compared your code to Jim Cathey's code (which you erroneously thought was mine), then you thought size was important. Was that only because your code was smaller than Cathey's? Now that my code is smaller than yours, it seems that you don't think size is important no longer. Nor did you bother to compare average times. And this from someone who cries "fair" all the time... >Before looking at speed, there is the more important issue of >correctness. As nearly as I can tell, you always return the correct >value (except for not rounding). I won't quite go so far as to say the >code is all correct, because you make some early mistakes that get >compensated for later: > > o You test for a zero input by doing a BEQ.S right after moving the > input to register A0 for safekeeping. Are you aware that moving to > an A register does not affect the condition codes? Fortunately, > the condition codes are still set correctly from the prior move > of the same data from D7 to D0. However, you have a comment that > says that move can be eliminated. If someone does, they might be > in trouble. Yes this is true. That comment is really a mistake. I added it in haste when posting the code to comp.sys.m68k some while ago, when there were a discussion of fast implementations of integer square roots. The comment is faulty. The code is, and was, correct, however. > o Your logic to do large initial shifts is faulty. On some inputs (like > 0x02000000) you overshift. On other inputs (in the range 512..65535) > I think you are not shifting as far as you think you are shifting. > (The logic around label @skip8 is muddled.) Fortunately, the Newton > method is robust enough to compensate for a bad initial guess, and > you still arrive at the correct answer, although not as efficiently > as you could have. Harsh words from someone who doesn't bother to attribute code to the correct person. There is nothing wrong with the section of code you're refering to. That code is there to reduce the number of iterations in the section just below, and to _reduce_it_as_much_as_possible_on_average_ _without_over-complicating_the_code_. Now, say again, is it correct or is it faulty? If you consider it to be faulty, I'd very much like to see you post what you consider to be a correct version! >Does the fact that your code does not work the way you think it >does suggest anything? While we're being sarcastic, I'd like to ask you if the fact that you find errors that aren't there suggest anything to you? >It's hard to estimate your average running time, because there are >so many paths through the optimization and it's difficult to assign >proper weights to each path, so I'll settle for comparing worst cases. > > o Your worst case is with the input 0x02000000, for which you have to > do 4 divisions. You take a total of 994 clocks on a 68000, or > 459 clocks on a 68020. > > o My worst case is with the input 0xFFFFFFFF. I take 1362 clocks on > a 68000, 609 on a 68020. > >1362 / 994 = 1.37. You are 37% faster than I am on a 68000. >609 / 459 = 1.33. You are 33% faster than I am on a 68020. Well, since you didn't bother running the code, I did. On an IIcx I got the following results (all numbers in ticks): R = Ron's code, as posted A = Optimized assembly code equivalent to Ron's code C = My code, as posted [0..65536) [0..500000) 500000 "random" numbers R 142 1088 1135 A 118 906 944 C 95 626 473 142 / 95 = 1.49 1088 / 626 = 1.74 1135 / 473 = 2.4 These figures should be considerably larger on a 68000, if we can believe your figures. (Note that you could squeeze an extra 20% out of your code, which for what is essentially a library routine is quite large. I would use my code though.) >Truth be told, I'm not unhappy that my portable software solution fares >so well against your non-portable hardware solution, even on the hardware >of your choice. You were saying? Christer Ericson --- Internet: christer@cs.umu.se --- tel: +46-90-166794 Department of Computer Science, University of Umea, S-90187 UMEA, SWEDEN --------------------------- >From baer@qiclab.scn.rain.com (Ken Baer) Subject: How to save alpha in PICT files? Date: Sat, 28 May 1994 00:24:17 GMT Organization: SCN Research/Qic Laboratories of Tigard, Oregon. I need to save 32-bit images as PICT files that include the upper 8 bits of data (the alpha buffer). My code currently saves as 32-bit PICT files, but the upper 8-bit of data are missing! I am using the methods described in Inside Mac and developer notes. The sequence I use now is as follows: - allocate and write 32-bit data to an offscreen GWorld - OpenPicture() - CopyBits(offgworld, offgworld) (copy the pixmap to itself) - ClosePicture() Is there an extra writing step to jam in the alpha buffer? Is there any sample code (in C) with a more modern method? The sample code I'm working from is pretty dated. Thanks in advance. -Ken Baer. baer@qiclab.scn.rain.com -- \_ -Ken Baer. Programmer/Animator, Hash Inc. <[_] Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042 =# \, Animation Master: Finally, 3D animation software an artist can afford! +++++++++++++++++++++++++++ >From zapper2@netcom.com (Chris Athanas) Date: Sun, 29 May 1994 09:19:37 GMT Organization: NETCOM On-line Communication Services (408 261-4700 guest) Ken Baer (baer@qiclab.scn.rain.com) wrote: : I need to save 32-bit images as PICT files that include the upper 8 bits : of data (the alpha buffer). My code currently saves as 32-bit PICT files, : but the upper 8-bit of data are missing! I am using the methods described : in Inside Mac and developer notes. The sequence I use now is as follows: : - allocate and write 32-bit data to an offscreen GWorld : - OpenPicture() : - CopyBits(offgworld, offgworld) (copy the pixmap to itself) : - ClosePicture() : Is there an extra writing step to jam in the alpha buffer? Is there any : sample code (in C) with a more modern method? The sample code I'm working : from is pretty dated. Under system 6.0, you could use the upper 8 bits as an alpha channel. >From what I understand, that was a bug. Current PICT files only support 24-bit images. The upper 8bits are truncated, and the picture is stored as 24-bits per pixel. To associate an alpha channel with a PICT, you will need to do some custom stuff. I'm not sure what your application is, but if you are doing everything yuorself, you could store the normal picture as a 24-bit pict file (normal PICT file) and create a PICT resource in that file that contains the alpha channel. This way "normal" applications can use your data-fork PICT file normally, and you can use your alpha channel in your own apps. Also, if you are planning on using photoshop, photoshop has a buil-in filter that allows you to take a PICT resource from a file. So your users could still have access to your "custom" alpha channel in photoshop. Another alternative is to use the photoshop standard as your file format. The ps format allows for as many channels as you would like. And photoshop is a pretty standard application now, and is widely used. Hope this helps. Chris Athanas +++++++++++++++++++++++++++ >From baer@qiclab.scn.rain.com (Ken Baer) Date: Mon, 30 May 1994 17:51:58 GMT Organization: SCN Research/Qic Laboratories of Tigard, Oregon. In article <zapper2CqK4Kq.Fn2@netcom.com> zapper2@netcom.com (Chris Athanas) writes: >Ken Baer (baer@qiclab.scn.rain.com) wrote: >: I need to save 32-bit images as PICT files that include the upper 8 bits >: of data (the alpha buffer). My code currently saves as 32-bit PICT files, >: but the upper 8-bit of data are missing! I am using the methods described >: in Inside Mac and developer notes. The sequence I use now is as follows: >: - allocate and write 32-bit data to an offscreen GWorld >: - OpenPicture() >: - CopyBits(offgworld, offgworld) (copy the pixmap to itself) >: - ClosePicture() > >: Is there an extra writing step to jam in the alpha buffer? Is there any >: sample code (in C) with a more modern method? The sample code I'm working >: from is pretty dated. > >Under system 6.0, you could use the upper 8 bits as an alpha channel. >From what I understand, that was a bug. A damn useful bug. I guess 32-bit QuickDraw was a typo too :-(. >Current PICT files only support 24-bit images. The upper 8bits are >truncated, and the picture is stored as 24-bits per pixel. What's the logic behind this? Or as Seinfeld would say, "Who are the programming wizards that came up with this one?!?" >To associate an alpha channel with a PICT, you will need to do some >custom stuff. I'm not sure what your application is, but if you are doing >everything yuorself, you could store the normal picture as a 24-bit pict >file (normal PICT file) and create a PICT resource in that file that >contains the alpha channel. This way "normal" applications can use your >data-fork PICT file normally, and you can use your alpha channel in your >own apps. The application in question is a 3D renderer which normally outputs QuickTime (which has no problem dealing with real 32-bit data), but will also put out Targa, PICS, and PICT. I want PICT files with an alpha channel that can be used primarily in Photoshop, and in other applications that support 32-bit PICT files. >Also, if you are planning on using photoshop, photoshop has a buil-in >filter that allows you to take a PICT resource from a file. So your users >could still have access to your "custom" alpha channel in photoshop. Are you saying that Photoshop uses the alpha in a resource method that you described above? Is this documented anywhere? > >Another alternative is to use the photoshop standard as your file format. >The ps format allows for as many channels as you would like. And >photoshop is a pretty standard application now, and is widely used. I'd rather stick with PICT, though it is easily my leasy favorite image format. It's really a shame that the Mac standard format is also the least portable to other systems. It's a big deal to us since our application also runs on Windows and SGI. > >Hope this helps. Yes, it does. My big question is what is everyone else doing for saving alpha with PICT files? > >Chris Athanas -- \_ -Ken Baer. Programmer/Animator, Hash Inc. <[_] Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042 =# \, Animation Master: Finally, 3D animation software an artist can afford! +++++++++++++++++++++++++++ >From scottsquir@aol.com (ScottSquir) Date: 30 May 1994 18:54:02 -0400 Organization: America Online, Inc. (1-800-827-6364) In article <1994May30.175158.28837@qiclab.scn.rain.com>, baer@qiclab.scn.rain.com (Ken Baer) writes: > My big question is what is everyone else doing for saving >alpha with PICT files? Set the cmpCount in the GWorld pixmap to 4. It defaults to 3. If you resize or do other CopyBits manipulation it may get lost. -scott +++++++++++++++++++++++++++ >From Darrin.Cardani@AtlantaGA.NCR.COM (Darrin Cardani) Date: Tue, 31 May 1994 18:17:38 GMT Organization: UNITY, AT&T GIS In article <1994May30.175158.28837@qiclab.scn.rain.com> baer@qiclab.scn.rain.com (Ken Baer) writes: >Are you saying that Photoshop uses the alpha in a resource method that you >described above? Is this documented anywhere? You can get all the docs on Photoshop's format from archive.umich.edu in the mac/graphics/util directory. (Might be mac/development directory, actually, I'm not sure). >> >>Another alternative is to use the photoshop standard as your file format. >>The ps format allows for as many channels as you would like. And >>photoshop is a pretty standard application now, and is widely used. >Yes, it does. My big question is what is everyone else doing for saving >alpha with PICT files? I'm currently saving them as 2 PICT files. I've debated using 1 file with 2 PICT resources in it. I think I'll end up using the Photoshop method, because it adds usefulness to my program. Also, someone mentioned you could have as many channels as needed with the photoshop format, I believe you can only have up to 16, though. >-- > \_ -Ken Baer. Programmer/Animator, Hash Inc. ><[_] Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042 > =# \, Animation Master: Finally, 3D animation software an artist can afford! Really? What it is and can a demo be downloaded anywhere? Darrin Darrin Cardani Opinions above are mine. Mine! Mine! Mine! Darrin.Cardani@AtlantaGA.NCR.COM +++++++++++++++++++++++++++ >From a () Date: 2 Jun 1994 01:11:24 GMT Organization: Apple Computer, Inc., Cupertino, California In article <1994May30.175158.28837@qiclab.scn.rain.com>, baer@qiclab.scn.rain.com (Ken Baer) wrote: > In article <zapper2CqK4Kq.Fn2@netcom.com> zapper2@netcom.com (Chris Athanas) writes: > >Ken Baer (baer@qiclab.scn.rain.com) wrote: > >: I need to save 32-bit images as PICT files that include the upper 8 bits > >: of data (the alpha buffer). My code currently saves as 32-bit PICT files, > >: but the upper 8-bit of data are missing! I am using the methods described > >: in Inside Mac and developer notes. The sequence I use now is as follows: > >: - allocate and write 32-bit data to an offscreen GWorld > >: - OpenPicture() > >: - CopyBits(offgworld, offgworld) (copy the pixmap to itself) > >: - ClosePicture() > > > >: Is there an extra writing step to jam in the alpha buffer? Is there any > >: sample code (in C) with a more modern method? The sample code I'm working > >: from is pretty dated. > > > >Under system 6.0, you could use the upper 8 bits as an alpha channel. > >From what I understand, that was a bug. > > A damn useful bug. I guess 32-bit QuickDraw was a typo too :-(. > > >Current PICT files only support 24-bit images. The upper 8bits are > >truncated, and the picture is stored as 24-bits per pixel. > > What's the logic behind this? Or as Seinfeld would say, "Who are the > programming wizards that came up with this one?!?" > > >To associate an alpha channel with a PICT, you will need to do some > >custom stuff. I'm not sure what your application is, but if you are doing > >everything yuorself, you could store the normal picture as a 24-bit pict > >file (normal PICT file) and create a PICT resource in that file that > >contains the alpha channel. This way "normal" applications can use your > >data-fork PICT file normally, and you can use your alpha channel in your > >own apps. > > The application in question is a 3D renderer which normally outputs > QuickTime (which has no problem dealing with real 32-bit data), but will > also put out Targa, PICS, and PICT. I want PICT files with an alpha > channel that can be used primarily in Photoshop, and in other applications > that support 32-bit PICT files. > > >Also, if you are planning on using photoshop, photoshop has a buil-in > >filter that allows you to take a PICT resource from a file. So your users > >could still have access to your "custom" alpha channel in photoshop. > > Are you saying that Photoshop uses the alpha in a resource method that you > described above? Is this documented anywhere? > > > > >Another alternative is to use the photoshop standard as your file format. > >The ps format allows for as many channels as you would like. And > >photoshop is a pretty standard application now, and is widely used. > > I'd rather stick with PICT, though it is easily my leasy favorite image > format. It's really a shame that the Mac standard format is also the > least portable to other systems. It's a big deal to us since our > application also runs on Windows and SGI. > > > > >Hope this helps. > > Yes, it does. My big question is what is everyone else doing for saving > alpha with PICT files? > > > > >Chris Athanas > > > -- > \_ -Ken Baer. Programmer/Animator, Hash Inc. > <[_] Usenet: baer@qiclab.scn.rain.com/AppleLink:KENBAER/Office: (206)750-0042 > =# \, Animation Master: Finally, 3D animation software an artist can afford! The technique is described in IM VI pp 17-22 through 17-23; basically the idea is that before calling CopyBits you need to set the packType field in the source pixmap to four and the cmpCount to 4 also. Hope this really helps. Guillermo --------------------------- >From wingo@apple.com (Tony Wingo) Subject: Open Transport & ASLM Date: Thu, 2 Jun 1994 18:06:30 GMT Organization: Apple Computer The following is a response from the Open Transport Team to the recent threads on OpenTransport, ASLM and DLL's in general. Comments may be sent to opentpt@applelink.apple.com ===================================================================== When Open Transport was developed, ASLM was the only general option for dynamic linking - CFM and SOM were not anywhere in sight. In fact, today we still can only use CFM for dynamic linking on PowerPC. It's not available for 68K, and SOM is not available for either platform. Whether or not ASLM is used as the dynamic linking mechanism, the fact that C++ objects generated by different compilers have different formats makes it virtually impossible to export object-oriented interfaces that everyone can use. Yes - SOM will solve this problem, but that solution does not come without a performance penalty - parameters have to be marshalled, and solving the fragile base class problem requires a much more indirect method of dispatching than a jump through a vtable. We really wanted to be able to create an object-oriented framework for writing STREAMS modules, and ASLM allowed us to do that without a performance penalty. Now that CFM and SOM are either here or on the horizon, we've backed off of that goal. >From a client perspective, Open Transport supports a complete procedural interface using Pascal calling conventions for use in the widest range of development environments possible. It also has a C++ object interface, which is usable with MPW Cfront, and with the PPCC Script/tools for PowerPC. We believe that the C++ interface is also usable from Symantec C++ for 68K, but this has not been tested. We have also not tested with the Metrowerks products, but are assuming that, like Symantec, they have a way to import MPW .o and XCOFF files. ASLM is currently used as the dynamically loading mechanism for both client code and STREAMS modules, but neither version REQUIRES using C++ interface. When SOM becomes available as part of the system, we will look at converting our C++ client interfaces to become SOM objects. The procedural interfaces for STREAMS modules will remain ASLM-based for 68K and CFM-based for PowerPC. The bottom line is: If you can't get Open Transport client files to link with your tools using our "C" interfaces, then either your tools don't support MPW-generated libraries, your tools can't deal with Pascal calling conventions, or the Open Transport team has a bug. We have not yet published any information on how to write a STREAM module for Open Transport. As soon as we do, it will tell you how to create both ASLM and CFM-based STREAM modules. I hope this clears up any misunderstandings on using Open Transport. --------------------------- >From kennedy@fenton.cs.udcavis.edu (Kennedy) Subject: Sending the Mac Screen Image Date: Thu, 2 Jun 1994 01:56:33 GMT Organization: University of California, Davis This might be too involved of a question for this news group But I was wondering if anyone had any Ideas. What I need to do is send screen image of a Macintosh so that it can be viewed in all its glory in an X window. The screen should refresh flicker free and should look to the viewer of the X window as is he/she was looking at the Mac monitor. This has to be done using TCP/IP. If you have any Ideas, example code, algorithms, or whether you think it can be done or not, let me know. Thanx. Brian Kennedy (kennedy@fenton.cs.ucdavis.edu) +++++++++++++++++++++++++++ >From rmah@panix.com (Robert S. Mah) Date: Thu, 02 Jun 1994 01:22:22 -0500 Organization: One Step Beyond kennedy@fenton.cs.udcavis.edu (Kennedy) wrote: > This might be too involved of a question for this news group > But I was wondering if anyone had any Ideas. What I need to do is > send screen image of a Macintosh so that it can be viewed in all > its glory in an X window. The screen should refresh flicker free > and should look to the viewer of the X window as is he/she was > looking at the Mac monitor. This has to be done using TCP/IP. > If you have any Ideas, example code, algorithms, or whether > you think it can be done or not, let me know. Thanx. Since you said "in all its glory" I take it you mean color. Let's see...640x480 minimum screen size means 300K per frame. At 24 fps, that's around 7MBytes per second. Too much. You'll have to send screen differences (i.e. just what changed on the screen) to get a decent frame rate. Keep a copy of the screen image on the Mac, note the changes, send it via TCP/IP. Receive it on the UNIX machine. Have the X Server send it to the X Client. You could also patch all of QuickDraw and send the QD command opcodes to the UNIX box. There, you'd have to write a QD emulator and execute the opcodes as they come in. This would be very hard to write, but would work faster than the screen difference method. Have you considered just buying it. I think there are a few products that do this out there. Cheers, Rob ___________________________________________________________________________ Robert S. Mah -=- One Step Beyond -=- 212-947-6507 -=- rmah@panix.com +++++++++++++++++++++++++++ >From David K}gedal <davidk@lysator.liu.se> Date: 2 Jun 1994 11:44:58 GMT Organization: Lysator Academic Computer Society In article <rmah-020694012222@rmah.dialup.access.net> Robert S. Mah, rmah@panix.com writes: >You'll have to send screen differences (i.e. just what changed on the >screen) to get a decent frame rate. Keep a copy of the screen image on >the Mac, note the changes, send it via TCP/IP. Receive it on the UNIX >machine. Have the X Server send it to the X Client. Oops, this is a very common confusion. It is of course the X client that will send it to the X server. The server is the one that controls the actual screen and does the drawing that the clients tells it to do. /David +++++++++++++++++++++++++++ >From d88-jwa@mumrik.nada.kth.se (Jon Wätte) Date: 2 Jun 1994 12:56:48 GMT Organization: The Royal Institute of Technology In <rmah-020694012222@rmah.dialup.access.net> rmah@panix.com (Robert S. Mah) writes: >Let's see...640x480 minimum screen size means 300K per frame. At 24 fps, >that's around 7MBytes per second. Too much. You only need send what's drawn, and you can compress it while you send. There are commercial apps that do EXACTLY this. There are two mechanisms you can use; patching the QuickDraw bottlenecks, or patching ShieldCursor(). QuickDraw does all drawing through bottlenecks, so if you want object graphics, send the bottleneck commands across the network. You have to send port information as well, I guess. QuickDraw also calls ShieldCursor() for the enclosing rectangle of what it draws, and ShowCursor() when it's done, so if you only want a bitmap, use that area, and send the (packed) bitmap once drawing is done (ShowCursor() time) or accumulate it all into a region which you send at WaitNextEvent time. -- -- Jon W{tte, h+@nada.kth.se, Mac Software Engineer Deluxe -- This signature is kept shorter than 4 lines in the interests of UseNet S/N ratio. +++++++++++++++++++++++++++ >From rmah@panix.com (Robert S. Mah) Date: Thu, 02 Jun 1994 11:40:51 -0500 Organization: One Step Beyond David K}gedal <davidk@lysator.liu.se> wrote: > Robert S. Mah, rmah@panix.com writes: > >You'll have to send screen differences (i.e. just what changed on the > >screen) to get a decent frame rate. Keep a copy of the screen image on > >the Mac, note the changes, send it via TCP/IP. Receive it on the UNIX > >machine. Have the X Server send it to the X Client. > > Oops, this is a very common confusion. It is of course the X client that > will send it to the X server. The server is the one that controls the > actual screen and does the drawing that the clients tells it to do. Right. I don't do X, so I'm more than a bit hazy about this stuff, but after you mentioned it, I did recall reading an article about this and saying to myself, "That's supposed to make sense?" So, to correct myself...the X-client would ask the X-server to draw the stuff. Hey...couldn't you simply put the X-client on the Mac? Could be an interesting academic project. Cheers, Rob ___________________________________________________________________________ Robert S. Mah -=- One Step Beyond -=- 212-947-6507 -=- rmah@panix.com --------------------------- >From jaffe@rabbit.cita.utoronto.ca (Andrew Jaffe) Subject: What are Universal Headers? Date: Thu, 19 May 1994 13:49:41 GMT Organization: CITA That's about it: What are the Universal Headers? AJ -- - -------------------------------------------------------------------- Andrew Jaffe jaffe@cita.utoronto.ca CITA, U. Toronto (416) 978-8497, 6879 60 Saint George St. -3921 (Fax) Toronto, ON M5S 1A1 CANADA +++++++++++++++++++++++++++ >From dean@genmagic.com (Dean Yu) Date: 19 May 1994 17:20:31 GMT Organization: General Magic, Inc. In article <JAFFE.94May19094941@rabbit.cita.utoronto.ca>, jaffe@rabbit.cita.utoronto.ca (Andrew Jaffe) wrote: > > That's about it: > What are the Universal Headers? > Universal Headers are the twisted idea of a couple of system software engineers who have since left the company to try to avoid the proliferation of interface files as the Mac API was going cross platform. That is, rather than having a set of header files you compile with if you were doing say a PowerPC application, and different set you use if you were doing a 68K application, you would have one set that can be used for any platform that supplied the Mac API by flipping a couple of compile time switches. Like any grand unifying theory, it was very elusive, long in the coming, and probably a little less all-encompassing than it should have been. -- Dean Yu Negative Ethnic Role Model General Magic, Inc. +++++++++++++++++++++++++++ >From Lars.Farm@nts.mh.se (Lars Farm) Date: Fri, 20 May 1994 09:43:17 +0100 Organization: Mid Sweden University In article <dean-190594101355@dean_yu.genmagic.com>, dean@genmagic.com (Dean Yu) wrote: > In article <JAFFE.94May19094941@rabbit.cita.utoronto.ca>, > jaffe@rabbit.cita.utoronto.ca (Andrew Jaffe) wrote: > > > > That's about it: > > What are the Universal Headers? > > > > Universal Headers are the twisted idea of a couple of system software [...] > Like any grand unifying theory, it was very elusive, long in the coming, > and probably a little less all-encompassing than it should have been. Still, it is very nice to have _one_ set of Mac headers, after all there is only one API. Now it's a lot simpler to move code between compilers. This can't be bad. -- Lars.Farm@nts.mh.se +++++++++++++++++++++++++++ >From slosser@apple.com (Eric Slosser) Date: Sat, 28 May 1994 21:14:40 GMT Organization: Apple Computer, Inc. > In article <dean-190594101355@dean_yu.genmagic.com>, dean@genmagic.com > (Dean Yu) wrote: > > > > > Universal Headers are the twisted idea of a couple of system software > [...] > > Like any grand unifying theory, it was very elusive, long in the coming, > > and probably a little less all-encompassing than it should have been. Well, with the kind of people they had working on the project, what did you expect, Dean? PS. Don't forget the really novel idea of having the C, Pascal, and Asm headers auto-generated from a master file, rather than maintained by hand. -- Eric Slosser / Apple Computer / slosser@apple.com +++++++++++++++++++++++++++ >From jum@anubis.han.de (Jens-Uwe Mager) Date: Thu, 2 Jun 94 01:05:11 MET Organization: (none) In article <slosser-280594131440@mac96.kip.apple.com> (comp.sys.mac.programmer), slosser@apple.com (Eric Slosser) writes: > PS. Don't forget the really novel idea of having the C, Pascal, and Asm > headers auto-generated from a master file, rather than maintained by hand. Hmmm, how does your specification language look like? ______________________________________________________________________________ Jens-Uwe Mager jum@anubis.han.de 30177 Hannover jum@helios.de Brahmsstr. 3 Tel.: +49 511 660238 --------------------------- End of C.S.M.P. Digest **********************