xBRZ - the ultimate xBR style image upscaler/filter
- Version 1.1

This is a collection of rpi-plugins for the Kega Fusion Emulator.
Windows only.
Modified code by "milo1012" (milo1012 AT freenet DOT de)

All original code by "Zenju" at the HqMAME project.
Take a look at
http://sourceforge.net/projects/hqmame
for sample pictures and more info.

xBRZ is in general more detail-preserving than xBR
and essentially better than the HQx filters!


Plugin Versions
---------------

4xBRZ.rpi      4xBRZ - normal version - scales to e.g. 1280x960 (Mega Drive - PAL)
4xBRZ-MT.rpi   4xBRZ - (multi-)threaded version
3xBRZ.rpi      3xBRZ - normal version - scales to e.g. 960x720
3xBRZ-MT.rpi   3xBRZ - (multi-)threaded version
2xBRZ.rpi      2xBRZ - normal version - scales to e.g. 640x480
2xBRZ-MT.rpi   2xBRZ - (multi-)threaded version

The normal version does everything in a single thread, like most other plugins.
I recommend a recent-generation CPU at around 3 GHz to get 50/60 FPS with the 4x version.
The 3x and 2x versions will probably get by with a much slower CPU.

The threaded version scales the image in 4 slices, each in a separate thread,
which provides a decent speedup for all systems with at least two CPU cores,
especially for first generation Dual Core CPUs.
(Athlon 64 X2, first/second-generation Intel Core and similar)
Avoid the threaded version on Single Core systems:
you won't get any speedup, and more likely even lower performance.
I've tested the threaded version on Kega quite thoroughly,
but I can't guarantee that it's completely bug-free.
You can also use the plugins with VBA-M. The threaded version
seems to work there now (Version 1.1), but I did not test it much.
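
The slicing idea can be sketched roughly as follows. This is only an
illustration and not the plugin's actual code: scaleRows() and
scaleThreaded() are hypothetical names, and scaleRows() does plain
nearest-neighbour scaling as a stand-in for the real xBRZ scaler.
A real slice would also need to read a few source rows beyond its own
borders, which this stand-in does not.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real scaler (NOT the xBRZ algorithm):
// nearest-neighbour upscaling of the source rows [yFirst, yLast).
void scaleRows(const std::uint32_t* src, std::uint32_t* dst,
               int srcWidth, int factor, int yFirst, int yLast)
{
    const std::size_t dstWidth = static_cast<std::size_t>(srcWidth) * factor;
    for (int y = yFirst; y < yLast; ++y)
        for (int dy = 0; dy < factor; ++dy)
        {
            // Start of the output row that belongs to source row y.
            std::uint32_t* out = dst + (static_cast<std::size_t>(y) * factor + dy) * dstWidth;
            for (int x = 0; x < srcWidth; ++x)
                std::fill_n(out + static_cast<std::size_t>(x) * factor, factor,
                            src[static_cast<std::size_t>(y) * srcWidth + x]);
        }
}

// Split the source image into horizontal slices and scale each one in its
// own thread. The slices write to disjoint output rows, so no locking is needed.
void scaleThreaded(const std::uint32_t* src, std::uint32_t* dst,
                   int srcWidth, int srcHeight, int factor, int sliceCount = 4)
{
    std::vector<std::thread> workers;
    for (int i = 0; i < sliceCount; ++i)
    {
        const int yFirst = srcHeight * i / sliceCount;
        const int yLast  = srcHeight * (i + 1) / sliceCount;
        workers.emplace_back(scaleRows, src, dst, srcWidth, factor, yFirst, yLast);
    }
    for (std::thread& t : workers)
        t.join(); // wait until every slice is finished
}
```

On a Single Core system the four threads just take turns on the same core,
so the thread overhead is paid without any gain.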

The dir "RLUT" contains the same six versions, but compiled with a reverse color-LUT.
This provides an additional ~5 % speed increase.
(at least on the "modern" systems I tested, may be less for older systems)
The drawback is that this requires 64 MiB memory for each plugin version loaded,
which would be 6*64 = 384 MiB when all six plugin versions are loaded by Kega!
(not sure about VBA-M)
I recommend using only your "main" plugin version in the RLUT variant,
to keep the memory footprint reasonable (especially on low-mem systems),
and the remaining ones (if you still want or need them) in the normal variant.
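
The memory figures above can be checked with a little arithmetic.
The sketch below assumes one possible table layout that matches the
64 MiB figure: one 32-bit entry for each of the 2^24 possible RGB24 colors.
That layout, the RGB565 target format and the name buildReverseLut()
are assumptions for illustration, not taken from the plugin source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kColors24 = std::size_t(1) << 24;  // all RGB24 colors
constexpr std::size_t kMiB = std::size_t(1024) * 1024;

// Size of a table with one entry per RGB24 color.
constexpr std::size_t reverseLutBytes(std::size_t bytesPerEntry)
{
    return kColors24 * bytesPerEntry;
}

// Assumed layout: 4 bytes per entry -> 64 MiB per table, 384 MiB for six.
static_assert(reverseLutBytes(4) == 64 * kMiB, "one table is 64 MiB");
static_assert(6 * reverseLutBytes(4) == 384 * kMiB, "six tables are 384 MiB");

// Hypothetical builder: maps 0x00RRGGBB to a 16-bit RGB565 value, so the
// conversion back to 16 bit becomes a single table read per pixel.
std::vector<std::uint32_t> buildReverseLut()
{
    std::vector<std::uint32_t> lut(kColors24);
    for (std::uint32_t c = 0; c < kColors24; ++c)
        lut[c] = ((c >> 8) & 0xF800u)   // top 5 bits of red
               | ((c >> 5) & 0x07E0u)   // top 6 bits of green
               | ((c >> 3) & 0x001Fu);  // top 5 bits of blue
    return lut;
}
```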


Performance
-----------

On my main system (Intel Core i5 3200 MHz)
the normal 4x version runs at full speed on Kega.
With an older 2000 MHz Core 2 CPU I get about 30-40 FPS,
but the threaded version is able to run at a constant 60 FPS.
Below 2 GHz some detailed game scenes might bring
the FPS to <60 for a small amount of time.
A good example for that is the start of the Sonic 2 "Aquatic Ruin Zone"
(just fast forward the Sonic 2 automatic demo to see it).
Lots of detail in the current picture/scene = low performance!

Why is it so slow?
First of all, it is a computationally heavy algorithm.
Second, Kega uses a 16 (or 15) bit color depth internally (prior to final output),
so the plugin must convert to RGB24, do the scaling and convert back to 16 bit.
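
As a rough illustration of that round trip, here is a minimal sketch
of a 16 bit -> RGB24 lookup table and the conversion back. It assumes
the RGB565 layout (the 15 bit case would need different shifts), and
the function names are made up for this example, not taken from the source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand one RGB565 pixel to 0x00RRGGBB. The high bits are replicated into
// the low bits so that full white stays full white.
inline std::uint32_t rgb565To888(std::uint16_t c)
{
    const std::uint32_t r = (c >> 11) & 0x1F;
    const std::uint32_t g = (c >> 5) & 0x3F;
    const std::uint32_t b = c & 0x1F;
    return (((r << 3) | (r >> 2)) << 16)
         | (((g << 2) | (g >> 4)) << 8)
         |  ((b << 3) | (b >> 2));
}

// Reduce 0x00RRGGBB back to RGB565 by keeping the top bits of each channel.
inline std::uint16_t rgb888To565(std::uint32_t c)
{
    return static_cast<std::uint16_t>(((c >> 8) & 0xF800u)
                                    | ((c >> 5) & 0x07E0u)
                                    | ((c >> 3) & 0x001Fu));
}

// Precompute the expansion for all 65536 possible 16-bit values.
// 65536 entries * 4 bytes = 256 KiB.
std::vector<std::uint32_t> buildColorLut()
{
    std::vector<std::uint32_t> lut(std::size_t(1) << 16);
    for (std::uint32_t i = 0; i < lut.size(); ++i)
        lut[i] = rgb565To888(static_cast<std::uint16_t>(i));
    return lut;
}
```

With such a table, the conversion to RGB24 costs one table read per
source pixel instead of several shifts and masks.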

If you have performance problems, especially in the 4x version, use the 3x or 2x
version and let Kega's internal filter do the remaining upscaling to your final
screen resolution (with a slightly worse-looking result, of course),
or just use the threaded version if you have an older Multi Core system.

Feel free to modify the source code to improve the speed for older systems.


Changes
-------

Version 1.1 (Apr. 17, 2014)
- further speedups due to more code optimizations (see the source for details)
  -> around 15 % faster on most machines
- removal of conversion buffers (were ~10 MiB for the 4x version)
  and now using a color lookup-table instead (< 1 MiB memory)
- output is now copied line-wise directly after computation
  -> seems to fix threaded version for VBA-M
- added reverse color-LUT variant for an additional ~5 % speed increase,
  but requires 64 MiB memory for each plugin, therefore optional
- compiled with TDM-GCC (slight speed increase and now requires i686 compatible CPU)

Visual C++ compilers work fine too. I recommend the old Visual C++ .NET 2003,
because it produces faster code than newer versions
like 2005/08/10/12, probably due to some SIMD/x64 optimizations in those.
SSE is not required. (True SSE instructions do not seem to work for Kega anyway.)


Version 1.0 (Dec. 2013)
- first release

