A test program that’s about 35% complete. It will be a performance test for plain LÖVE draw commands, SpriteBatches and instanced mesh draw calls.

EDIT 2 Jan 2022: I think I greatly misunderstood what love.graphics.drawInstanced() does, so ignore what I wrote about it here.

EDIT 4 Jan 2022: So there wasn’t a misunderstanding after all. Please ignore the above message.

Happy New Year! This is going to be a pretty dry post: I’ve got no footage and very few screenshots / diagrams to show. There’s no new in-game progress to demonstrate because I got sucked into a winding, domino-like progression which started with a failed attempt to rewrite how sprites are structured, and has led to a still-in-progress rewrite of the sprite drawing code.

December Summary

Tested, and then reverted SpriteBatch range culling on tilemaps.
Revisited input handling. Unified keyboard and gamepad state.
Replaced several round-robin asset pools with simpler stack structures.
Began rewrite of sprite rendering to use manual SpriteBatches and attribute meshes instead of autobatching and shader uniforms. Some issues popped up and I took a step back to work on test programs.

setDrawRange()

(NOTE: I wrote this section near the beginning of the month. Since then, I’ve upgraded LÖVE to an 11.4 artifact which has a performance fix in a somewhat related area, so this might be outdated. I thought I had saved a snapshot of the project in case there was a need to revisit the idea, but it looks like I saved the project before making these changes and after removing them. Gah!)

I looked into a possible quick optimization for drawing SpriteBatch tilemaps. By assigning tiles to the SpriteBatch in row-major or column-major order, the setDrawRange() method could be used to cull vertices on two sides of the viewport: column-major order keeps a contiguous run of whole columns, culling tiles to the left and right of the view, while row-major does the same with rows, culling above and below. Column-major would be used for wide maps, and row-major for tall maps. (A square map would default to row-major, for no particular reason other than it has to be one or the other, and the viewport is shorter on the vertical dimension.) While it won’t get rid of all tiles that are out of view, it could still drop between 50% and 75% of them in a typical medium-sized room, and more for larger rooms. Below is an animated diagram:

Top: column-major for horizontal rooms. Bottom: row-major for vertical and square rooms. Note that the quantity of tiles here is reduced for demonstration purposes. A typical in-game map would have 2500 tiles (i.e. 100×25) or more.

This method has good coverage for wide and tall maps, but would have quite a bit of waste in a large square map. Luckily, most in-game rooms at this point aren’t shaped like this.

This method’s ability to cull in large square rooms is limited.
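Here is a minimal sketch of the range calculation for the column-major case, which culls columns to the left and right of the viewport. It assumes every map cell has exactly one sprite, added to the batch column by column, and 1-based tile coordinates; the function names are hypothetical. SpriteBatch:setDrawRange(start, count) takes a 1-based index of the first sprite and a sprite count. The row-major case is the same with rows and columns swapped.

```lua
-- Hypothetical helper: the contiguous sprite range covering the visible
-- columns of a tilemap whose tiles were added in column-major order.
local function columnMajorRange(mapW, mapH, firstCol, lastCol)
    -- Clamp the visible columns to the map bounds.
    if firstCol < 1 then firstCol = 1 end
    if lastCol > mapW then lastCol = mapW end
    if lastCol < firstCol then
        return nil -- no columns visible
    end
    -- Column c occupies sprites (c - 1) * mapH + 1 .. c * mapH.
    local start = (firstCol - 1) * mapH + 1
    local count = (lastCol - firstCol + 1) * mapH
    return start, count
end

-- Hypothetical helper: convert a horizontal pixel span to 1-based tile columns.
local function visibleColumns(camX, viewW, tileW)
    local first = math.floor(camX / tileW) + 1
    local last = math.floor((camX + viewW - 1) / tileW) + 1
    return first, last
end

-- Example: a 100x25 map of 16px tiles, 400px-wide viewport at x = 800.
local first, last = visibleColumns(800, 400, 16)
local start, count = columnMajorRange(100, 25, first, last)
-- In LÖVE this would then be: batch:setDrawRange(start, count)
```

In this example 625 of the 2500 sprites are drawn, so three quarters of the map is culled.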

Here’s the fun part: on average, it runs slower than just drawing batches with the full range. Even if the viewport is only displaying 2% of all tiles. At first, I thought some background processes on my PC might be interfering with the benchmarks. I also tried caching the calculated draw range and only calling SpriteBatch:setDrawRange() when it changes. The only way it beats the old drawing function is if the map is extremely large, far beyond the size of a typical map for this game.
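The caching variant can be sketched as a small wrapper that remembers the last range and only forwards the call when it changes. The wrapper is hypothetical, and a stand-in object that counts calls takes the place of a real SpriteBatch so the sketch runs outside LÖVE.

```lua
-- Wraps a batch so setDrawRange() is only called when the range changes.
-- 'batch' is anything with a setDrawRange(start, count) method.
local function makeCachedRange(batch)
    local lastStart, lastCount = nil, nil
    return function(start, count)
        if start ~= lastStart or count ~= lastCount then
            batch:setDrawRange(start, count)
            lastStart, lastCount = start, count
            return true  -- range was updated
        end
        return false     -- unchanged, call skipped
    end
end

-- Stand-in batch that only counts calls (illustration only).
local fakeBatch = { calls = 0 }
function fakeBatch:setDrawRange(start, count)
    self.calls = self.calls + 1
end

local setRange = makeCachedRange(fakeBatch)
setRange(1251, 625)
setRange(1251, 625) -- same range on the next frame: no call
setRange(1276, 625)
```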

Either the expression I use to calculate the draw range offsets is really inefficient, or just the act of calling setDrawRange() introduces some sort of additional overhead. Or both. In most scenarios, these would only be executed between 1 and 15 times per frame, so I don’t understand how either could have that much impact.

All things considered, the difference is negligible. Most recorded framerates were between 575 and 615 FPS, and there was some variance in the output between runs, but the culling versions never reached the upper range of the non-culling ones. So I reverted to drawing the full range. ~~I’ll keep a snapshot of the project with it implemented just in case something changes in the future (different GPU/driver, changes in LÖVE and/or LuaJIT, etc.)~~ Oops, I did not do that.

Cleaning Up Input Handling

I haven’t touched the input code in a long time. I recently wanted to add keybind support to the overlay console, and also add virtual combined buttons for keys which appear twice on the standard layout: shift, alt, ctrl, enter, and gui. I ran into some issues related to how press and release events are represented internally.

LÖVE provides two ways to get user input: callbacks and isDown functions. The callbacks are processed in love.run() at the beginning of the frame, and multiple press and release events for one key/button can be queued in a single frame. isDown is called in project code and returns the most recent state of the key, so it’s possible for it to miss events if the framerate is low. For this reason, the callbacks are recommended over isDown.

For a large game, it’s not practical to put all game logic which acts on button press/release events inside the callback functions. Therefore, the callbacks update a set of internal tables representing the pressed state of each button, and then the game logic accesses those tables, either directly or through functions.
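A minimal sketch of the callback-to-table approach, assuming the table stores the held state plus per-frame "pressed" and "released" flags, so a press and release that land in the same frame are both seen. All names are hypothetical; in LÖVE, onPress/onRelease would be called from love.keypressed(), love.keyreleased(), love.gamepadpressed() and friends, and endFrame() after the game logic has run.

```lua
-- Hypothetical input state, keyed by button ID.
local held, pressed, released = {}, {}, {}

-- Called from the input callbacks; every queued event is recorded.
local function onPress(id)
    held[id] = true
    pressed[id] = true
end

local function onRelease(id)
    held[id] = false
    released[id] = true
end

-- Game logic reads the state through small accessors.
local function isHeld(id) return held[id] == true end
local function wasPressed(id) return pressed[id] == true end
local function wasReleased(id) return released[id] == true end

-- Clear the one-frame flags once game logic has run.
local function endFrame()
    for id in pairs(pressed) do pressed[id] = nil end
    for id in pairs(released) do released[id] = nil end
end

-- A tap that starts and ends within a single frame:
onPress("space")
onRelease("space")
```

An isDown()-style poll at the end of that frame would report the key as up and miss the tap entirely, while wasPressed() still sees it.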

Up to this point, each kind of input was kept in a separate table: one for keyboard keyconstants, one for keyboard scancodes, one for gamepad buttons (and axes mapped to virtual buttons), and one for the mouse. KeyConstants vary depending on the user’s keyboard and OS settings, while scancodes are based on where the buttons would appear on the US QWERTY layout. If you switched from QWERTY to Dvorak, the keyconstants would move around and the scancodes would stay in place.

The problem with using separate tables is that it’s not convenient to check the state of a button which could be any of those four types without additional information. The string “a” could refer to “a” the keyconstant, “a” the scancode, or “a” the gamepad face button.

The solution was basically already in the game’s config file system: differentiate between the types by putting “kc_”, “sc_” or “gp_” in front of the identifier. So I merged keyconstants, scancodes and gamepad buttons into one state table, and I changed the callbacks to assign prepended versions of their string IDs using lookup tables. (I haven’t dealt with the mouse yet since it’s quite different from the others, and not currently used in this project.)
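The prefixing step could look something like this: memoized lookup tables map each raw ID to its prefixed form, so the string concatenation only happens the first time a given ID is seen. The table and function names are hypothetical; the "kc_", "sc_" and "gp_" prefixes are the ones described above. In LÖVE, love.keypressed(key, scancode, isrepeat) receives both the keyconstant and the scancode, and love.gamepadpressed(joystick, button) receives the gamepad button.

```lua
-- Builds a table mapping rawId -> prefix .. rawId, creating each entry once.
local function makePrefixLookup(prefix)
    return setmetatable({}, {
        __index = function(t, rawId)
            local full = prefix .. rawId
            rawset(t, rawId, full) -- cache for later lookups
            return full
        end,
    })
end

local kcLookup = makePrefixLookup("kc_") -- keyconstants
local scLookup = makePrefixLookup("sc_") -- scancodes
local gpLookup = makePrefixLookup("gp_") -- gamepad buttons

-- One merged state table for all three input types.
local buttonState = {}

local function keyPressed(key, scancode)
    buttonState[kcLookup[key]] = true
    buttonState[scLookup[scancode]] = true
end

local function gamepadPressed(button)
    buttonState[gpLookup[button]] = true
end

gamepadPressed("a") -- the gamepad face button, stored as "gp_a"
```

With the prefixes, "a" the face button and "a" the keyconstant no longer collide. On an AZERTY keyboard, for example, keyPressed("a", "q") sets both "kc_a" and "sc_q".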

This had a beneficial side effect of removing many input-reading functions which were practically the same but which targeted different input types. Now, I only need a set of functions to read from a generic button state table. The only downside is that I have to use those prepended versions of the IDs (‘kc_q’ instead of just ‘q’).

Round-Robin Pools to Stacks

Early on, I implemented a pool system for storing and giving out tables in a cyclical fashion. It was (and still is) used for actors, where a stable index to reference them while arbitrarily adding and removing them from a scene was a useful feature. It also allowed scenes to maintain their own pools of actors, so that there (hopefully) wouldn’t be cross-contamination of corrupted tables between scenes.

As I added more features to the engine, I decided to use this pooling system to recycle other assets as well. But those new things — sprites, hitboxes, generic array tables, “actor target/reference tables” — don’t require an unchanging index. Table corruption is also less of a problem now that I have write-guard functions assigned to __newindex metamethods. I realized this was nonsense when I considered adding a pool for LÖVE Quads and saw how much overhead is involved.

It’s still worthwhile to borrow and release assets instead of creating and discarding them every time. Reinitializing an asset is almost always faster than creating a new one, and not discarding them means the garbage collector doesn’t have to clean them up. Even clearing out all fields in an array table and reusing it can be faster than creating a fresh table, provided the number of fields is small.

The simplest replacement I could think of is to make a global stack object for each type of asset. During startup, prepopulate the stack with a bunch of assets that are ready to go. To get an asset, assign the result of stack:pop() to a variable. To release an asset, call stack:push(asset) and nil out the variable reference. When pop() is called on an empty stack, it returns a newly-created asset. If stack:push(asset) would make the stack greater than a configured cutoff value, then the asset is discarded and will eventually be destroyed by the garbage collector.
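A minimal sketch of such a stack, using named fields instead of the numeric-index layout. The constructor and option names (create, clean, init, teardown, cutoff) are hypothetical; prepopulation, create-on-empty and the cutoff follow the description above, while the overflow check is left out for brevity.

```lua
local function noop() end

local Stack = {}
Stack.__index = Stack

-- opts.create():        returns a new asset (required)
-- opts.clean(asset):    resets an asset being pushed back (optional)
-- opts.init(asset):     prepares an asset being popped (optional)
-- opts.teardown(asset): runs on an asset being discarded (optional)
-- opts.cutoff:          maximum number of assets to keep
function Stack.new(opts, prepopulate)
    local self = setmetatable({
        n = 0,
        create = opts.create,
        clean = opts.clean or noop,
        init = opts.init or noop,
        teardown = opts.teardown or noop,
        cutoff = opts.cutoff or math.huge,
    }, Stack)
    for _ = 1, prepopulate or 0 do
        self:push(self.create())
    end
    return self
end

function Stack:pop()
    local asset
    if self.n > 0 then
        asset = self[self.n]
        self[self.n] = nil
        self.n = self.n - 1
    else
        asset = self.create() -- empty stack: make a fresh asset
    end
    self.init(asset)
    return asset
end

function Stack:push(asset)
    if self.n >= self.cutoff then
        self.teardown(asset) -- over the cutoff: leave it to the GC
        return
    end
    self.clean(asset)
    self.n = self.n + 1
    self[self.n] = asset
end

-- Example: a pool of array tables, cleared when returned.
local arrays = Stack.new({
    create = function() return {} end,
    clean = function(t) for i = #t, 1, -1 do t[i] = nil end end,
    cutoff = 4,
}, 2)

local t = arrays:pop()
t[1], t[2] = "x", "y"
arrays:push(t); t = nil
```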

I implemented this as a Lua array table whose metatable provides the push and pop methods through __index. The first elements are reserved for object state:

[1] Stack counter (7 means empty, since assets start at [8])
[2] Function: create a new asset
[3] Function: clean a pushed (returning) asset
[4] Function: initialize a popped (pre-existing, outgoing) asset
[5] Function: tear-down an asset which is being discarded
[6] Cutoff point - pushed assets above this value are discarded
[7] Stack overflow point - crash if the stack reaches this size
[8..n] Actual assets start here.

I don’t know if I’ll keep it this way. It’s horrible to read. (I’m using a preprocessor to replace the numeric indexes with terms that are easier to understand.) A more sensible way would be to separate the stack state from the config fields with a sub-table:

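local function dummyFunc() end -- placeholder callback

-- push()/pop() would live in this metatable's __index table.
local _mt_stack = { __index = {} }

local stack = {
    array = {},              -- the stored assets
    stack_counter = 0,       -- number of assets currently in array
    fn_create = dummyFunc,   -- create a new asset
    fn_push = dummyFunc,     -- clean a pushed (returning) asset
    fn_pop = dummyFunc,      -- initialize a popped (outgoing) asset
    fn_teardown = dummyFunc, -- tear down an asset being discarded
    cutoff_point = (2^32),   -- pushed assets above this count are discarded
    overflow_point = (2^32), -- crash if the stack reaches this size
}
setmetatable(stack, _mt_stack) -- stack:push(), stack:pop() implemented here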

Cramming it all into one table would save one table lookup. I don’t know. I guess it doesn’t matter so long as the internals are kept hidden from the things that call it.

Fixing Batching

When I added the sprite shader, I implemented per-sprite shader parameters as uniforms. This worked, but it broke LÖVE’s automatic batching (where multiple draw commands are combined into one draw call to the GPU), since sending new uniform values between draws forces the pending batch to be flushed. Batched draw calls are highly desirable because the GPU can do a large number of tasks in parallel without waiting for further input from the CPU.

In shader code, uniforms are treated like constants, and I believe there’s a limit to how many you can expose at a time. Another way to pass information to the shader is to render sprites with a SpriteBatch and attach a mesh with custom vertex attributes to it. In the vertex shader, the per-vertex attributes are read as attribute variables (in variables in GLSL3), and the vertex shader can in turn pass them to the fragment shader as varying variables (out/in in GLSL3). So: update the SpriteBatch, update the associated mesh, and then render it in one draw call.
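A sketch of the SpriteBatch plus attribute mesh setup, assuming LÖVE 11.x. The attribute name SpriteState, its four float components, the flash effect and the helper name are all hypothetical. The vertex shader reads the attribute and forwards it to the pixel shader through a varying. The LÖVE calls sit inside a function that is only meant to be called from love.load(), so the file also loads under plain Lua.

```lua
-- Vertex format for the attached mesh: one hypothetical vec4 per vertex.
local SPRITE_STATE_FORMAT = {
    { "SpriteState", "float", 4 },
}

local VERTEX_CODE = [[
attribute vec4 SpriteState;   // per-vertex data from the attached mesh
varying vec4 vSpriteState;    // handed on to the pixel shader

vec4 position(mat4 transform_projection, vec4 vertex_position)
{
    vSpriteState = SpriteState;
    return transform_projection * vertex_position;
}
]]

local PIXEL_CODE = [[
varying vec4 vSpriteState;

vec4 effect(vec4 color, Image tex, vec2 texture_coords, vec2 screen_coords)
{
    vec4 texel = Texel(tex, texture_coords) * color;
    // Hypothetical use: the x channel flashes the sprite toward white.
    texel.rgb = mix(texel.rgb, vec3(1.0), vSpriteState.x);
    return texel;
}
]]

-- Creates a batch with the attribute mesh attached (call from love.load()).
local function newStateBatch(texture, maxSprites)
    local batch = love.graphics.newSpriteBatch(texture, maxSprites, "dynamic")
    -- Four vertices per sprite in the batch.
    local mesh = love.graphics.newMesh(SPRITE_STATE_FORMAT, maxSprites * 4,
        "triangles", "dynamic")
    batch:attachAttribute("SpriteState", mesh)
    local shader = love.graphics.newShader(PIXEL_CODE, VERTEX_CODE)
    return batch, mesh, shader
end

-- Per frame (sketch): mesh:setVertices(stateTable), then
-- love.graphics.setShader(shader) and love.graphics.draw(batch).
```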

A while ago, I experimented with a custom vertex mesh with tilemap SpriteBatches, and it seemed to use a lot of CPU resources to initialize the mesh. I don’t have the code handy right now, but I’m pretty sure that I chose an inefficient function to do the job: one that’s intended for updating a single vertex, not large portions of the mesh. So that turned me away from looking at meshes until now. (I could swear I wrote something about this in an earlier devlog entry. If I find it, I’ll edit in a note.) Updating the attributes with mesh:setVertices(tbl) seems fast enough, and certainly faster than updating uniforms for every sprite.

Anyways, the tilemaps are already batched, so this change would affect sprites. There were a few concerns in switching from love.graphics.draw() and shader:send() to SpriteBatches and Meshes:

- I guess this doesn’t really matter, but: the attributes are per-vertex, while the info I want to pass along is per-sprite. Varying variables are interpolated between the four vertices, which is not helpful in my case. I realized that due to how LÖVE’s vertex table is laid out as a nested table of tables, I can duplicate the same table for every set of four vertices. Then I only need to write the details to one of those four tables. This might help with overhead on the Lua side, but the info is still being uploaded four times for every sprite. Instanced meshes support attributes that are per-instance instead of per-vertex, so that might be an alternative.
- LÖVE SpriteBatches work with quads pointing to one texture atlas. I programmed in a bunch of special-case sprite modes which aren’t compatible: temporary art shapes, text-sprites, sprites which call an arbitrary function, etc.
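The table-sharing idea described above can be sketched like this, assuming the list passed to mesh:setVertices() is a plain Lua array of per-vertex tables with a single four-component attribute. Because each sprite's four vertices reference one shared table, writing once updates all four on the Lua side, although the data is still uploaded four times. The function names are hypothetical.

```lua
-- Builds a vertex list where each sprite's 4 vertices share one table.
local function buildSharedVertices(spriteCount)
    local vertices, perSprite = {}, {}
    for s = 1, spriteCount do
        local state = { 0, 0, 0, 0 } -- one attribute, four components
        perSprite[s] = state
        local base = (s - 1) * 4
        for v = 1, 4 do
            vertices[base + v] = state -- same table, four references
        end
    end
    return vertices, perSprite
end

-- Writes a sprite's values once; all four of its vertices see the change.
local function setSpriteState(perSprite, s, a, b, c, d)
    local t = perSprite[s]
    t[1], t[2], t[3], t[4] = a, b, c, d
end

local vertices, perSprite = buildSharedVertices(3)
setSpriteState(perSprite, 2, 1, 0.5, 0, 0)
-- In LÖVE: mesh:setVertices(vertices)
```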

Here’s what I plan to do regarding the second bullet point:

Temp-art shapes
- Before: Use love.graphics.rectangle(), love.graphics.circle(), etc.
- After: Use pre-drawn 64×64 sprites and stretch them to the correct size
Text Sprites
- Before: Use love.graphics.print() or love.graphics.printf()
- After: Create “sprite-fonts” through the animation system, and assign one character per sprite
  - Most text rendering should be handled through the widget system which isn’t batched.
Function-Sprites
- Before: Call an arbitrary function stored in spr.func
- After: Either replace the functionality with multiple sprites, or re-implement the function in a widget.
Quad-Sprites
- Before: Use an arbitrary quad to draw any portion of the texture atlas
- After: Merge with “normal” sprites: every sprite will now have its own quad, instead of using the texture atlas’s reference quads directly.
  - Quads will also be used to implement per-sprite cropping instead of using GL Scissor Boxes.
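For the last sub-item, cropping with a per-sprite quad comes down to intersecting the sprite's frame with a crop rectangle and shrinking the quad to match. This sketch assumes the crop rectangle is given in the frame's local pixel space and ignores scaling, rotation and flipping; the function is hypothetical. In LÖVE, its result would feed Quad:setViewport(x, y, w, h, sw, sh), and the returned offsets would shift the draw position.

```lua
-- srcX, srcY, srcW, srcH: the sprite's frame in the texture atlas.
-- cropX, cropY, cropW, cropH: visible region, relative to the frame's top-left.
-- Returns the cropped atlas viewport and the draw offset, or nil if empty.
local function cropViewport(srcX, srcY, srcW, srcH, cropX, cropY, cropW, cropH)
    local x1 = math.max(0, cropX)
    local y1 = math.max(0, cropY)
    local x2 = math.min(srcW, cropX + cropW)
    local y2 = math.min(srcH, cropY + cropH)
    if x2 <= x1 or y2 <= y1 then
        return nil -- fully cropped away
    end
    -- Viewport in atlas space, plus where to draw it relative to the sprite.
    return srcX + x1, srcY + y1, x2 - x1, y2 - y1, x1, y1
end

-- A 32x32 frame at (64, 0) in the atlas, cropped to its bottom half.
local qx, qy, qw, qh, ox, oy = cropViewport(64, 0, 32, 32, 0, 16, 32, 16)
-- In LÖVE: quad:setViewport(qx, qy, qw, qh, atlasW, atlasH),
-- then draw at (spriteX + ox, spriteY + oy).
```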

So far, only the temp-art sprite change is implemented. I became sidetracked by the question of whether I should use the same shader for tilemaps and sprites (which is how it worked before this), and if so, how it should be implemented. My tilemap effects are per-layer, not per-tile. While considering a way of incrementally updating tile vertex attributes, I found that writing to higher-index vertices in the attribute mesh, even just one at a time, led to longer draw times. I reached out to the LÖVE forums and they provided a fix. Excellent. Now I’m curious about the performance gains that could come from using love.graphics.drawInstanced() with a mesh. And that’s pretty much where I’m at for December.
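One way to update vertex attributes incrementally is to track the lowest and highest vertex touched since the last upload and re-send only that span. This is a hypothetical sketch: it assumes Mesh:setVertices(vertices, startvertex), which replaces vertices beginning at startvertex, and uses a stand-in mesh so it runs outside LÖVE. Whether this saves time in practice depends on the LÖVE version and driver.

```lua
-- Tracks which vertices changed since the last upload.
local Dirty = {}
Dirty.__index = Dirty

function Dirty.new(vertices)
    return setmetatable({ vertices = vertices, lo = math.huge, hi = 0 }, Dirty)
end

-- Mark vertex i as modified.
function Dirty:touch(i)
    if i < self.lo then self.lo = i end
    if i > self.hi then self.hi = i end
end

-- Upload only vertices lo..hi, then reset. Returns the uploaded count.
function Dirty:flush(mesh)
    if self.hi < self.lo then return 0 end
    local slice = {}
    for i = self.lo, self.hi do
        slice[#slice + 1] = self.vertices[i]
    end
    mesh:setVertices(slice, self.lo)
    local n = self.hi - self.lo + 1
    self.lo, self.hi = math.huge, 0
    return n
end

-- Stand-in mesh that records the last upload (illustration only).
local fakeMesh = {}
function fakeMesh:setVertices(list, start)
    self.lastStart, self.lastCount = start, #list
end

local verts = {}
for i = 1, 100 do verts[i] = { 0, 0, 0, 0 } end
local dirty = Dirty.new(verts)
dirty:touch(40); dirty:touch(43)
local uploaded = dirty:flush(fakeMesh)
```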

The test program will let me profile the following ways of drawing sprites:

Render Method
- Plain calls to love.graphics.draw(); shader state sent as uniforms
- SpriteBatch; attribute mesh for sprite state
- Instanced Mesh; attribute meshes for position, texture coords, color, sprite state
Shaders
- Enabled, Disabled
Means of updating vertices
- Via Tables
- Via ByteData + FFI
JIT
- Enabled, Disabled
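The options above form a small test matrix. A sketch of how a test program could enumerate it, assuming each combination is run as a separate timed pass; the option strings are paraphrased from the list and the helper is hypothetical. Some combinations may not apply (plain draw calls don't upload a vertex buffer), so a real harness would probably filter them. Under LuaJIT, jit.on() and jit.off() with no arguments toggle the whole JIT compiler.

```lua
local OPTIONS = {
    { name = "render", values = { "draw", "spritebatch", "instanced" } },
    { name = "shader", values = { "on", "off" } },
    { name = "update", values = { "tables", "bytedata_ffi" } },
    { name = "jit",    values = { "on", "off" } },
}

-- Builds every combination of option values (a cartesian product).
local function enumerate(options)
    local results = { {} }
    for _, opt in ipairs(options) do
        local nextResults = {}
        for _, partial in ipairs(results) do
            for _, value in ipairs(opt.values) do
                local combo = {}
                for k, v in pairs(partial) do combo[k] = v end
                combo[opt.name] = value
                nextResults[#nextResults + 1] = combo
            end
        end
        results = nextResults
    end
    return results
end

local configs = enumerate(OPTIONS)
for _, cfg in ipairs(configs) do
    if jit then
        -- Whole-engine toggle for the JIT option (LuaJIT only).
        if cfg.jit == "on" then jit.on() else jit.off() end
    end
    -- A timed draw pass for this configuration would go here.
end
if jit then jit.on() end -- restore the default afterwards
```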

Project Outlook

I wish I had ended this month cleanly. I guess I have only myself to blame. Oh well. It’s not the end of the world or anything.

I’m considering restructuring the codebase so that JIT can be selectively enabled on a few specific tasks, such as updating vertex info in ByteData objects. Before I get there, I have to finish my test drawing program and evaluate the performance differences between SpriteBatches and instanced Meshes. I also need to fix the mess I made with sprite structures.
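LuaJIT's jit library can already scope compilation per function or per module, which may be a starting point. Since jit.off() with no arguments switches off the whole compiler, selective enabling would likely mean leaving the engine on and opting the modules that shouldn't be compiled out: calling jit.off(true, true) in a module's main chunk disables compilation for that chunk and every function defined inside it, while hot modules (such as a ByteData vertex writer) are left alone. The module contents below are hypothetical, and the guard keeps the file loadable under plain Lua, where the jit table doesn't exist.

```lua
-- some_module.lua (hypothetical): code that should stay interpreted.
if jit then
    -- 'true' as the first argument means the calling function, i.e. this
    -- module's main chunk; the second 'true' applies the flag recursively
    -- to all functions defined inside it.
    jit.off(true, true)
end

local M = {}

function M.updateLogic(state, dt)
    state.t = state.t + dt
    return state
end

-- A real module file would end with: return M
```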

Stats

Codebase Issue Tickets: 45 (+0)
Total LOC: 121761 (-16)
Core LOC: 36024 (+110)
Estimated Play-through Time: 11:09.46 (+0:00)*

*: Time is from October since nothing in terms of gameplay has changed.

2021 Year in Review

Whelp, this is taking a long time. Here’s a list of some things I did this year, based on skimming my devlog posts:

Player, Gameplay:
- Revamped player attack, added ability to tele-pull thrown shots
- Implemented crouch-walking and hanging from hooks
Art:
- Redesigned ‘bot’ enemies to look less like the player
- Redesigned player legs + feet to have better foot-to-ground readability
- Increased number of frames in most player animations
Collision and Physics:
- Implemented a pseudo fixed-point coordinate system where every pixel contains 4096×4096 sub-positions
- Added support for multiple hitboxes per actor
- Added hitbox region splitting (one hitbox can have different properties on top vs bottom, left vs right, etc.)
- Added sloped moving block actors (‘barricades’) and unified floating platforms with blocks (‘boards’)
Building / Packaging:
- Added a preprocessor which implements named constants and simple macros
- Wrote a Tiled TMX/TSX-to-Lua converter
- Wrote a GraphicsGale GAL-to-PNG converter to smooth out some problems with art production
Engine:
- Revamped actor state management
- Simplified world model to have only one room active at a time
- Implemented managed room-to-room transfers
- Changed renderer to support drawing pixel art “off-grid” depending on the canvas scale

My hope is that I get the rendering stuff cleared up very, very soon, and then focus on more prototype level content. I cut down on a lot of the game’s complexity this year, so that ought to help. It’s been much easier to think about the construction of levels and “who owns what” since the switch away from multiple simultaneously active rooms to just a single room. Anyways, I’ll post another update hopefully around the end of January.

(e 2022/Jan/02: A note about love.graphics.drawInstanced().)
(e 2022/Jan/04: Another note about drawInstanced().)

Balleg Devlog Dec 2021 + Year in Review