Factorio Learning Environment the ultimate Game Agent Eval — Jack Hopkins
AI Summary
Summary of Video: Exploring the Factorio Learning Environment
- Introduction
- Host: Alessio, CTO at Decibel, with guests Jack and Mart, researchers behind the Factorio Learning Environment.
- Discussion on Factorio as a benchmarking environment for models.
- Factorio Overview
- Complexity of Factorio compared to games like Minecraft: 1 million raw resources needed to launch a rocket vs. 200 in Minecraft.
- Example factories producing from 1 to 19 million resources per second, providing a wide range for model comparison.
- Model Interactions and API
- Traditional modding uses Lua scripts; instead, they built an interface through the multiplayer admin console over TCP.
- Models write Python code to interact with the game, allowing high-level actions and managing large factories efficiently.
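The interaction pattern described above can be sketched as a thin client that turns high-level Python calls into console commands sent to the game. This is a toy illustration under assumptions: the class and method names (`FactorioClient`, `place_entity`, `craft`) are hypothetical, not the environment's actual API, and the TCP transport is mocked by recording commands.

```python
from dataclasses import dataclass, field

@dataclass
class FactorioClient:
    """Toy stand-in for a client that forwards commands to the game's
    multiplayer admin console over TCP. All names here are illustrative."""
    sent: list = field(default_factory=list)

    def _console(self, lua: str) -> None:
        # The real environment would write this over a TCP socket;
        # here we just record the command for inspection.
        self.sent.append(lua)

    def place_entity(self, name: str, x: int, y: int) -> None:
        # High-level Python action translated into a console command.
        self._console(f"/c game.surfaces[1].create_entity"
                      f"{{name='{name}', position={{{x},{y}}}}}")

    def craft(self, recipe: str, count: int) -> None:
        self._console(f"/c game.players[1].begin_crafting"
                      f"{{recipe='{recipe}', count={count}}}")

client = FactorioClient()
client.place_entity("burner-mining-drill", 10, 5)
client.craft("iron-gear-wheel", 8)
```

Because the agent emits ordinary Python, it can loop over coordinates or compose helper functions to lay out large factories, rather than issuing one primitive game action at a time.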
- Test Structures: Lab Play vs. Open Play
- Lab Play:
- Controlled tasks to measure specific capabilities of models.
- Evaluates upper performance limits and spatial reasoning.
- Open Play:
- Models create their own objectives in an unbounded environment.
- Some models, like DeepSeek, struggled with longer-term planning and often defaulted to simplistic tasks.
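The two evaluation modes above differ mainly in who defines success: lab play checks a fixed task within a step budget, while open play has no success condition and scores cumulative production. A minimal sketch, with entirely hypothetical task, agent, and scoring definitions (not FLE's actual code):

```python
from dataclasses import dataclass

@dataclass
class CountingTask:
    """Toy lab-play task: succeed once the state counter reaches `target`
    (a stand-in for e.g. 'build a working iron-plate line')."""
    target: int
    step_budget: int

    def initial_state(self) -> int:
        return 0

    def is_solved(self, state: int) -> bool:
        return state >= self.target

class IncrementAgent:
    """Trivial agent that advances the toy state by one each step."""
    def act(self, state: int) -> int:
        return state + 1

def run_lab_play(agent, task) -> bool:
    # Controlled task: binary success within a fixed step budget.
    state = task.initial_state()
    for _ in range(task.step_budget):
        state = agent.act(state)
        if task.is_solved(state):
            return True
    return False

def run_open_play(agent, steps: int) -> int:
    # Unbounded setting: no goal is given; the score is cumulative
    # "production", so long-horizon planning dominates.
    state, produced = 0, 0
    for _ in range(steps):
        state = agent.act(state)
        produced += state
    return produced

assert run_lab_play(IncrementAgent(), CountingTask(target=3, step_budget=10))
print(run_open_play(IncrementAgent(), 4))  # → 10
```

The split matters for interpreting results: a model can score well on lab play through spatial reasoning alone, yet stall in open play if it cannot set and pursue its own longer-term objectives.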
- Model Performance Findings
- Models like Claude excel at setting long-term objectives compared to others such as DeepSeek.
- Importance of spatial reasoning ability in Lab Play versus planning ability in Open Play.
- Future Directions
- Experiments adding vision via screenshot analysis showed only limited improvement so far.
- Potential for improved future performance as models evolve.
- Ongoing efforts to evaluate models in more complex settings, including incorporating skills and goal-setting behaviors for alignment studies.
- Conclusion
- Open invitation to collaborate on further development of the environment and its application to alignment research.