Why not adapt and fine-tune from an existing base model, as their training is open ended and does not yet include the assistant role? If you are to train from scratch, what do the synthetic data, or the model's environment look like? Is this "embodiment" for the model grounded in a video game, a random narrative, or something else? Is there to be any components to the "experience" besides word/token input/output? What do you see as the benefit of this project, if it's successful?