I spent some time recently getting some models uploaded onto the Hugging Face Hub. I’d trained a bunch of GPT-2 small sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone that was interested. I managed to get it done, but it was kind of tricky to get right.
The Hugging Face documentation is great if you’re using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they’re all a bit vague and there’s nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end.
This post is the tutorial I wish I’d found before I started, and I hope it’s useful for people in a similar position. The one warning I’d give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, and so I could just use the version that is built into Transformers. The setup you need to do with custom tokenisers doesn’t look all that different to what you need to do for custom models, but as I haven’t spent lots of time looking into it, I won’t try to write a tutorial for something I’ve not done :-)
Firstly, why would you want to upload a model you’ve trained to Hugging Face? Well, let’s say you’ve written and trained your own LLM – you’re learning how they work, or you’ve got a brilliant idea about how to tweak transformers to get that one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share it?
You could, of course, just dump the code on GitHub and share the weights somewhere. If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM – run inference, fine-tune it, and so on.
That’s quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face’s Transformers library, using models that had been uploaded to their hub.
What it would be nice to do is share the model within the Hugging Face ecosystem in a way that works smoothly. Let people run inference on it like this:
from transformers import pipeline
pipe = pipeline(task="text-generation", model="some-hf-user/some-model-name", trust_remote_code=True)
out = pipe(
"Every effort moves you",
max_new_tokens=20,
do_sample=True,
temperature=1.4,
top_k=25,
)
print(out[0]["generated_text"])
...rather than something daunting like this code with its 24 lines just to sample a few tokens from the model. Or to train it using code like what you see in this notebook – a bit of config then trainer.train – rather than like this, with its >100-line train function.
Here’s what I had to do to get it working.
The baseline
To make it easier to follow along with this post, I’ve created a GitHub repo. As a starting point, I recommend you clone that, and then check out the baseline tag:
giles@perry:~/Dev $ git clone https://github.com/gpjt/hf-tutorial-post.git
Cloning into 'hf-tutorial-post'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 24 (delta 5), reused 19 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (24/24), 37.23 KiB | 866.00 KiB/s, done.
Resolving deltas: 100% (5/5), done.
giles@perry:~/Dev $ cd hf-tutorial-post/
giles@perry:~/Dev/hf-tutorial-post (main)$ git checkout baseline
Note: switching to 'baseline'.
You are in 'detached HEAD' state. You can look around, make experimental
...rest of warning skipped...
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 4047a91 Added baseline code
giles@perry:~/Dev/hf-tutorial-post $
You’ll see that there’s a gpt.py file, which contains my version of the GPT-2 style LLM code from Sebastian Raschka’s book "Build a Large Language Model (from Scratch)". There’s also a script called inference_run.py, which is some code to run a model and get it to predict the 20 next words after the string Every effort moves you, and a config file for the LLM code called model.json, which tells it the number of layers, attention heads, and so on.
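To give a concrete idea of what that config file contains, here’s the sort of thing that’s in model.json – these are the same cfg values you’ll see again in the generated config.json later in this post, so treat them as illustrative of my GPT-2 small setup rather than a spec:
{
    "context_length": 1024,
    "drop_rate": 0.1,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "qkv_bias": false,
    "vocab_size": 50257
}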
If you want to use it and see what it comes up with, you can download the model weights from one of my training runs, and install the dependencies with uv sync (recommended) or by running it in a Python environment with the libraries listed in pyproject.toml installed.
You’ll get something like this:
giles@perry:~/Dev/hf-tutorial-post $ uv run inference_run.py ./model.json ./model.safetensors
Every effort moves you through the process to make it happen. But we still want to bring it to all of your dreams
Your output will probably vary (for this and the later examples), as you’d expect from sampled LLM output, but it should at least be reasonably coherent.
So: let’s get it on Hugging Face!
The from_pretrained methods
Our goal of being able to run inference with Transformers’ pipeline system relies on a couple of deeper levels of abstraction.
The pipeline requires that the model be available for download – complete with all of its code and weights – using code like this:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("some-hf-user/some-model-name", trust_remote_code=True)
AutoModelForCausalLM is the HF abstraction for models that generate text.
If that trust_remote_code flag concerns you: yes, it is a bit scary-looking. But remember that our goal here is to share a model on HF that has its own code, and that means that anyone who downloads it has to opt in to downloading and running that code – the flag is how they do that opt-in. So it is, unfortunately, necessary.
Now, that model will need a tokeniser in order to run. Perhaps not surprisingly, the HF system expects to be able to download that with similar code:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("some-hf-user/some-model-name")
With both of those working – appropriate code for our pretrained models, and a bit (well, to be fair, quite a lot) of configuration – we’ll be all set.
But that’s quite a big jump. There is a more general Auto class called AutoModel; it’s much simpler, just wrapping a generic model that might be doing anything. If we support it, we’ll still need to use all of that clunky inference code, but the model’s code and weights will be on Hugging Face Hub, and can be downloaded and instantiated easily.
So let’s get that working first, just to work out the bugs and get the basic process down pat.
AutoModel.from_pretrained
Our goal is to be able to run this in a Python environment where we just have transformers and torch installed:
from transformers import AutoModel
model = AutoModel.from_pretrained("some-hf-user/some-model-name", trust_remote_code=True)
...and then have a model that we can run inference on, just like the code in our repo, but without the hassle of having to download the weights ourselves. Definitely a QoL improvement, even if it’s not the endgame.
If you’re following along with the git repo, the tag to check out for this section is automodel. In this version, you’ll see a new subdirectory to contain our HF wrapper code (which I’ve imaginatively called hf_wrapper); you’ll see why we need that later.
In there, I’ve added a symlink to the model code gpt.py itself (also to be explained later), an empty __init__.py file to make the directory a Python package, and two files with some Transformers code:
- configuration_gpjtgpt2.py
- modeling_gpjtgpt2.py
Let’s dig into what’s going on in those two.
The first thing to understand is that whole gpjtgpt2 thing in the filenames. Transformers is designed to handle all kinds of different models – for example, Meta’s Llama models and Qwen’s models have their own codebases. These widely-used public models have code that is already built into the library, with "model types" like llama4 and qwen3_vl_moe respectively – but we don’t have that advantage. Our code is not built into the library.
So we need a distinct name for our type of model, which will let the library know that it has its own code and it shouldn’t try to rely on built-in stuff. I chose gpjtgpt2 because my Hugging Face username is my initials, gpjt 1, and this model is the implementation of the GPT-2 architecture I’m playing with. That feels like a solid pattern to me – it’s unlikely to clash with anything built in. But the format appears to be fairly free-form, so you can choose pretty much anything so long as you’re consistent throughout your code, and so long as it doesn’t clash with any of the built-ins.
So, you need two files with those specific names: configuration_your-model-type.py, and modeling_your-model-type.py. Let’s look at them now. They’re really simple at this stage; here’s the configuration one:
from transformers import PretrainedConfig
class GPJTGPT2Config(PretrainedConfig):
model_type = "gpjtgpt2"
def __init__(self, cfg=None, **kwargs):
self.cfg = cfg
super().__init__(**kwargs)
Now, when Transformers is loading a model with AutoModel.from_pretrained, it’s going to need to know how to configure it. At the very least, it will need to know what to pass into the __init__. If you look at the gpt.py code, it takes a config dictionary with stuff like the number of layers, the number of attention heads, and what-have-you. That’s required to instantiate the model with the right setup so that it can load the weights that we’re providing. There’s other config stuff that will come in later, but that’s all we have for now.
It does this using the same pattern as the various from_pretrained methods we were looking at earlier:
from transformers import AutoConfig
model = AutoConfig.from_pretrained("some-hf-user/some-model-name")
All we’re doing here is defining what kind of thing that method will return when it’s all set up properly.
You can see that we’re inheriting from a PretrainedConfig class – this provides all of the infrastructure we’re going to need to push things to HF. I don’t think that the name of the config class technically matters, but it definitely seems like best practice to name it based on the model name – so, we’re using GPJTGPT2Config for our gpjtgpt2 model. However, the model_type is important – it has to match the model type that we’ve chosen and used for our filenames.
Apart from that, we’re stashing away the config that we’re provided on a cfg field, and then calling our superclass __init__, forwarding on any kwargs we got in our own __init__.
Now let’s look at modeling_gpjtgpt2.py:
from transformers import PreTrainedModel
from .configuration_gpjtgpt2 import GPJTGPT2Config
from .gpt import GPTModel
class GPJTGPT2Model(PreTrainedModel):
config_class = GPJTGPT2Config
def __init__(self, config):
super().__init__(config)
self.model = GPTModel(config.cfg)
self.post_init()
def forward(self, input_ids, **kwargs):
return self.model.forward(input_ids)
Just as with the config, there’s PreTrainedModel for us to inherit from 2. We’re defining the thing that AutoModel.from_pretrained will return when it’s all set up properly.
We tell Transformers that this should be configured with the GPJTGPT2Config that we just defined using that config_class class variable, but apart from that, we’re basically just wrapping the GPTModel that is defined in gpt.py 3. That is imported with a relative import – from .gpt rather than from gpt:
from .gpt import GPTModel
This is important – it has to be that way, as we’ll discover later. But for now: that’s why we had to create the hf_wrapper subdirectory and the symlink to gpt.py – a relative import in Python only works from inside a package, so we would not have been able to do that kind of import if the files were at the top level of our repo.
Now, let’s take a look at the __init__. We’re calling the superclass __init__, as you’d expect, then we’re creating an underlying wrapped GPTModel. We’re expecting a GPJTGPT2Config parameter, which has the underlying model’s configuration stashed away in its cfg field by its own __init__, so we can pass that down to the wrapped model.
Finally, we call the special self.post_init() method, which does some extra setup. Prior to Transformers 5.0.0 you could get away without calling it, but now it’s 100% necessary – without it, the model won’t initialise its internal fields relating to whether or not it uses weight tying.
Now let’s take a look at how we actually use those to upload the model. That’s back at the root of the repo, in the file upload_model.py. Before looking at the code, try running it:
giles@perry:~/Dev/hf-tutorial-post $ uv run upload_model.py --help
Usage: upload_model.py [OPTIONS] MODEL_CONFIG_PATH MODEL_SAFETENSORS_PATH
HF_MODEL_NAME
Options:
--help Show this message and exit.
So, it takes a model config path – that model.json file we have to set the number of layers and so on – and the path of a safetensors file containing the weights. It will then try to upload our HF-friendly wrapped version of the model – code, weights and config – to the Hub.
Let’s see how it works.
import json
from pathlib import Path
import click
from safetensors.torch import load_file
from hf_wrapper.configuration_gpjtgpt2 import GPJTGPT2Config
from hf_wrapper.modeling_gpjtgpt2 import GPJTGPT2Model
We do some boilerplate imports, and then import our config and our model classes – importantly, via the hf_wrapper submodule. Don’t worry, we’re getting close to the explanation of why that is :-)
Next:
@click.command()
@click.argument("model_config_path")
@click.argument("model_safetensors_path")
@click.argument("hf_model_name")
def main(model_config_path, model_safetensors_path, hf_model_name):
if not Path(model_config_path).is_file():
raise Exception(f"Could not find model config at {model_config_path}")
with open(model_config_path, "r") as f:
model_config = json.load(f)
if not Path(model_safetensors_path).is_file():
raise Exception(f"Could not find model safetensors at {model_safetensors_path}")
A bit of argument-validation boilerplate, plus loading the model config file into a dictionary so that we can use it – and now we get to the meat of it:
GPJTGPT2Config.register_for_auto_class()
What this is doing is telling our GPJTGPT2Config to register itself as the thing that will be returned by the AutoConfig.from_pretrained call. This only applies locally for now, but by setting things up locally we’re telling the library what it will need to push up to the hub later. Next:
GPJTGPT2Model.register_for_auto_class("AutoModel")
We’re doing exactly the same for our model, saying that it should be returned from AutoModel.from_pretrained. We need to be explicit about which of the various model classes we want to register it for – the config class can only be loaded from AutoConfig.from_pretrained, whereas the model might be something we’d want to have returned from AutoModelForCausalLM.from_pretrained, or if it was a different kind of model, perhaps AutoModelForImageTextToText.from_pretrained, or something else entirely.
What we want to do here is expose the basic model using AutoModel, so that’s what we do.
Next:
config = GPJTGPT2Config(model_config)
We’re creating our config class, passing in that model configuration that we loaded from the model.json file earlier, so that it will stash it on its cfg field, then:
model = GPJTGPT2Model(config)
...we create our model wrapper using that config. We now have an instance of our custom model, but with uninitialised weights. So:
model.model.load_state_dict(load_file(model_safetensors_path))
...we load in the weights that were specified on the command line. Note that we have to load them into the wrapped model. The model.safetensors file we have is specifically for the custom GPTModel that we want to publish, not for the wrapped GPJTGPT2Model one. But that’s easily done by using the model.model field.
Finally, the magic:
model.push_to_hub(hf_model_name)
This is where the Transformers library really shows its strength. It will push the model, which means it needs to push the weights that we loaded into its wrapped GPTModel. Then it will look at the class GPJTGPT2Model that defines the model, and will push the modeling_gpjtgpt2.py file that has the source for that class. It will see that it also has a dependency on GPJTGPT2Config, and will push that and its source configuration_gpjtgpt2.py.
It will also spot the setup we did with our two calls to the different register_for_auto_class methods above to register them for the AutoConfig.from_pretrained and AutoModel.from_pretrained and push that too.
And when it’s pushing the source, it will try to push the source of any dependencies too. This is where we get the final explanation of why we had to put it in a submodule, and have a symlink to gpt.py. The push_to_hub code doesn’t want to upload loads of extra stuff – for example, any libraries you’re using. It wants to be sure that it’s only uploading your model code.
The logic it uses for deciding whether or not something is part of the uploadable set of files is "was it imported relatively from the modeling_ or the configuration_ file" – that is, with a dot at the start of the module name, from .something import SomethingElse rather than from something import SomethingElse.
In order to do that kind of import, we needed to create a submodule. And in order to access our gpt.py file we need a copy of it inside the submodule. I didn’t want to have two actual copies of the file – too easy to let them get out of sync – so a symlink sorts that out.
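For what it’s worth, the symlink itself is nothing special – if you’re replicating this layout for your own model, something like this, run from the repo root, would set it up (a plain ln -s on the command line works equally well):
from pathlib import Path

# Create hf_wrapper/gpt.py as a symlink pointing back at the
# top-level gpt.py, so there's only ever one real copy of the code.
Path("hf_wrapper/gpt.py").symlink_to("../gpt.py")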
Hopefully that clears up any mystery about this slightly-strange file layout.
Let’s give it a go and see what it creates! In order to upload a model to the HF Hub, you’ll need an account, of course, so create one if you don’t have one. Next, create an access token with write access – the option is in the "Access Tokens" section of the "Settings".
Then you need to authorize your local machine to access the hub using that token; if you’re using uv, then you can just run:
uvx hf auth login
If you’re not, you’ll need to download and install the HF CLI and then run
hf auth login
That will store stuff on your machine so that you don’t need to log in again in the future – if you’re concerned about security, there’s an hf auth logout you can call, and you can completely trash the session by deleting the associated token from the HF website.
Now, let’s run our upload script!
giles@perry:~/Dev/hf-tutorial-post $ uv run upload_model.py model.json model.safetensors gpjt/test1
Processing Files (1 / 1) : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 702MB / 702MB, 270MB/s
New Data Upload : | | 0.00B / 0.00B, 0.00B/s
..._ehrlvi/model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 702MB / 702MB
You’ll need to change the target HF model name at the end of the command to one with your username before the slash, of course.
Once you’ve done that, take a look at the model on Hugging Face. You’ll see a rather ugly default model card, but let’s ignore that for now and take a look at the "Files and versions" tab.
You should see the following files:
- .gitattributes – a file telling git (which is used to manage the models on the hub) which file types should use the Large File Storage (LFS) plugin. Big binary files don’t play nicely with git, so it uses LFS for them. We don’t need to pay much more attention to that for our purposes.
- README.md – that ugly model card. Updating that is useful, but out of scope for this post.
- config.json – we’ll come back to that one in a moment.
- configuration_gpjtgpt2.py – a copy of the file we created locally with our GPJTGPT2Config class.
- gpt.py – again, the same file as the local one, uploaded due to that clever dependency-finding stuff.
- model.safetensors – our weights. There should be an icon next to it to say that it’s stored using the LFS system.
- modeling_gpjtgpt2.py – once more, a file that was just copied up from our local filesystem.
Now, let’s look into that config.json. It will look like this:
{
"architectures": [
"GPJTGPT2Model"
],
"auto_map": {
"AutoConfig": "configuration_gpjtgpt2.GPJTGPT2Config",
"AutoModel": "modeling_gpjtgpt2.GPJTGPT2Model"
},
"cfg": {
"context_length": 1024,
"drop_rate": 0.1,
"emb_dim": 768,
"n_heads": 12,
"n_layers": 12,
"qkv_bias": false,
"vocab_size": 50257
},
"dtype": "float32",
"model_type": "gpjtgpt2",
"transformers_version": "4.57.6"
}
The architectures bit is just showing the name of the class that was used in the push_to_hub call. This will become useful later when we get onto the pipeline code, but doesn’t matter right now – the next one is more important.
The auto_map is essentially saying: if someone does AutoConfig.from_pretrained on this model, then use the configuration_gpjtgpt2.GPJTGPT2Config class from here, and likewise AutoModel.from_pretrained should use modeling_gpjtgpt2.GPJTGPT2Model. It’s what that register_for_auto_class stuff we did in the upload script set up.
The cfg is just the parameters that we’re threading down to our underlying custom GPTModel class; nothing exciting there.
The dtype is, of course, the floating point type we’re using for the model, and the model_type is our unique name for this particular architecture. And the transformers_version is the version of the library used to upload it, presumably used to determine compatibility when downloading models with earlier or later versions.
So, it looks like there’s enough information across those files on the hub to instantiate and use our model! Let’s give that a go.
The best way to check it out thoroughly is to create a completely fresh directory, away from our existing ones, and a fresh environment:
giles@perry:~/Dev/hf-tutorial-post $ mkdir /tmp/test1
giles@perry:~/Dev/hf-tutorial-post $ cd /tmp/test1
giles@perry:/tmp/test1 $ uv init
Initialized project `test1`
giles@perry:/tmp/test1 $ uv add transformers torch accelerate tiktoken ipython
Using CPython 3.14.2 interpreter at: /usr/bin/python3.14
Creating virtual environment at: .venv
Resolved 64 packages in 109ms
...junk skipped...
+ typing-extensions==4.15.0
+ urllib3==2.6.3
+ wcwidth==0.3.1
giles@perry:/tmp/test1 $ uv run ipython
and then to try to use the model:
In [1]: from transformers import AutoModel
In [2]: model = AutoModel.from_pretrained("gpjt/test1", trust_remote_code=True)
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 438/438 [00:00<00:00, 1.52MB/s]
configuration_gpjtgpt2.py: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 217/217 [00:00<00:00, 889kB/s]
A new version of the following files was downloaded from https://huggingface.co/gpjt/test1:
- configuration_gpjtgpt2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_gpjtgpt2.py: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 394/394 [00:00<00:00, 1.99MB/s]
gpt.py: 5.07kB [00:00, 12.4MB/s]
A new version of the following files was downloaded from https://huggingface.co/gpjt/test1:
- gpt.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/gpjt/test1:
- modeling_gpjtgpt2.py
- gpt.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 702M/702M [00:09<00:00, 71.6MB/s]
In [3]: type(model)
Out[3]: transformers_modules.gpjt.test1.b936caf64b6776917478339cbcf9f95bdca7dda9.modeling_gpjtgpt2.GPJTGPT2Model
So we can see where Transformers has put the downloaded code, inside a submodule whose name looks like the commit hash of the model repo on the Hub. Now let’s try to run some inference on it:
In [4]: import math
...: import tiktoken
...: import torch
...:
...: tokenizer = tiktoken.get_encoding("gpt2")
...:
...: input_text = "Every effort moves you"
...: tokens = tokenizer.encode(input_text)
...:
...: num_tokens = 20
...: temperature = 1.4
...: top_k = 25
...: with torch.no_grad():
...: for ix in range(num_tokens):
...: input_tensor = torch.tensor(
...: tokens, dtype=torch.long
...: ).unsqueeze(0)
...: output_tensor = model(input_tensor)
...: logits = output_tensor[:, -1, :]
...: top_logits, _ = torch.topk(logits, top_k)
...: min_val = top_logits[:, -1]
...: logits = torch.where(
...: logits < min_val,
...: torch.tensor(-math.inf).to(logits.device),
...: logits
...: )
...: logits /= temperature
...: probs = torch.softmax(logits, dim=-1)
...: next_token = torch.multinomial(probs, num_samples=1).item()
...: tokens.append(next_token)
...:
...: print(tokenizer.decode(tokens))
Every effort moves you to take on what’s coming—from developing you the skills you need to build an online
So there we go! We’ve gone from a situation where we would have to publish the code and the safetensors in some way and tell people how to combine them, to a neatly-packaged model that we can download, fully set up, with just one line:
model = AutoModel.from_pretrained("gpjt/test1", trust_remote_code=True)
But that inference loop is still a pig; if you’ve been working with LLM code then it’s not too bad – a basic bit of autoregression with top-k and temperature – but it’s definitely holding us back. What next?
AutoTokenizer.from_pretrained
One obvious issue with the code above is that we still have that dependency on tiktoken. If we’re going to run inference using the simple HF pipeline object, it’s going to need to know how to encode the input and decode the outputs. And if you have your own tokeniser (which, if you have a truly custom model, you probably do) then you won’t have the luxury of being able to just install it into the target runtime env – you would still need to copy files around.
Now, as I said at the start, I’m not going to go into this in as much detail, because my use case was really simple – although I was using tiktoken, the specific tokeniser I was using from that library was the standard GPT-2 one. Transformers has its own version of that installed. So here I’ll explain how you do things for models that use a built-in Transformers tokeniser. After that I’ll give some pointers that you might find useful if you’re using something more custom.
The good news if you’re using a "standard" tokeniser that is already built into the Transformers library is that you can tell your model to use it. The downside is that you can’t do it by using the register_for_auto_class trick that we did above – that is, you can’t just import it:
from transformers import GPT2Tokenizer
...and then add this below our previous calls to register the model and config as auto classes:
GPT2Tokenizer.register_for_auto_class("AutoTokenizer")
That will essentially do nothing.
However, tokenisers do have their own push_to_hub method, and the target that you specify can be your model. So, for my own models, I’m using this:
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.push_to_hub(hf_model_name)
That is, we get the tokeniser for the built-in GPT-2 implementation (specifically the "fast" one, written in Rust), set the padding token to the end-of-sequence one for tidiness (not sure why that’s not the case by default), and then push it to the model.
If you’re following along with the code, you can check out the autotokenizer-gpt-2 tag to see that. The code goes immediately after we’ve pushed the model itself to the hub.
So, run the upload again:
giles@perry:~/Dev/hf-tutorial-post ((HEAD detached at autotokenizer-gpt-2))$ uv run upload_model.py model.json model.safetensors gpjt/test2
Processing Files (1 / 1) : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 702MB / 702MB, 339MB/s
New Data Upload : | | 0.00B / 0.00B, 0.00B/s
...w05qhqd/model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 702MB / 702MB
And now we can do a completely fresh env without tiktoken:
giles@perry:~/Dev/hf-tutorial-post $ mkdir /tmp/test2
giles@perry:~/Dev/hf-tutorial-post $ cd /tmp/test2
giles@perry:/tmp/test2 $ uv init
Initialized project `test2`
giles@perry:/tmp/test2 $ uv add transformers torch accelerate ipython
Using CPython 3.14.2 interpreter at: /usr/bin/python3.14
Creating virtual environment at: .venv
Resolved 63 packages in 113ms
░░░░░░░░░░░░░░░░░░░░ [0/61] Installing wheels... warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
If the cache and target directories are on different filesystems, hardlinking may not be supported.
If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
Installed 61 packages in 585ms
+ accelerate==1.12.0
...junk skipped...
+ wcwidth==0.3.1
giles@perry:/tmp/test2 $ uv run ipython
In there, we can see that AutoTokenizer.from_pretrained works:
In [1]: from transformers import AutoTokenizer
In [2]: tokenizer = AutoTokenizer.from_pretrained("gpjt/test2", trust_remote_code=True)
configuration_gpjtgpt2.py: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 217/217 [00:00<00:00, 591kB/s]
A new version of the following files was downloaded from https://huggingface.co/gpjt/test2:
- configuration_gpjtgpt2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 507/507 [00:00<00:00, 1.58MB/s]
vocab.json: 798kB [00:00, 6.84MB/s]
merges.txt: 456kB [00:00, 9.61MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 131/131 [00:00<00:00, 426kB/s]
tokenizer.json: 3.56MB [00:00, 25.3MB/s]
In [3]: tokenizer.pad_token
Out[3]: '<|endoftext|>'
(Note that I had to use trust_remote_code here – that appears to be new in Transformers 5.0.0.)
And do our inference test:
In [4]: from transformers import AutoModel
In [5]: model = AutoModel.from_pretrained("gpjt/test2", trust_remote_code=True)
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 438/438 [00:00<00:00, 1.65MB/s]
configuration_gpjtgpt2.py: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 217/217 [00:00<00:00, 860kB/s]
A new version of the following files was downloaded from https://huggingface.co/gpjt/test2:
- configuration_gpjtgpt2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_gpjtgpt2.py: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 394/394 [00:00<00:00, 1.80MB/s]
gpt.py: 5.07kB [00:00, 12.4MB/s]
A new version of the following files was downloaded from https://huggingface.co/gpjt/test2:
- gpt.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/gpjt/test2:
- modeling_gpjtgpt2.py
- gpt.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 702M/702M [00:07<00:00, 99.2MB/s]
In [6]: import math
...: import torch
...:
...: input_text = "Every effort moves you"
...: tokens = tokenizer.encode(input_text)
...:
...: num_tokens = 20
...: temperature = 1.4
...: top_k = 25
...: with torch.no_grad():
...: for ix in range(num_tokens):
...: input_tensor = torch.tensor(
...: tokens, dtype=torch.long
...: ).unsqueeze(0)
...: output_tensor = model(input_tensor)
...: logits = output_tensor[:, -1, :]
...: top_logits, _ = torch.topk(logits, top_k)
...: min_val = top_logits[:, -1]
...: logits = torch.where(
...: logits < min_val,
...: torch.tensor(-math.inf).to(logits.device),
...: logits
...: )
...: logits /= temperature
...: probs = torch.softmax(logits, dim=-1)
...: next_token = torch.multinomial(probs, num_samples=1).item()
...: tokens.append(next_token)
...:
...: print(tokenizer.decode(tokens))
Every effort moves you forward as you become a successful artist. That’s not to say there’s any
It may not be much shorter than the code we had when we just had the AutoModel, but it’s an important step forward: we can now download and run inference on our custom model with none of the custom code – neither the model itself nor the tokeniser – on the machine where we’re doing it. Everything is nicely packaged on the HF Hub.
Now, what if you’re using a tokeniser that’s not already in Transformers? There are two possibilities here:
- You’re using the HF Tokenizers library. With that, you can save your tokeniser to a JSON file, then you could load that into a Transformers PreTrainedTokenizerFast object, which provides a push_to_hub method to push it like I did with the one above (there’s a sketch of this below).
- You’ve got something completely custom. Just like there is a configuration_gpjtgpt2.py and a modeling_gpjtgpt2.py, I believe you can also add a tokenization_gpjtgpt2.py that defines a subclass of PreTrainedTokenizer, and then you can push that to the Hub just like we did our model wrapper class.
As I said, I have not done either of these, but that’s the direction I’d explore if I needed it. If you do either and want to share your experiences, then please do leave a comment below! And likewise, if and when I start writing things with custom tokenisers, I’ll link to the details of how to upload them then.
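That said, here’s a rough sketch of what the first option might look like – wrapping a tokenizer.json saved from the Tokenizers library in a PreTrainedTokenizerFast and pushing it to the model repo. I haven’t exercised this against a real custom tokeniser, so treat the file name and the special token as placeholders:
from transformers import PreTrainedTokenizerFast

# "tokenizer.json" here is assumed to be a file saved from the
# Tokenizers library, e.g. with tokenizer.save("tokenizer.json").
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    eos_token="<|endoftext|>",  # placeholder special token
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.push_to_hub("some-hf-user/some-model-name")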
Anyway, we’ve got the tokeniser done to the level we need for this walkthrough, so let’s do the QoL improvements so that we can run inference on the model using the nice HF pipeline abstraction.
AutoModelForCausalLM.from_pretrained for inference
Let’s look at our target code for inference again:
from transformers import pipeline
pipe = pipeline(task="text-generation", model="some-hf-user/some-model-name", trust_remote_code=True)
out = pipe(
"Every effort moves you",
max_new_tokens=20,
do_sample=True,
temperature=1.4,
top_k=25,
)
print(out[0]["generated_text"])
The version of the code that does this is in the repo on the tag causal-lm-inference, but I’ll explain how it was put in place, with the logic behind each step.
In order to run a text-generation pipeline, we’re going to need to wrap our model in something that provides the interface for LLMs in the Hugging Face ecosystem: AutoModelForCausalLM. So, our first step is to put the plumbing in place so that we can use the from_pretrained method on that class to download our wrapped model.
IMO it’s cleanest to have two separate model classes: one for "simple" inference that is just a regular model – the AutoModel we have right now – and one providing the richer interface that supports easy text generation. So we can start off by adding the basic structure to modeling_gpjtgpt2.py:
class GPJTGPT2ModelForCausalLM(PreTrainedModel):
config_class = GPJTGPT2Config
def __init__(self, config):
super().__init__(config)
self.model = GPTModel(config.cfg)
self.post_init()
def forward(self, input_ids, **kwargs):
return self.model.forward(input_ids)
We can then add the code to register that to our upload_model.py script – it’s the last line in this snippet, just below the two that already exist.
GPJTGPT2Config.register_for_auto_class()
GPJTGPT2Model.register_for_auto_class("AutoModel")
GPJTGPT2ModelForCausalLM.register_for_auto_class("AutoModelForCausalLM")
That feels like it should be enough, but for reasons I’ve not been able to pin down, it’s not – you also need to massage the auto_map in the config object to make it all work properly. So later in the script, once we’ve created the config object, we need this:
config.auto_map = {
"AutoConfig": "configuration_gpjtgpt2.GPJTGPT2Config",
"AutoModel": "modeling_gpjtgpt2.GPJTGPT2Model",
"AutoModelForCausalLM": "modeling_gpjtgpt2.GPJTGPT2ModelForCausalLM",
}
With that in place, we could just upload our model – AutoModelForCausalLM.from_pretrained("some-hf-user/some-model-name", trust_remote_code=True) would work just fine. But the model that it would return would not be any different to the one we’ve been using so far. To get that to work, we need to update the model to say that it can generate text. That’s actually pretty easy.
Firstly, we need it to inherit from a mixin class provided by Transformers:
from transformers.generation import GenerationMixin
...
class GPJTGPT2ModelForCausalLM(PreTrainedModel, GenerationMixin):
Now, the semantics of the forward method on this class are a bit different to the ones we had previously; we were just returning the outputs of the last layer of the underlying model, the logits. For this kind of model, we need to put them in a wrapper – the reasoning behind this will become clearer when we get on to training. So our forward pass needs to change to look like this:
from transformers.modeling_outputs import CausalLMOutput
...
def forward(self, input_ids, **kwargs):
logits = self.model.forward(input_ids)
return CausalLMOutput(logits=logits)
Finally, some changes to our config class. For text generation, Transformers needs to know how many hidden layers the model has 4. In the case of the model I’m using to demonstrate, that’s the n_layers parameter in the underlying configuration, so this can go inside the __init__:
if cfg is not None:
self.num_hidden_layers = cfg["n_layers"]
Another change in the config that took me a while to puzzle out, and might catch you if you’re in the same situation: Transformers, by default, assumes that the model caches previous inputs. So in an autoregressive loop starting with Every effort moves you, the first run of the model will get the full input; let’s say it returns to. The next iteration of the loop, however, won’t be passed the full new sequence Every effort moves you to, but rather just the token that was generated last time around, to.
So you’ll get a series of predicted tokens where the first one might make sense but the rest degenerate into gibberish:
Every effort moves you to it was,
-1) with and the best that they are to not been the place
All of the tokens generated after to had just the previous token as their context.
Luckily, you just need to specify in the config class that your model doesn’t have a cache; this also goes after the call to the superclass __init__:
self.use_cache = False
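Pulling those config tweaks together, our GPJTGPT2Config ends up looking something like this – a sketch; the version on the causal-lm-inference tag is the definitive one:
from transformers import PretrainedConfig


class GPJTGPT2Config(PretrainedConfig):
    model_type = "gpjtgpt2"

    def __init__(self, cfg=None, **kwargs):
        self.cfg = cfg
        if cfg is not None:
            # Text generation needs to know the number of hidden layers
            self.num_hidden_layers = cfg["n_layers"]
        super().__init__(**kwargs)
        # No KV cache in our model, so tell generate() to pass the full
        # sequence on every iteration rather than just the newest token
        self.use_cache = False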
We’re almost there! At this point, we actually have all of the code that we need for a working AutoModelForCausalLM.from_pretrained. But there’s one final tweak.
A model on the hub has a "default" architecture, which is the class that we use when we do the original push_to_hub. You might remember that it appeared in the config.json in that single-element list keyed on architectures.
Previously we had this in our upload script:
model = GPJTGPT2Model(config)
...
model.push_to_hub(hf_model_name)
That means that our default is the GPJTGPT2Model model. But when the pipeline creates a model for us, it will just use the default – even for the text-generation task, it doesn’t assume we want to use the AutoModelForCausalLM.
Luckily, that’s a small change: we just upload our text-generation model instead of the basic one:
model = GPJTGPT2ModelForCausalLM(config)
...
model.push_to_hub(hf_model_name)
With all of that in place, we can run the script, upload the model, and then in a fresh environment:
In [1]: from transformers import pipeline
In [2]: pipe = pipeline(task="text-generation", model="gpjt/test3", trust_remote_code=True)
...: out = pipe(
...: "Every effort moves you",
...: max_new_tokens=20,
...: do_sample=True,
...: temperature=1.4,
...: top_k=25,
...: )
...: print(out[0]["generated_text"])
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 572/572 [00:00<00:00, 1.29MB/s]
configuration_gpjtgpt2.py: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 331/331 [00:00<00:00, 965kB/s]
A new version of the following files was downloaded from https://huggingface.co/gpjt/test3:
- configuration_gpjtgpt2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
modeling_gpjtgpt2.py: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 905/905 [00:00<00:00, 2.55MB/s]
gpt.py: 5.07kB [00:00, 8.00MB/s]
A new version of the following files was downloaded from https://huggingface.co/gpjt/test3:
- gpt.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/gpjt/test3:
- modeling_gpjtgpt2.py
- gpt.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████