Hugging Face

I have been blown away exploring Hugging Face. It’s a community on a mission “to democratize good machine learning”. It provides access to a huge library of state-of-the-art models. So far I have only scratched the surface of what is available, but this blog gives a sample of things I have tried.

At the time of writing, there were 128,463 pre-trained models covering a huge range of capabilities, including computer vision, natural language processing, audio, tabular, multimodal and reinforcement models. The site is set up to make it incredibly easy to experiment with a demo, download a model, run it in a Jupyter notebook, fine-tune it for a specific task and then add it to the space of machine learning apps created by the community. For example, an earlier blog describes my FilmStars app.

Computer vision with text

This is an example from an app that uses the facebook/detr-resnet-50 model to identify objects in an image. It successfully located eight objects with high confidence (indicated by the numbers), but it was fooled into thinking part of the curved lamppost in front of the brickwork pattern was a tennis racket (you can see why).

Image-to-text models go further by creating captions describing what is in the image. I used an interactive demo to obtain suggested captions from a range of state-of-the-art models. The best result was produced by the GIT-large model, whereas a couple of models perceived a clocktower .

These models can also answer questions about images. Although all of the answers were reasonable, GIT-large produced the best response when I asked “Where is the cyclist?”

The next image is an example of text-based inpainting with CLIPSeg x Stable Diffusion, where I requested that wall should be replaced with an apartment block. The model successfully generated a new image while preserving the cyclist, flowers, arch, background and even the birds on the roof. I had great fun with this app, imagining what my friend’s house will look like, when it eventually emerges from a building site.

Continuing with the theme of image generation, I reversed the image to caption problem, by asking a stable-diffusion-v1-5 model to generate an image from the caption “a cyclist rides away through an old brick archway in a city”. It came up with an image remarkably similar to what we started with, even including a female cyclist.

Do it yourself

HuggingFace provides various ways for you to download any of the models from its library. The easiest way to do this is to set up a free account on kaggle, which offers a Jupyter notebook environment with access to a GPU.

Using a HuggingFace pipeline, you can run a model with three lines of Python code! Pipelines can be set up for the image models above, but this is an example of the code required to run a text-based natural language processing task. It creates and runs a pipeline that summarises text, using a model specifically trained to generate output in the style of SparkNotes.

from transformers import pipeline
summarizer = pipeline("summarization",model="pszemraj/long-t5-tglobal-base-16384-book-summary")
summarizer("""Sample text from a book...""")

This rather morbid sample text produced the output from Python that follows.

The fact that Henry Armstrong was buried did not seem to him to prove that he was dead: he had always been a hard man to convince. That he really was buried, the testimony of his senses compelled him to admit. His posture — flat upon his back, with his hands crossed upon his stomach and tied with something that he easily broke without profitably altering the situation — the strict confinement of his entire person, the black darkness and profound silence, made a body of evidence impossible to controvert and he accepted it without cavil.

But dead — no; he was only very, very ill. He had, withal, the invalid’s apathy and did not greatly concern himself about the uncommon fate that had been allotted to him. No philosopher was he — just a plain, commonplace person gifted, for the time being, with a pathological indifference: the organ that he feared consequences with was torpid. So, with no particular apprehension for his immediate future, he fell asleep and all was peace with Henry Armstrong.

But something was going on overhead. It was a dark summer night, shot through with infrequent shimmers of lightning silently firing a cloud lying low in the west and portending a storm. These brief, stammering illuminations brought out with ghastly distinctness the monuments and headstones of the cemetery and seemed to set them dancing. It was not a night in which any credible witness was likely to be straying about a cemetery, so the three men who were there, digging into the grave of Henry Armstrong, felt reasonably secure.

From One Summer Night by Ambrose Bierce
[{'summary_text': "Henry's body is buried in the cemetery, but it does not seem to make him any more certain that he is dead. Instead, he seems to be completely ill."}]

Having come this far, it takes only a few steps to fine tune the model to match your desired task, put it into a GitHub repository and launch your own app as a fully fledged member of the Hugging Face community. A nice explanation is available at fast.ai lesson 4.

Learning the language of the Giro

Fausto_Coppi

Computers are becoming ever better at natural language processing (NLP). I set up a neural network as a language model and loaded a set of pre-trained weights on my MacBook Pro, to see what it could do. After priming it with some text about the Giro d’Italia, this is what the model produced as a continuation.

Priming text: The Giro d’ Italia (Italian pronunciation: [ˈdʒiːro diˈtaːlja]; English: Tour of Italy; also known as the Giro) is an annual multiple-stage bicycle race primarily held in Italy, while also occasionally passing through nearby countries. The first race was organized in 1909 to increase sales of the newspaper La Gazzetta dello Sport; however it is currently run by RCS Sport. The race has been held annually since its first edition in 1909, except when it was stopped for the two world wars. As the Giro gained prominence and popularity the race was lengthened, and the peloton expanded from primarily Italian participation to riders from all over the world.

Computer generated continuation:  the race was won by the italian rider , giovanni di u_n , who won the race in the first leg of the race . the race was won by italian rider giovanni u_n , who won the race by a margin of two lengths .
= = = world tour = = =
the tour de france was the first of the tour de france . the tour de france was won by the reigning world champion , the reigning world champion , who had won the tour de france in the previous year ‘s race …

The output may not make a lot of sense, but the point is that it looks like English (in lower case). The grammar is reasonable, with commas, fullstops and a header inserted in  a logical way. Furthermore, the model has demonstrated some understanding of the context by suggesting that the Giro could be won by an Italian ride called Giovanni. The word “u_n” stands for unknown, which is consistent with the idea that an Italian surname may not be a familiar English word. It turns out that a certain Giovanni Di Santi raced against Fausto Coppi (pictured above) in the 1940 Giro, though he did not win the first stage. In addition to this, the model somehow knew that the Giro, in common with the Tour the France, is a World Tour event that could be won by the reigning world champion.

I found this totally amazing. And it was not a one off: further examples on random topics are included below. This neural network is just an architecture, defining a collection of matrix multiplications and transformations, along with a set of connection weights. Admittedly there are a lot of connection weights: 115.6 million of them, but they are just numbers. It was not explicitly provided with any rules about English grammar or any domain knowledge.

How could this possibly work?

In machine learning, language models are assessed on a simple metric: accuracy in predicting the next word of a sentence. The neural network approach has proved to be remarkably successful. Given enough data and a suitable architecture, deep learning now far outstrips traditional methods that relied on linguistic expertise to parse sentences and apply grammatical rules that differ across languages.

I was experimenting with an AWD-LSTM model originally created by Stephen Merity. This is a recurrent neural network (RNN) with three LSTM layers that include dropout. The pre-trained weights for the wt103 model were generated by Jeremy Howard of fast.ai, using a large corpus of text from Wikipedia.

Jeremy Howard converted the Wikipedia text into tokens. A tokeniser, such as spaCy,  breaks text into words and punctuation, resulting in a vocabulary of tokens that are indexed as integers. This allows blocks of text to be fed into the neural network as lists of numbers. The outputs are numbers that can be converted back into the predicted words.

The wt103 model includes a linear encoder that creates embeddings of word tokens. These are passed through three LSTM layers whose states are able to retain a memory of previous words or context. The result is passed through a decoder, employing the same weights as the encoder, to produce a softmax output that can be treated as a set of probabilities, across the vocabulary, to predict the next word token. Special forms of dropout were employed, as described in the paper, to limit overfitting and make the model more robust.

The network was trained by minimising cross-entropy loss using stochastic gradient descent. A technique called back propagation through time (BPTT) introduced some random variation into the length of the blocks of text employed in each training batch. The connection weights were iteratively improved towards optimal values.

OK, but so what?

Impressive as it may be for a neural network to spout highly plausible but nonsensical English, what possible use could this have? Well, suppose you want to automate the separation of texts into different categories, for example, distinguish different authors or determine the sentiment of tweets or predict the reaction of the stock market to news items or identify the sentiment of movie reviews. Fine-tuning a neural network that already has a good knowledge of English provides a huge advantage over starting from scratch. In lesson 10 of the fast.ai course, Jeremy Howard demonstrates how this approach can be used to achieve state of the art results on the widely used IMDB movie review data set.

Further examples

In the first example, the network finishes the sentence about a hurricane in perfect English.  Next, it comes up with the name of a critic (actually a rock music critic rather than a book reviewer, but heh!) and it makes use of its LSTM to open and close quotation marks. In the bean example, it amazingly starts describing flowers and correctly converts their dimensions from centimetres to inches!

My code can be found here.

Priming text: Hurricane Ivan was a large long Cape Verde hurricane that caused widespread damage in the Caribbean and United States The cyclone was

Computer generated continuation:  the first hurricane to strike the united states since hurricane katrina in 2005 .

Priming text: The book was widely acclaimed by the critics.

Computer generated continuation: the book was published in the united states by the american library association on june 1 , 2006 .

= = = critical response = = =

the book received mixed reviews from critics . the new york times reviewer robert christgau called it ” a great book ” and ” a great book ” . he praised the book ‘s ” strong , strong , and well – written ” prose , and the ” excellent ” prose .

Priming text: Most varieties of bean grow either as an erect bush or as a climbing plant, but a few important kinds are of intermediate form. Dwarf and semiclimbers are grown extensively. When the climbing type is grown for its immature pods, artificial supports are necessary to facilitate harvesting. Varieties differ greatly in size, shape, colour, and fibrousness or tenderness of the immature pods. In general, varieties grown for dry mature seeds produce pods that are too fibrous to be eaten at any state of development. Most edible-podded beans produce relatively low yields of mature seeds, or seeds that are of low eating quality. Seed colours range from white through green, yellow, tan, pink, red, brown, and purple to black in solid colours and countless contrasting patterns. Seed shapes range from nearly spherical to flattened, elongated, and kidney-shaped. Pods are of various shades of green, yellow, red, and purple and splashed with red or purple; pod shapes range from flat to round, smooth to irregular, and straight to sharply curved; length ranges from 75 to 200 millimetres (3 to 8 inches) or more.

Computer generated continuation:  the most common form of the flower is the flower spike , which is a characteristic of the flower spike . the flower is made up of a series of small , cylindrical , cylindrical , cylindrical , cylindrical stems that are up to 10 cm ( 3.9 in ) long and 2.5 cm ( 0.98 in ) wide . the flower spikes are arranged in a series of three or four pairs of flowers