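The backbone is the network that takes the image as input and extracts the feature maps on which the rest of the model is built (the output of your backbone is the first block of your figure). The head is what sits on top of those feature maps: the RoI pooling stage together with the layers that follow it (the FCN in your figure), which turn the pooled features into the final predictions.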
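To make the split concrete, here is a minimal pure-Python sketch. It is a toy, not a real detector: `backbone`, `roi_pool` and `head` are hypothetical stand-ins (2x2 average pooling instead of a CNN, a single weighted sum instead of the prediction layers), and the image, RoI and weights are made up for illustration.

```python
# Toy illustration only: hypothetical stand-ins, not a real detection model.

def backbone(image):
    """Image (2D list) -> feature map, using 2x2 average pooling in place of a CNN."""
    h, w = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def roi_pool(features, roi, out_size=2):
    """Max-pool the region roi = (top, left, bottom, right) to an out_size x out_size grid.

    Assumes the RoI spans at least out_size cells in each dimension.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for i in range(out_size):
        r0 = top + (i * height) // out_size
        r1 = top + ((i + 1) * height) // out_size
        row = []
        for j in range(out_size):
            c0 = left + (j * width) // out_size
            c1 = left + ((j + 1) * width) // out_size
            row.append(max(features[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


def head(pooled, weights, bias=0.0):
    """Fixed-size pooled features -> one score, standing in for the prediction layers."""
    flat = [v for row in pooled for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
features = backbone(image)                  # 8x8 image -> 4x4 feature map
pooled = roi_pool(features, (0, 0, 4, 4))   # any RoI -> fixed 2x2 grid
score = head(pooled, [0.25, 0.25, 0.25, 0.25])
```

The point of the sketch is the data flow: the backbone runs once per image, while RoI pooling and the head run once per region, and the pooling step is what gives the head a fixed-size input regardless of the region's shape.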