Prompt Engineering
aghoshpro
Jun 29th, 2023 (edited)
Which of the following is true for generalized disentangled representation learning and specifically β-VAE? (A reference form of the objective is sketched after the options.)

i. Assuming a diagonal covariance matrix in the prior distribution over the latent representation encourages each latent dimension to model an individual underlying generative factor (axis-aligned).

ii. If we decompose the training objective of a VAE into an "expected likelihood" term and a "prior KL" term, we can make it more disentangling by giving the "prior KL" term more weight.

iii. If we decompose the training objective of a VAE into an "expected likelihood" term and a "prior KL" term and then increase the relative weight of the "prior KL" term, we might lose some reconstruction (and thus generative) quality of the VAE.

iv. Assuming a diagonal covariance matrix in the prior distribution over the latent representation encourages the latent dimensions to be uncorrelated.

v. Disentanglement of the latent representation improves interpretability of the latent space, since each dimension of the latent representation can be used/interpreted as a decoupled "knob" for the generative process.

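For reference, a common form of the decomposition referred to in the options above is the β-VAE objective (assuming a factorized Gaussian prior P(\mathbf{z}) and approximate posterior Q(\mathbf{z}|\mathbf{x})):

\mathcal{L}=\mathbb{E}_{Q(\mathbf{z}|\mathbf{x})}[\log P(\mathbf{x}|\mathbf{z})]-\beta\,\mathrm{KL}\big(Q(\mathbf{z}|\mathbf{x})\,\|\,P(\mathbf{z})\big),

where the first term is the "expected likelihood" term, the second is the "prior KL" term, and β > 1 increases the relative weight of the prior KL (β = 1 recovers the plain VAE ELBO).
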
------------------------------------------------------------------------------------------------------------------

Imagine we have a trained Variational Auto-Encoder (VAE). Which of the following is true for using the model for various types of inference? (A sketch of the estimators follows the options.)

a) We can use importance sampling for density estimation with a VAE, using the decoder (P(x|z)) with samples guided by the encoder (Q(z|x)).

b) We can use Monte-Carlo estimation for density estimation with a VAE using the decoder (P(x|z)).

c) For producing the coding of a sample, we normally do not need the decoder, but we do need the assumed prior P(z).

d) For generating a batch of realistic samples following the underlying P(x), we normally do not need the encoder, but we do need the assumed prior P(z).

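As a reminder of how options (a) and (b) would work in practice, a common importance-sampling estimate of the marginal likelihood (assuming K latent samples drawn from the encoder) is

P(\mathbf{x})\approx\frac{1}{K}\sum_{k=1}^{K}\frac{P(\mathbf{x}|\mathbf{z}_k)\,P(\mathbf{z}_k)}{Q(\mathbf{z}_k|\mathbf{x})},\quad \mathbf{z}_k\sim Q(\mathbf{z}|\mathbf{x}),

while the plain Monte-Carlo variant instead draws \mathbf{z}_k from the prior P(\mathbf{z}) and averages P(\mathbf{x}|\mathbf{z}_k).
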
------------------------------------------------------------------------------------------------------------------

Which of the following statements is true for Variational Auto-Encoders (VAEs)? (A sketch of the reparametrization trick follows the options.)

a) The training objective of a VAE is the sum of multiple variational lower bounds (ELBOs), each of which corresponds to an approximate posterior of the latent coding of a training sample.

b) For VAEs, we learn the generator parameters within the VI objectives of the latent code inference.

c) If we decompose an ELBO objective of a VAE into an "expected likelihood" term and a "prior KL" term, the encoder network is involved in both terms while the decoder network is only involved in the "expected likelihood" term.

d) We need the "reparametrization trick" in VAEs since we need to take a sample of the decoder during training, and such a sampling operation is not normally differentiable for backpropagation to work.

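For reference, a minimal sketch of the reparametrization trick mentioned in option (d), assuming a diagonal-Gaussian distribution over the latent code; the tensor names (mu, log_var) are illustrative only:

import torch

def reparameterize(mu, log_var):
    # Draw z ~ N(mu, diag(exp(log_var))) in a differentiable way: instead of
    # sampling z directly, sample parameter-free noise eps ~ N(0, I) and
    # shift/scale it, so gradients can flow back through mu and log_var.
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std

# illustrative usage with a 2-dimensional latent code
mu = torch.zeros(1, 2, requires_grad=True)
log_var = torch.zeros(1, 2, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()   # gradients reach mu and log_var
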
------------------------------------------------------------------------------------------------------------------

Imagine a 4-dimensional input space where each dimension can take 3 discrete values, \mathbf{x}\in\{0,1,2\}^4.

Imagine we do a factorization of the marginal as

P(\mathbf{x})=P(\mathbf{x}^{(4)})P(\mathbf{x}^{(3)}|\mathbf{x}^{(4)})P(\mathbf{x}^{(2)}|\mathbf{x}^{(4)},\mathbf{x}^{(3)})P(\mathbf{x}^{(1)}|\mathbf{x}^{(4)},\mathbf{x}^{(3)},\mathbf{x}^{(2)}).

How many parameters are required in total for an exact modeling of this factorization? (A counting sketch follows.)

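A short sketch of how one might count the parameters of the factorization above. The total depends on the convention: counting full conditional probability tables versus counting only free parameters (a distribution over 3 values has 2 free parameters); both are shown.

k = 3                               # values per dimension
full, free = 0, 0
for num_parents in range(4):        # the four factors condition on 0, 1, 2 and 3 variables
    contexts = k ** num_parents     # number of parent configurations
    full += contexts * k            # full table entries
    free += contexts * (k - 1)      # free parameters per configuration
print(full, free)                   # 120 table entries, 80 free parameters
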
------------------------------------------------------------------------------------------------------------------

Which of the following is true for latent variable generative models?

a) Auto-Encoders learn a latent space that can be useful for dimensionality reduction, but they are usually not considered truly generative models since there is no guarantee that the generation process follows P(x).

b) Variational Auto-Encoders are latent variable generative models.

c) Latent variable generative models can be a principled approach to data compression and to identifying the underlying generative factors of the data.

d) In order to have a latent-variable generative model, it can be enough to model P(z) and P(x|z) (it is not necessary to model P(z|x)).

e) It is usually assumed that natural data (the support of P(x)) lives on a much lower-dimensional region of the input space.

------------------------------------------------------------------------------------------------------------------

Which of the following can be considered a generative model as defined in the course?

i. A standard autoencoder trained on ImageNet with the reconstruction loss (\sum_{i=1}^{n}|\mathbf{x}_i-\hat{\mathbf{x}}_i|^2_2).

ii. A conditional model P(\mathbf{x}|y=c) trained only on the dog classes of ImageNet, which has 1000 classes.

iii. A model that is trained on ImageNet and can take an image as input and perfectly classify it (P(y|\mathbf{x})) into the 1000 classes of ImageNet.

iv. A model that is trained on ImageNet and can generate unseen realistic samples from the same distribution as that of ImageNet, but cannot do density estimation or latent coding.

v. A model that is trained on the whole of ImageNet but only generates samples from the animal classes of ImageNet.

------------------------------------------------------------------------------------------------------------------

Which of the following can be considered a disadvantage of a plain normalizing flow model?

i. One cannot encourage a decorrelated latent representation with a normalizing flow.

ii. The smoothness and invertibility requirements on the transformation functions restrict the expressiveness of the individual transformation functions that can be used.

iii. The requirement to preserve the sample dimensionality for all the latent representations defeats the purpose of the low-dimensional latent representation that is usually desirable for latent-variable generative models.

------------------------------------------------------------------------------------------------------------------

Which of the following is true for autoregressive generative models?

i. For autoregressive generative modeling of images, it is better to form the factors such that the image is generated progressively at increasing scales, with the higher scales conditioned on all pixels of the lower resolutions.

ii. A single recurrent network can be used for modelling all the conditionals of an autoregressive generative factorization P(\mathbf{x}^{(1)})\prod_{i=2}^{d}P(\mathbf{x}^{(i)}|\mathbf{x}^{(<i)}). At each iteration, the input to the recurrent network is the last dimension and the last hidden state (representing the dimensions before the previous one), while the corresponding output is the distribution over the next dimension.

iii. In autoregressive generative modeling, several orderings of the factorization are possible (by applying the chain rule to different orders of the input dimensions); however, such different orderings of the factorization usually give similar modelling results.

------------------------------------------------------------------------------------------------------------------

Which of the following is true for the fundamental idea behind normalizing flow generative models?

i. There exists a smooth invertible transformation \mathbb{R}^d\rightarrow\mathbb{R}^d that can take a single-modal input probability distribution and turn it into a multi-modal output probability distribution.

ii. The individual transformations of a normalizing-flow model need to keep the dimensionality fixed for the transformation function to be invertible (defined as bijectivity here).

iii. Depending on the sign of the elements of the Jacobian of the transformation (negative or positive), the mass is either contracted or expanded along different dimensions when going from the input distribution to the output distribution.

------------------------------------------------------------------------------------------------------------------

Consider a Variational Auto-Encoder (VAE), which consists of an encoder and a decoder, and assign each of the notions below to either of them, both, or neither.

i. Involves the amortization (i.e., replacing per-sample optimization with the network's inference) of the hardness of the VAE objective (i.e., individual variational inference per sample).

ii. Can be used as a generative model after training.

iii. The output of the network are the parameters of a parametric probability distribution \(P(\mathbf{x}|\mathbf{z})\).

iv. Can be used as a compression model after training.

v. Network parameters are learnt in a maximum likelihood (ML) fashion.

vi. Usually requires sampling for optimizing the objective on an input example during training.

vii. Network parameters are learnt by variational approximation of their posterior \(P(\mathbf{z}|\mathbf{x})\).

------------------------------------------------------------------------------------------------------------------

Imagine we have a trained normalizing flow generative model. Which of the following is true for various types of inference with such a trained model? (The relevant change-of-variables formula is sketched after the options.)

The density estimation process essentially amounts to: 1) applying the inverse transformations to a new input sample x to obtain the corresponding latent representation, and then 2) taking the likelihood of that latent representation under the starting distribution.

The generation process involves sampling from the starting distribution and then sequentially passing the sample through all the transformation functions.

For the coding process (latent representation) of plain normalizing flow models, a sampling step from the trained distribution is required.

For the generation process, since the transformation functions are bijective, the order of applying the transformation functions does not matter.

For the coding process (latent representation), all the intermediate representations can be considered a proper latent representation and thus used for coding (we do not necessarily have to use the starting random variable).

For the coding process (latent representation), one has to use the inverses of the transformation functions, starting with the inverse of the last transformation applied to the input.

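For reference, the density-estimation statement above rests on the change-of-variables formula. For an invertible transformation \mathbf{x}=f(\mathbf{z}) with base (starting) distribution P_Z, we have

P_X(\mathbf{x})=P_Z\big(f^{-1}(\mathbf{x})\big)\left|\det\frac{\partial f^{-1}(\mathbf{x})}{\partial\mathbf{x}}\right|,

i.e., apply the inverse transformations to \mathbf{x}, evaluate the base density, and account for the absolute Jacobian determinant of the inverse.
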
------------------------------------------------------------------------------------------------------------------

Which of the following is a correct statement about generative models?

i. Fully-observable generative models never have any hidden representation, neither explicitly learned ones obtained by optimizing P(\mathbf{x}|\mathbf{z}) nor implicitly learned deterministic ones.

ii. Models that can learn class-conditional distributions P(\mathbf{x}|y) can also be called generative models.

iii. Generative models always have some hidden (latent) representation of the input samples, which can be either an explicit or an implicit hidden representation.

iv. PCA can be considered a generative model.

v. A model that can assign likelihoods to samples, corresponding to P(\mathbf{x}), can be considered a generative model.

vi. A model that can generate some realistic samples can be called a generative model.

------------------------------------------------------------------------------------------------------------------

There are some key differences between CNNs and neural networks consisting of only fully connected layers (often called multilayer perceptrons, MLPs). Which of the following statements are true?

a. CNNs have a natural shift invariance which MLPs do not.

b. CNNs have more parameters than MLPs and can thus fit more complex data.

c. In a convolutional layer of a CNN, output values are functions of a subset of the input values.

d. In a fully connected layer of an MLP, all output values are functions of all input values.

Which of these data types are especially compatible with CNN applications?

a. Data where features can be shuffled around without changing the properties of the data

b. Images

c. Data where features have a spatial structure

d. Tabular data
------------------------------------------------------------------------------------------------------------------

Consider the following 4x4 image and a convolutional layer with the 2x2 weight matrix shown below, bias = 1, stride = 2 and no zero padding. The activation function of the layer is a ReLU function. What is the result after applying the convolutional layer to the 4x4 image, followed by a 2x2 max-pooling operation? (A worked sketch follows.)

image = [-3,  3, -4,  4;
          1,  1,  4, -4;
         -2,  2,  1,  2;
          3, -3,  2,  1]

weights = [-1,  1;
            1, -1]

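A minimal NumPy sketch of the computation above, assuming the 2x2 matrix shown is the convolution kernel (a max-pooling operation itself has no weights):

import numpy as np

image = np.array([[-3,  3, -4,  4],
                  [ 1,  1,  4, -4],
                  [-2,  2,  1,  2],
                  [ 3, -3,  2,  1]])
kernel = np.array([[-1,  1],
                   [ 1, -1]])
bias, stride = 1, 2

# 2x2 convolution with stride 2 and no padding, followed by ReLU
conv = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        patch = image[i * stride:i * stride + 2, j * stride:j * stride + 2]
        conv[i, j] = max(0.0, (patch * kernel).sum() + bias)

print(conv)        # [[ 7. 17.]
                   #  [11.  3.]]
print(conv.max())  # 2x2 max-pooling over the 2x2 map -> 17.0
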
------------------------------------------------------------------------------------------------------------------

Closely tied to convolution is the padding operation, which may be used to control the size of the feature map after a convolution. A padding of p adds p extra pixels on each side of the input feature maps before performing the convolution. There are many ways to pick these pixel values, most commonly setting them to 0 (so-called zero-padding).

Assume that the input feature maps are of size 100x100, and we are applying a convolution with a single filter. Furthermore, assume a stride of 1.

What will the size of the output feature map be in the following scenarios? (The standard size formula is sketched after the list.)

a. 1x1 filter, no padding = ?

b. 3x3 filter, no padding = ?

c. 1x1 filter, padding of 1 = ?

d. 3x3 filter, padding of 1 = ?

e. 3x3 filter, padding of 2 = ?

f. 5x5 filter, padding of 2 = ?

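A small sketch of the standard output-size formula, floor((W - F + 2P) / S) + 1 for input width W, filter size F, padding P and stride S, applied to the cases above:

def out_size(W, F, P, S=1):
    # output width of a convolution along one spatial dimension
    return (W - F + 2 * P) // S + 1

for F, P in [(1, 0), (3, 0), (1, 1), (3, 1), (3, 2), (5, 2)]:
    print(F, P, out_size(100, F, P))
# 1x1, p=0 -> 100; 3x3, p=0 -> 98; 1x1, p=1 -> 102;
# 3x3, p=1 -> 100; 3x3, p=2 -> 102; 5x5, p=2 -> 100
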
------------------------------------------------------------------------------------------------------------------

Consider the following image recognition problems:

Image classification

Semantic segmentation

Instance segmentation (the task of segmenting different objects in an image separately; for instance, two pedestrians next to each other would be segmented as different instances of the class ''pedestrian'', whereas in regular semantic segmentation both sets of pixels would simply be classified as ''pedestrian'')

Object localization

Object detection

Which of the following statements are true?

1. When instance segmentation is solved, object detection comes for free.

2. Semantic segmentation and image classification are typically handled using a cross-entropy loss.

3. Object detection amounts to drawing bounding boxes around all objects in the image.

4. Object detection is just the combination of classification and localization of the objects in a picture.

------------------------------------------------------------------------------------------------------------------

A challenge in object detection is that each image may contain a variable number of objects. Which of the following options are reasonable choices for dealing with this problem?

a. Let the network output a single detection for each image. The single detection is then upsampled to a variable number of detections by using a transposed convolution with a learnable kernel.

b. Use a CNN of variable depth, with each layer representing a separate object.

c. Let the network output a large fixed number of detections. For each detection, the network also predicts the probability of the detection being an object. Detections with a low probability of being an object are then discarded.

Suppose that you want to do upsampling of a feature map from dimension 2×2×C to 4×4×C with stride 2 and filter size 2, i.e. in such a way that:

- Each input pixel (feature vector) maps to 4 output pixels
- Each output pixel depends on only one input pixel

Which upsampling methods could potentially yield an output with 16 unique values? (A small sketch of the transposed-convolution case follows.)

a. Max unpooling

b. "Bed-of-nails" unpooling

c. Transposed convolutions

d. Nearest neighbor unpooling

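A small NumPy sketch of the stride-2, filter-size-2 upsampling described above for the transposed-convolution case: each input pixel is multiplied by a learnable 2x2 kernel and written to its own non-overlapping 2x2 output block, so all 16 output values can be distinct (nearest-neighbor unpooling only repeats the 4 input values, and bed-of-nails/max unpooling fill most positions with zeros). The input and kernel values below are purely illustrative.

import numpy as np

x = np.array([[1., 10.],
              [100., 1000.]])      # 2x2 input (single channel for simplicity)
kernel = np.array([[1., 2.],
                   [3., 4.]])      # a hypothetical learned 2x2 kernel

out = np.zeros((4, 4))
for i in range(2):
    for j in range(2):
        # stride 2: each input pixel fills its own 2x2 block of the output
        out[2 * i:2 * i + 2, 2 * j:2 * j + 2] = x[i, j] * kernel

print(out)                         # 16 distinct values for this input/kernel pair
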
------------------------------------------------------------------------------------------------------------------

Consider a convolutional layer with:
- Input size: 28x28x64
- Output size: 14x14x32
- Filter size: 5x5

Answer the following questions with the right numerical value (you don't need to consider biases in this exercise).

a. Given one filter, how many times do we apply the sliding window over the input?
Hint: We have not specified the size of the padding and the stride, but you can figure this out from the output dimensions.

b. How many multiplications are performed every time we apply the filter to a 5x5 patch of the input?
Hint: Recall that each filter takes all input channels as input.

c. How many multiplications in total are performed in this layer?
Hint: Note that the output dimension is 14x14x32, which means that we have 32 output channels and thus 32 filters.

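A small sketch of the counting involved in the layer above (the 28x28 -> 14x14 spatial reduction implies 14*14 sliding-window positions per filter):

H_out, W_out = 14, 14      # spatial output size
C_in, C_out = 64, 32       # input / output channels (32 filters)
F = 5                      # filter size

positions = H_out * W_out                        # a) window applications per filter
mults_per_position = F * F * C_in                # b) multiplications per 5x5 patch, over all input channels
total = positions * mults_per_position * C_out   # c) all positions, all filters

print(positions, mults_per_position, total)      # 196, 1600, 10035200
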
------------------------------------------------------------------------------------------------------------------

Which of the following is not a proper scoring rule?

a. Negative Log Likelihood

b. All choices are proper scoring rules.

c. Mean Squared Error

d. Mean Absolute Error

------------------------------------------------------------------------------------------------------------------

Which of the following can be a source of uncertainty in our predictions?

a. Limited number of observations for training

b. Training for too long

c. Changes in the distribution of data from training to test

d. Noise in the training labels

e. Noise in the input data

f. Training for a limited amount of time

g. Too large a training dataset

h. Limited model capacity

------------------------------------------------------------------------------------------------------------------

Which of the following are approximation techniques for Bayesian modeling?

a. Variational Inference

b. Expectation Propagation

c. MCMC

Which of the following can be used for uncertainty estimation with deep networks?

a. Training multiple networks with different initializations and using the collection of their outputs to reason about the model's uncertainty

b. Assuming simple posterior distributions for a deep network's weights and estimating the parameters of those distributions using variational inference

c. Taking stochastic regularizations that are embedded in some pretrained networks and interpreting them as a form of approximate inference

Suppose you have a mini-batch size of 64 and that the size of the training set is m = 64000.

If mini-batch gradient descent requires 10 times as many iterations to converge (compared to batch gradient descent), should you still use it? That is, which algorithm do you think runs faster? (A back-of-the-envelope comparison follows.)

a. Batch gradient descent is probably faster.

b. Mini-batch gradient descent is probably faster.

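A quick back-of-the-envelope comparison, assuming the cost per iteration is roughly proportional to the number of examples processed:

m, batch_size = 64_000, 64

cost_batch = m               # examples processed per full-batch iteration
cost_minibatch = batch_size  # examples processed per mini-batch iteration

# Even if mini-batch gradient descent needs 10x as many iterations,
# the total amount of gradient computation is still far smaller:
print(cost_batch / (10 * cost_minibatch))   # 100.0 -> roughly 100x less work for mini-batch
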
------------------------------------------------------------------------------------------------------------------

Let's say I have the following table in PostgreSQL:

MyTable
ID, Name, Value
1, Yo, 100
1, Yo, 200
1, Yo, 300

I want to turn this 'MyTable' into something like this:

MyTable
ID, Name, Value
1, Yo1, 100
2, Yo2, 200
3, Yo3, 300

How can I do this?

------------------------------------------------------------------------------------------------------------------

I have a pandas DataFrame like the following:

MyTable
ID, Name
1, Yoooo
2, Yo
3, Yoo

I want something like the following table, where the length of each 'Name' is stored in a new column named 'len_string':

ID, Name, len_string
1, Yoooo, 5
2, Yo, 2
3, Yoo, 3

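A minimal pandas sketch of the transformation described above (frame and column names follow the example):

import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Yoooo", "Yo", "Yoo"]})

# length of each 'Name' as a new column 'len_string'
df["len_string"] = df["Name"].str.len()

print(df)
#    ID   Name  len_string
# 0   1  Yoooo           5
# 1   2     Yo           2
# 2   3    Yoo           3
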
"2023-01-01T00:00:00+00:00"

"2023-10-31T00:00:00+00:00"

Tags: chatGPT