Prompt Engineering
aghoshpro
Jun 29th, 2023 (edited)
Which of the following is true for generalized disentangled representation learning and specifically β-VAE? (A reference form of the objective is sketched after the options.)

i. Assuming a diagonal covariance matrix in the prior distribution over the latent representation encourages each latent dimension to model an individual underlying generative factor (axis-aligned).

ii. If we decompose the training objective of a VAE into an "expected likelihood" term and a "prior KL" term, we can make it more disentangling by giving the "prior KL" term more weight.

iii. If we decompose the training objective of a VAE into an "expected likelihood" term and a "prior KL" term and then increase the relative weight of the "prior KL" term, we might lose some reconstruction (and thus generative) quality of the VAE.

iv. Assuming a diagonal covariance matrix in the prior distribution over the latent representation encourages the latent dimensions to be uncorrelated.

v. Disentanglement of the latent representation improves interpretability of the latent space, since each dimension of the latent representation can be used/interpreted as a decoupled "knob" for the generative process.

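For reference, a common form of the decomposition referred to in the options above is the β-VAE objective (assuming a factorized Gaussian prior P(\mathbf{z}) and approximate posterior Q(\mathbf{z}|\mathbf{x})):

\mathcal{L}=\mathbb{E}_{Q(\mathbf{z}|\mathbf{x})}[\log P(\mathbf{x}|\mathbf{z})]-\beta\,\mathrm{KL}\big(Q(\mathbf{z}|\mathbf{x})\,\|\,P(\mathbf{z})\big),

where the first term is the "expected likelihood" term, the second is the "prior KL" term, and β > 1 increases the relative weight of the prior KL (β = 1 recovers the plain VAE ELBO).
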
------------------------------------------------------------------------------------------------------------------

Imagine we have a trained Variational Auto-Encoder (VAE). Which of the following is true for using the model for various types of inference? (A sketch of the estimators follows the options.)

a) We can use importance sampling for density estimation with a VAE, using the decoder (P(x|z)) with samples guided by the encoder (Q(z|x)).

b) We can use Monte-Carlo estimation for density estimation with a VAE using the decoder (P(x|z)).

c) For producing the coding of a sample, we normally do not need the decoder, but we do need the assumed prior P(z).

d) For generating a batch of realistic samples following the underlying P(x), we normally do not need the encoder, but we do need the assumed prior P(z).

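As a reminder of how options (a) and (b) would work in practice, a common importance-sampling estimate of the marginal likelihood (assuming K latent samples drawn from the encoder) is

P(\mathbf{x})\approx\frac{1}{K}\sum_{k=1}^{K}\frac{P(\mathbf{x}|\mathbf{z}_k)\,P(\mathbf{z}_k)}{Q(\mathbf{z}_k|\mathbf{x})},\quad \mathbf{z}_k\sim Q(\mathbf{z}|\mathbf{x}),

while the plain Monte-Carlo variant instead draws \mathbf{z}_k from the prior P(\mathbf{z}) and averages P(\mathbf{x}|\mathbf{z}_k).
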
------------------------------------------------------------------------------------------------------------------

Which of the following statements is true for Variational Auto-Encoders (VAEs)? (A sketch of the reparametrization trick follows the options.)

a) The training objective of a VAE is the sum of multiple variational lower bounds (ELBOs), each of which corresponds to an approximate posterior of the latent coding of a training sample.

b) For VAEs, we learn the generator parameters within the VI objectives of the latent code inference.

c) If we decompose an ELBO objective of a VAE into an "expected likelihood" term and a "prior KL" term, the encoder network is involved in both terms while the decoder network is only involved in the "expected likelihood" term.

d) We need the "reparametrization trick" in VAEs since we need to take a sample of the decoder during training, and such a sampling operation is not normally differentiable for backpropagation to work.

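For reference, a minimal sketch of the reparametrization trick mentioned in option (d), assuming a diagonal-Gaussian distribution over the latent code; the tensor names (mu, log_var) are illustrative only:

import torch

def reparameterize(mu, log_var):
    # Draw z ~ N(mu, diag(exp(log_var))) in a differentiable way: instead of
    # sampling z directly, sample parameter-free noise eps ~ N(0, I) and
    # shift/scale it, so gradients can flow back through mu and log_var.
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std

# illustrative usage with a 2-dimensional latent code
mu = torch.zeros(1, 2, requires_grad=True)
log_var = torch.zeros(1, 2, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()   # gradients reach mu and log_var
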
------------------------------------------------------------------------------------------------------------------

Imagine a 4-dimensional input space where each dimension can take 3 discrete values, \mathbf{x}\in\{0,1,2\}^4.

Imagine we do a factorization of the marginal as

P(\mathbf{x})=P(\mathbf{x}^{(4)})P(\mathbf{x}^{(3)}|\mathbf{x}^{(4)})P(\mathbf{x}^{(2)}|\mathbf{x}^{(4)},\mathbf{x}^{(3)})P(\mathbf{x}^{(1)}|\mathbf{x}^{(4)},\mathbf{x}^{(3)},\mathbf{x}^{(2)}).

How many parameters are required in total for an exact modeling of this factorization? (A counting sketch follows.)

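A short sketch of how one might count the parameters of the factorization above. The total depends on the convention: counting full conditional probability tables versus counting only free parameters (a distribution over 3 values has 2 free parameters); both are shown.

k = 3                               # values per dimension
full, free = 0, 0
for num_parents in range(4):        # the four factors condition on 0, 1, 2 and 3 variables
    contexts = k ** num_parents     # number of parent configurations
    full += contexts * k            # full table entries
    free += contexts * (k - 1)      # free parameters per configuration
print(full, free)                   # 120 table entries, 80 free parameters
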
------------------------------------------------------------------------------------------------------------------

Which of the following is true for latent variable generative models?

a) Auto-Encoders learn a latent space that can be useful for dimensionality reduction, but they are usually not considered truly generative models since there is no guarantee that the generation process follows P(x).

b) Variational Auto-Encoders are latent variable generative models.

c) Latent variable generative models can be a principled approach to data compression and to identifying the underlying generative factors of the data.

d) In order to have a latent-variable generative model, it can be enough to model P(z) and P(x|z) (it is not necessary to model P(z|x)).

e) It is usually assumed that natural data (the support of P(x)) lives on a much lower-dimensional region of the input space.

------------------------------------------------------------------------------------------------------------------

Which of the following can be considered a generative model as defined in the course?

i. A standard autoencoder trained on ImageNet with the reconstruction loss (\sum_{i=1}^{n}|\mathbf{x}_i-\hat{\mathbf{x}}_i|^2_2).

ii. A conditional model P(\mathbf{x}|y=c) trained only on the dog classes of ImageNet, which has 1000 classes.

iii. A model that is trained on ImageNet and can take an image as input and perfectly classify it (P(y|\mathbf{x})) into the 1000 classes of ImageNet.

iv. A model that is trained on ImageNet and can generate unseen realistic samples from the same distribution as that of ImageNet, but cannot do density estimation or latent coding.

v. A model that is trained on the whole of ImageNet but only generates samples from the animal classes of ImageNet.

------------------------------------------------------------------------------------------------------------------

Which of the following can be considered a disadvantage of a plain normalizing flow model?

i. One cannot encourage a decorrelated latent representation with a normalizing flow.

ii. The smoothness and invertibility requirements on the transformation functions restrict the expressiveness of the individual transformation functions that can be used.

iii. The requirement to preserve the sample dimensionality for all the latent representations defeats the purpose of the low-dimensional latent representation that is usually desirable for latent-variable generative models.

------------------------------------------------------------------------------------------------------------------

Which of the following is true for autoregressive generative models?

i. For autoregressive generative modeling of images, it is better to form the factors such that the image is generated progressively at increasing scales, with the higher scales conditioned on all pixels of the lower resolutions.

ii. A single recurrent network can be used for modelling all the conditionals of an autoregressive generative factorization P(\mathbf{x}^{(1)})\prod_{i=2}^{d}P(\mathbf{x}^{(i)}|\mathbf{x}^{(<i)}). At each iteration, the input to the recurrent network is the last dimension and the last hidden state (representing the dimensions before the previous one), while the corresponding output is the distribution over the next dimension.

iii. In autoregressive generative modeling, several orderings of the factorization are possible (by applying the chain rule to different orders of the input dimensions); however, such different orderings of the factorization usually give similar modelling results.

------------------------------------------------------------------------------------------------------------------

Which of the following is true for the fundamental idea behind normalizing flow generative models?

i. There exists a smooth invertible transformation \mathbb{R}^d\rightarrow\mathbb{R}^d that can take a single-modal input probability distribution and turn it into a multi-modal output probability distribution.

ii. The individual transformations of a normalizing-flow model need to keep the dimensionality fixed for the transformation function to be invertible (defined as bijectivity here).

iii. Depending on the sign of the elements of the Jacobian of the transformation (negative or positive), the mass is either contracted or expanded along different dimensions when going from the input distribution to the output distribution.

------------------------------------------------------------------------------------------------------------------

Consider a Variational Auto-Encoder (VAE), which consists of an encoder and a decoder, and assign each of the notions below to either of them, both, or neither.

i. Involves the amortization (i.e., replacing per-sample optimization with the network's inference) of the hardness of the VAE objective (i.e., individual variational inference per sample).

ii. Can be used as a generative model after training.

iii. The output of the network are the parameters of a parametric probability distribution \(P(\mathbf{x}|\mathbf{z})\).

iv. Can be used as a compression model after training.

v. Network parameters are learnt in a maximum likelihood (ML) fashion.

vi. Usually requires sampling for optimizing the objective on an input example during training.

vii. Network parameters are learnt by variational approximation of their posterior \(P(\mathbf{z}|\mathbf{x})\).

------------------------------------------------------------------------------------------------------------------

Imagine we have a trained normalizing flow generative model. Which of the following is true for various types of inference with such a trained model? (The relevant change-of-variables formula is sketched after the options.)

The density estimation process essentially amounts to: 1) applying the inverse transformations to a new input sample x to obtain the corresponding latent representation, and then 2) taking the likelihood of that latent representation under the starting distribution.

The generation process involves sampling from the starting distribution and then sequentially passing the sample through all the transformation functions.

For the coding process (latent representation) of plain normalizing flow models, a sampling step from the trained distribution is required.

For the generation process, since the transformation functions are bijective, the order of applying the transformation functions does not matter.

For the coding process (latent representation), all the intermediate representations can be considered a proper latent representation and thus used for coding (we do not necessarily have to use the starting random variable).

For the coding process (latent representation), one has to use the inverses of the transformation functions, starting with the inverse of the last transformation applied to the input.

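For reference, the density-estimation statement above rests on the change-of-variables formula. For an invertible transformation \mathbf{x}=f(\mathbf{z}) with base (starting) distribution P_Z, we have

P_X(\mathbf{x})=P_Z\big(f^{-1}(\mathbf{x})\big)\left|\det\frac{\partial f^{-1}(\mathbf{x})}{\partial\mathbf{x}}\right|,

i.e., apply the inverse transformations to \mathbf{x}, evaluate the base density, and account for the absolute Jacobian determinant of the inverse.
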
------------------------------------------------------------------------------------------------------------------

Which of the following is a correct statement about generative models?

i. Fully-observable generative models never have any hidden representation, neither explicitly learned ones obtained by optimizing P(\mathbf{x}|\mathbf{z}) nor implicitly learned deterministic ones.

ii. Models that can learn class-conditional distributions P(\mathbf{x}|y) can also be called generative models.

iii. Generative models always have some hidden (latent) representation of the input samples, which can be either an explicit or an implicit hidden representation.

iv. PCA can be considered a generative model.

v. A model that can assign likelihoods to samples, corresponding to P(\mathbf{x}), can be considered a generative model.

vi. A model that can generate some realistic samples can be called a generative model.

------------------------------------------------------------------------------------------------------------------

There are some key differences between CNNs and neural networks consisting of only fully connected layers (often called multilayer perceptrons, MLPs). Which of the following statements are true?

a. CNNs have a natural shift invariance which MLPs do not.

b. CNNs have more parameters than MLPs and can thus fit more complex data.

c. In a convolutional layer of a CNN, output values are functions of a subset of the input values.

d. In a fully connected layer of an MLP, all output values are functions of all input values.

Which of these data types are especially compatible with CNN applications?

a. Data where features can be shuffled around without changing the properties of the data

b. Images

c. Data where features have a spatial structure

d. Tabular data
------------------------------------------------------------------------------------------------------------------

Consider the following 4x4 image and a convolutional layer with the 2x2 weight matrix shown below, bias = 1, stride = 2 and no zero padding. The activation function of the layer is a ReLU function. What is the result after applying the convolutional layer to the 4x4 image, followed by a 2x2 max-pooling operation? (A worked sketch follows.)

image = [-3,  3, -4,  4;
          1,  1,  4, -4;
         -2,  2,  1,  2;
          3, -3,  2,  1]

weights = [-1,  1;
            1, -1]

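A minimal NumPy sketch of the computation above, assuming the 2x2 matrix shown is the convolution kernel (a max-pooling operation itself has no weights):

import numpy as np

image = np.array([[-3,  3, -4,  4],
                  [ 1,  1,  4, -4],
                  [-2,  2,  1,  2],
                  [ 3, -3,  2,  1]])
kernel = np.array([[-1,  1],
                   [ 1, -1]])
bias, stride = 1, 2

# 2x2 convolution with stride 2 and no padding, followed by ReLU
conv = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        patch = image[i * stride:i * stride + 2, j * stride:j * stride + 2]
        conv[i, j] = max(0.0, (patch * kernel).sum() + bias)

print(conv)        # [[ 7. 17.]
                   #  [11.  3.]]
print(conv.max())  # 2x2 max-pooling over the 2x2 map -> 17.0
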
------------------------------------------------------------------------------------------------------------------

Closely tied to convolution is the padding operation, which may be used to control the size of the feature map after a convolution. A padding of p adds p extra pixels on each side of the input feature maps before performing the convolution. There are many ways to pick these pixel values, most commonly setting them to 0 (so-called zero-padding).

Assume that the input feature maps are of size 100x100, and we are applying a convolution with a single filter. Furthermore, assume a stride of 1.

What will the size of the output feature map be in the following scenarios? (The standard size formula is sketched after the list.)

a. 1x1 filter, no padding = ?

b. 3x3 filter, no padding = ?

c. 1x1 filter, padding of 1 = ?

d. 3x3 filter, padding of 1 = ?

e. 3x3 filter, padding of 2 = ?

f. 5x5 filter, padding of 2 = ?

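A small sketch of the standard output-size formula, floor((W - F + 2P) / S) + 1 for input width W, filter size F, padding P and stride S, applied to the cases above:

def out_size(W, F, P, S=1):
    # output width of a convolution along one spatial dimension
    return (W - F + 2 * P) // S + 1

for F, P in [(1, 0), (3, 0), (1, 1), (3, 1), (3, 2), (5, 2)]:
    print(F, P, out_size(100, F, P))
# 1x1, p=0 -> 100; 3x3, p=0 -> 98; 1x1, p=1 -> 102;
# 3x3, p=1 -> 100; 3x3, p=2 -> 102; 5x5, p=2 -> 100
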
------------------------------------------------------------------------------------------------------------------

Consider the following image recognition problems:

Image classification

Semantic segmentation

Instance segmentation (the task of segmenting different objects in an image separately; for instance, two pedestrians next to each other would be segmented as different instances of the class ''pedestrian'', whereas in regular semantic segmentation both sets of pixels would simply be classified as ''pedestrian'')

Object localization

Object detection

Which of the following statements are true?

1. When instance segmentation is solved, object detection comes for free.

2. Semantic segmentation and image classification are typically handled using a cross-entropy loss.

3. Object detection amounts to drawing bounding boxes around all objects in the image.

4. Object detection is just the combination of classification and localization of the objects in a picture.

------------------------------------------------------------------------------------------------------------------

A challenge in object detection is that each image may contain a variable number of objects. Which of the following options are reasonable choices for dealing with this problem?

a. Let the network output a single detection for each image. The single detection is then upsampled to a variable number of detections by using a transposed convolution with a learnable kernel.

b. Use a CNN of variable depth, with each layer representing a separate object.

c. Let the network output a large fixed number of detections. For each detection, the network also predicts the probability of the detection being an object. Detections with a low probability of being an object are then discarded.

Suppose that you want to do upsampling of a feature map from dimension 2×2×C to 4×4×C with stride 2 and filter size 2, i.e. in such a way that:

- Each input pixel (feature vector) maps to 4 output pixels
- Each output pixel depends on only one input pixel

Which upsampling methods could potentially yield an output with 16 unique values? (A small sketch of the transposed-convolution case follows.)

a. Max unpooling

b. "Bed-of-nails" unpooling

c. Transposed convolutions

d. Nearest neighbor unpooling

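A small NumPy sketch of the stride-2, filter-size-2 upsampling described above for the transposed-convolution case: each input pixel is multiplied by a learnable 2x2 kernel and written to its own non-overlapping 2x2 output block, so all 16 output values can be distinct (nearest-neighbor unpooling only repeats the 4 input values, and bed-of-nails/max unpooling fill most positions with zeros). The input and kernel values below are purely illustrative.

import numpy as np

x = np.array([[1., 10.],
              [100., 1000.]])      # 2x2 input (single channel for simplicity)
kernel = np.array([[1., 2.],
                   [3., 4.]])      # a hypothetical learned 2x2 kernel

out = np.zeros((4, 4))
for i in range(2):
    for j in range(2):
        # stride 2: each input pixel fills its own 2x2 block of the output
        out[2 * i:2 * i + 2, 2 * j:2 * j + 2] = x[i, j] * kernel

print(out)                         # 16 distinct values for this input/kernel pair
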
------------------------------------------------------------------------------------------------------------------

Consider a convolutional layer with:
- Input size: 28x28x64
- Output size: 14x14x32
- Filter size: 5x5

Answer the following questions with the right numerical value (you don't need to consider biases in this exercise).

a. Given one filter, how many times do we apply the sliding window over the input?
Hint: We have not specified the size of the padding and the stride, but you can figure this out from the output dimensions.

b. How many multiplications are performed every time we apply the filter to a 5x5 patch of the input?
Hint: Recall that each filter takes all input channels as input.

c. How many multiplications in total are performed in this layer?
Hint: Note that the output dimension is 14x14x32, which means that we have 32 output channels and thus 32 filters.

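A small sketch of the counting involved in the layer above (the 28x28 -> 14x14 spatial reduction implies 14*14 sliding-window positions per filter):

H_out, W_out = 14, 14      # spatial output size
C_in, C_out = 64, 32       # input / output channels (32 filters)
F = 5                      # filter size

positions = H_out * W_out                        # a) window applications per filter
mults_per_position = F * F * C_in                # b) multiplications per 5x5 patch, over all input channels
total = positions * mults_per_position * C_out   # c) all positions, all filters

print(positions, mults_per_position, total)      # 196, 1600, 10035200
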
------------------------------------------------------------------------------------------------------------------

Which of the following is not a proper scoring rule?

a. Negative Log Likelihood

b. All choices are proper scoring rules.

c. Mean Squared Error

d. Mean Absolute Error

------------------------------------------------------------------------------------------------------------------

Which of the following can be a source of uncertainty in our predictions?

a. Limited number of observations for training

b. Training for too long

c. Changes in the distribution of data from training to test

d. Noise in the training labels

e. Noise in the input data

f. Training for a limited amount of time

g. Too large a training dataset

h. Limited model capacity

------------------------------------------------------------------------------------------------------------------

Which of the following are approximation techniques for Bayesian modeling?

a. Variational Inference

b. Expectation Propagation

c. MCMC

Which of the following can be used for uncertainty estimation with deep networks?

a. Training multiple networks with different initializations and using the collection of their outputs to reason about the model's uncertainty

b. Assuming simple posterior distributions for a deep network's weights and estimating the parameters of those distributions using variational inference

c. Taking stochastic regularizations that are embedded in some pretrained networks and interpreting them as a form of approximate inference

Suppose you have a mini-batch size of 64 and that the size of the training set is m = 64000.

If mini-batch gradient descent requires 10 times as many iterations to converge (compared to batch gradient descent), should you still use it? That is, which algorithm do you think runs faster? (A back-of-the-envelope comparison follows.)

a. Batch gradient descent is probably faster.

b. Mini-batch gradient descent is probably faster.

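A quick back-of-the-envelope comparison, assuming the cost per iteration is roughly proportional to the number of examples processed:

m, batch_size = 64_000, 64

cost_batch = m               # examples processed per full-batch iteration
cost_minibatch = batch_size  # examples processed per mini-batch iteration

# Even if mini-batch gradient descent needs 10x as many iterations,
# the total amount of gradient computation is still far smaller:
print(cost_batch / (10 * cost_minibatch))   # 100.0 -> roughly 100x less work for mini-batch
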
------------------------------------------------------------------------------------------------------------------

Let's say I have the following table in PostgreSQL:

MyTable
ID, Name, Value
1, Yo, 100
1, Yo, 200
1, Yo, 300

I want to turn this 'MyTable' into something like this:

MyTable
ID, Name, Value
1, Yo1, 100
2, Yo2, 200
3, Yo3, 300

How can I do this?

------------------------------------------------------------------------------------------------------------------

I have a pandas DataFrame like the following:

MyTable
ID, Name
1, Yoooo
2, Yo
3, Yoo

I want something like the following table, where the length of each 'Name' is stored in a new column named 'len_string':

ID, Name, len_string
1, Yoooo, 5
2, Yo, 2
3, Yoo, 3

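A minimal pandas sketch of the transformation described above (frame and column names follow the example):

import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Yoooo", "Yo", "Yoo"]})

# length of each 'Name' as a new column 'len_string'
df["len_string"] = df["Name"].str.len()

print(df)
#    ID   Name  len_string
# 0   1  Yoooo           5
# 1   2     Yo           2
# 2   3    Yoo           3
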
"2023-01-01T00:00:00+00:00"

"2023-10-31T00:00:00+00:00"

Tags: chatGPT