Recent comments in /f/MachineLearning

BenXavier OP t1_j4zd9jv wrote

Hey guys, thank you for the great responses!

  • "Accurate Intelligible Models with Pairwise Interactions" seems great, as far as I can understand it. That's what I was referring to with "adding structure to models". It's crazy how thin the reference section is: lots of exciting work to do!
    • Please do correct me if I'm being too naive, but are there other approaches for "building sub-models" at the splitting point?
  • u/mickman_10, u/TheFlyingDrildo, are you also aware of any connection with Symbolic Regression or Association Rule extraction?
2

dataslacker t1_j4z8zm4 wrote

Yes, your explanations are clear and match how I understood the paper, but I feel like some motivation for the RL training is missing. Why not "pseudo labeling"? Why is the RL approach better? Also, the reward score is non-differentiable because it was designed that way, but they could have designed it to be differentiable. For example, instead of decoding the log probs, why not train the reward model on them directly? That you can still obtain the labels by decoding doesn't mean the decoded text has to be the input to the reward model. There are a number of design choices the authors made that are not motivated in the paper. I haven't read the references, so maybe they are motivated elsewhere in the literature, but RL seems like a strange choice for this problem since there isn't a dynamic environment that the agent is interacting with.

3

Agitated-Purpose-171 t1_j4z7iz5 wrote

Hi everybody, I have a question about VLAD that came up while reading this CVPR paper (Aggregating local descriptors into a compact image representation).

My question is why VLAD works.

Link to the paper:

https://lear.inrialpes.fr/pubs/2010/JDSP10/jegou_compactimagerepresentation.pdf

This paper introduces VLAD, which turns the local features (N*D dimensions) into a global feature (k*D dimensions).

Below is my understanding of the operations of VLAD, step by step.

=> input: N*D dimension local feature.

(i) use k-means to find the k clusters and the central feature for each cluster.

(ii) for each cluster find a residual sum.

V = summation of ( each local feature in the cluster minus the central feature).

V = sum (Xi - C)

V: residual sum of the cluster

X: local feature in the cluster

C: Central feature of the cluster

(iii) concatenate the residual sum then get the global feature.

global feature = [V1,V2,....Vk]

(V1 is the residual sum of cluster 1, V2 is the residual sum of cluster 2... and so on.)

=> output: k*D dimension global feature.

My question is why the residual sum of each cluster is "not" zero.

After all, the central feature of each cluster found by k-means is the average of the local features in that cluster.

The central feature of cluster 1 = average of the local feature in cluster 1.

C1 = (X1 + X2 + X3 + ...+ Xm) / m

The residual sum of cluster 1 = (X1-C1) + (X2-C1) + (X3-C1) + ... + (Xm-C1) = V1

Based on the above equation, I think the residual sum of each cluster is zero. So the global feature will be a zero matrix = [V1, V2,..., Vk] = [zero vector, zero vector, ..., zero vector].

The only reason that comes to mind is that k-means has not run for enough iterations, so the central feature of each cluster is not equal to the average of the local features in the cluster. Am I right?

Could anybody let me know why the residual sum is not a zero vector? Thanks a lot.
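The steps above can be checked numerically. Below is a minimal NumPy sketch, not the paper's implementation: the helper names (`assign`, `kmeans`, `vlad`), the random data and the sizes are all assumptions made for illustration. It compares two codebooks: one fitted on the image's own descriptors, which is the setup the zero-sum argument assumes, and one fitted beforehand on a separate training set, which is how I read the paper (the k-means codebook is learned offline and then reused for every image).

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, iters=1000, seed=0):
    """Plain Lloyd's algorithm (illustrative helper); returns the centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = assign(X, centroids)
    for _ in range(iters):
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        new_labels = assign(X, centroids)
        if np.array_equal(new_labels, labels):
            break  # converged: each centroid is the mean of its members
        labels = new_labels
    return centroids

def vlad(X, centroids):
    """Concatenate the per-cluster residual sums V_j = sum(X_i - C_j)."""
    labels = assign(X, centroids)
    k, D = centroids.shape
    V = np.zeros((k, D))
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            V[j] = (members - centroids[j]).sum(axis=0)
    return V.ravel()  # k*D-dimensional global feature

rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 8))  # stand-in for descriptors from a training set
image = rng.normal(size=(50, 8))    # stand-in for one image's N*D descriptors

# Codebook fitted on the same descriptors: residual sums cancel out.
v_same = vlad(image, kmeans(image, k=4))
# Codebook fitted on a separate training set: residual sums do not cancel.
v_offline = vlad(image, kmeans(train, k=4))

print(np.abs(v_same).max(), np.linalg.norm(v_offline))
```

In this sketch the residual sums vanish only when the centroids are the means of the very descriptors being aggregated. With a fixed codebook learned on other data, each V_j measures how the image's descriptors are shifted relative to that centroid, so it is generally non-zero.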

1

FastestLearner OP t1_j4z74l7 wrote

Yes. I too agree that a large model is not required for detecting simple phrases like "Please subscribe to our channel" or "Here is the sponsor of our video". I also have another idea which I think should help in getting better accuracy. Use the channel's unique identifier (UID) or the channel's name as input (and generate probabilities conditioned on the channel's UID). This should help because any particular YouTube channel almost always uses the same phrase to introduce its sponsors in almost all of its videos. Think of LinusTechTips: you always hear the same thing, "here's the segue to our sponsor yada yada." So this should definitely allow the model to do more accurate inference. Alternatively, you can just reduce the model complexity to save the client's resources.

The other thing you mentioned, about the average user not hitting the right arrow two times: I think (and this is my hypothesis) the number of users of adblocking software is just increasing monotonically, because once a user gets to savour the internet without ads, they don't go back. Only elderly folks and the absolutely-not-computer-savvy people don't use adblockers, and IMO that population is decreasing and in the (near) future it will simply vanish. This is similar to what Steve Jobs said when he was asked whether people would ever use the mouse. Look at it now: everyone uses the mouse. Coming to sponsor blocking, not hitting the right arrow is just more convenient than hitting the right arrow two times. Sometimes hitting it x number of times does not get the job done and you need to hit it again. Also, you might miss the beginning of the non-sponsored segment, so you need to hit the left arrow once too. All of this is made convenient by the current SOTA SponsorBlock extension. It has just begun its journey and I have no doubt that, just like the adblocking extensions, sponsor blocking is going to take off and see exponential growth.

2

niclas_wue OP t1_j4yukoz wrote

Yes, it is possible to use citations as a measure of a paper's impact. However, when a paper is newly published, there are typically no citations yet, so this would result in a delayed signal. Retweets and GitHub stars provide a faster indication of a paper's impact. I believe that speed is important because, as a paper becomes older, there are already many reviews and articles written by humans that (at least for now) provide a better summary of the paper.

2

Category-Basic t1_j4yu5ck wrote

That is the million dollar question. A lot of clever people seem to be finding new ways all the time. I think that, at this point, it is safe to say that any task that has sufficient relevant data probably can be modeled and subject to ML. I might not be able to figure out how, but I am sure someone could.

1

dataslacker t1_j4yraoc wrote

Sorry, I think I didn't do a great job asking the question. The reward model, as I understand it, will rank the N generated responses from the LLM. So why not take the top-ranked response as ground truth, or as a weak label if you'd like, and train in a supervised fashion to predict the next token? This would avoid the RL training, which I understand is inefficient and unstable.
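To make the proposal concrete, here is a minimal sketch of the data-collection step only, under stated assumptions: `generate` and `reward` are hypothetical callables standing in for the LLM sampler and the reward model, and the toy versions below exist just so the example runs. This is the pseudo-labeling idea from this comment, not the procedure used in the paper.

```python
from typing import Callable, List, Tuple

def best_of_n_pseudo_labels(
    prompts: List[str],
    generate: Callable[[str, int], str],   # hypothetical: (prompt, sample index) -> response
    reward: Callable[[str, str], float],   # hypothetical: (prompt, response) -> score
    n: int = 4,
) -> List[Tuple[str, str]]:
    """For each prompt, sample n responses and keep the top-ranked one."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt, i) for i in range(n)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        dataset.append((prompt, best))  # (input, weak label) pair
    return dataset

# Toy stand-ins, purely illustrative.
def toy_generate(prompt: str, sample_index: int) -> str:
    return f"{prompt}: reply" + "!" * sample_index

def toy_reward(prompt: str, response: str) -> float:
    # Prefers responses with exactly two exclamation marks.
    return -abs(response.count("!") - 2)

pairs = best_of_n_pseudo_labels(["hi", "bye"], toy_generate, toy_reward, n=4)
print(pairs)
```

The resulting pairs would then feed an ordinary supervised next-token objective. Whether that matches the quality of the RL-trained model is exactly the open question here.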

2

JoeHenzi t1_j4yowtu wrote

Taking a look - I want to implement this in my application to explore the parameter space and shoot for the optimum, but I'm actually finding that ChatGPT gets very cagey on the topic lately. I explored the topic of Genetic Algorithms, which it suggested would be less computationally expensive, but then it decided not to really help me get to coding it.

EDIT: This is exactly my use case...
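For reference, a genetic algorithm for exploring a parameter space can be quite small. The sketch below is a generic illustration, not tied to any particular application: the function name, the bounds, the toy objective and all hyperparameters are assumptions. It keeps the better half of the population each generation, so the best score never gets worse.

```python
import random

def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mutation_scale=0.1, seed=0):
    """Minimise `fitness` over box-bounded parameters (illustrative sketch)."""
    rng = random.Random(seed)

    def clip(x, lo, hi):
        return max(lo, min(hi, x))

    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    history = []  # best score at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness)
        history.append(fitness(population[0]))
        elite = population[: pop_size // 2]  # survivors, kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            # Uniform crossover: each gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Gaussian mutation, clipped back into the allowed range.
            child = [clip(g + rng.gauss(0, mutation_scale * (hi - lo)), lo, hi)
                     for g, (lo, hi) in zip(child, bounds)]
            children.append(child)
        population = elite + children
    best = min(population, key=fitness)
    history.append(fitness(best))
    return best, history

# Toy objective with its minimum at (1, -2); stands in for a real metric.
def toy_objective(params):
    x, y = params
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

bounds = [(-5.0, 5.0), (-5.0, 5.0)]
best, history = genetic_search(toy_objective, bounds)
print(best, history[0], history[-1])
```

Each fitness call would be one run of the application with the candidate parameters, so the cost is roughly pop_size times generations evaluations.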

1