Recent comments in /f/MachineLearning
suflaj t1_j8qxasd wrote
Reply to comment by mems_m in [P] Struggling with thesis idea and implementation by mems_m
That's more an issue with how you're searching. You mention sentiment analysis, for example, but that is a problem that has been considered solved for years. There is no novelty you could add there besides a bigger model.
Obviously, you need to stop looking at what people have done and start looking at what, in the process of doing it, they didn't do or did poorly. One such thing is tokenization of text. You can't tell me that's all figured out.
mems_m OP t1_j8qx61s wrote
Reply to comment by suflaj in [P] Struggling with thesis idea and implementation by mems_m
The thing is that almost everything I can do has already been done on the public datasets I find.
mems_m OP t1_j8qx1eg wrote
Reply to comment by mems_m in [P] Struggling with thesis idea and implementation by mems_m
Novelty could be in the data or in the methods applied.
suflaj t1_j8qx0qv wrote
Reply to comment by mems_m in [P] Struggling with thesis idea and implementation by mems_m
As I've said, there's no reason you can't do something novel with that; you just can't redo what someone else has already done with it.
mems_m OP t1_j8qwyhk wrote
Reply to comment by suflaj in [P] Struggling with thesis idea and implementation by mems_m
They want us to find an existing dataset because of the short time we have, and novelty is a big part of the assessment.
suflaj t1_j8qwt5d wrote
People usually create datasets when they work on something new. I don't know why you would think that just because a dataset exists you can't, or even need to, outperform anything.
WarAndGeese t1_j8qw44j wrote
teenaxta t1_j8qvnx0 wrote
I think this has more to do with probability: the sum of many independent random variables approaches a Gaussian distribution, which you can prove with the central limit theorem. So the noise can encode all sorts of information.
When you add noise step by step, you eventually reach the normal distribution, but the noise pattern you took to get there is unique. Think of it this way: (0, 0) has a mean of 0, while (-1, 1) also has a mean of 0. The unique noise pattern actually contains useful information, whereas if you were to start from a blank canvas, your generator would have no idea what to generate from it, because that is a many-to-one mapping. The additive noise process gives a unique mapping.
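The central-limit effect described above is easy to see numerically. A minimal sketch (assuming numpy; the variable names are my own):

```python
import numpy as np

# Sum n independent uniform(-1, 1) variables per sample; by the
# central limit theorem the sums are approximately Gaussian with
# mean 0 and variance n * Var(U) = n / 3.
rng = np.random.default_rng(0)
n, samples = 50, 100_000
sums = rng.uniform(-1.0, 1.0, size=(samples, n)).sum(axis=1)

print(sums.mean())                    # close to 0
print(sums.std() / (n / 3) ** 0.5)    # ratio close to 1
```

Note that many different draws sum to the same total, which is exactly the many-to-one point made above.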
ML4Bratwurst t1_j8qu836 wrote
Yes. It's called online learning
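For context, a minimal sketch of online learning using scikit-learn's `partial_fit` interface (the streaming data here is synthetic and the names are my own):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online learning: instead of retraining on the full dataset, the
# model is updated incrementally as each new mini-batch arrives.
rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

for _ in range(100):  # simulated stream of mini-batches
    X = rng.normal(size=(32, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)

X_test = rng.normal(size=(500, 4))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
print(clf.score(X_test, y_test))  # well above chance (0.5)
```

`classes` must be passed on the first `partial_fit` call because the model never sees the whole dataset at once.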
pp314159 OP t1_j8qsgd6 wrote
Reply to comment by killergoose75 in [P] Build data web apps in Jupyter Notebook with Python only by pp314159
Thank you! We are working on a cloud service to make it easier to deploy notebooks as apps.
pp314159 OP t1_j8qsdkk wrote
Reply to comment by DigThatData in [P] Build data web apps in Jupyter Notebook with Python only by pp314159
That's true! Voila is hard to beat.
I'm working on Mercury Cloud, so you can just upload a notebook to the cloud to make it available as a web app. This should help many users deploy notebooks as apps.
We also provide commercial support for Mercury for Pro users.
Those are the pain points for Voila.
pp314159 OP t1_j8qs3r3 wrote
Reply to comment by Tomatoflee in [P] Build data web apps in Jupyter Notebook with Python only by pp314159
Sure, at the bottom of our website you can subscribe to the newsletter.
hfnuser0000 t1_j8qoshn wrote
Reply to [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
I am interested in the theoretical aspect of how your model works. Take transformers: you have tokens that attend to other tokens. In the case of RNNs, a piece of information can be preserved for later use, but at the cost of reducing memory capacity for other information, and once the information is lost, it's lost forever. So I think the context length of an RNN scales linearly with the memory capacity (and indirectly with the number of parameters), right?
DigThatData t1_j8qkl3f wrote
Reply to comment by autoraft in [P] Build data web apps in Jupyter Notebook with Python only by pp314159
all I know is voila works with panel, and panel works with basically everything (ipywidgets, bokeh, plotly...). not sure about streamlit/gradio.
Downchuck t1_j8qk6r0 wrote
Reply to [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
u/ExponentialCookie - In the Code Implementation link, lucidrains writes about reproducibility issues and tuning, both issues brought up in these comments.
farmingvillein t1_j8qj1u7 wrote
Reply to comment by bo_peng in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
> RWKV is the exception. When you look at loss against token position, it is comparable with transformers.
Can you link to what you are referring to? If I missed it in the OP post, my apologies.
farmingvillein t1_j8qipd4 wrote
Reply to comment by gwern in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Let's think step by step:
You:
> I don't think the Related Works section of that paper provides any useful references.
Your own response to the question that was posed:
> https://arxiv.org/abs/1805.04623
> https://arxiv.org/abs/1702.04521
There is no possible way that you actually read the Related Works section you dismissed, given that the papers you cited are already covered in the same references you dismissed.
E.g., "Sharp Nearby, Fuzzy Far Away" is directly discussed in the cited "Transformer-XL":
> Empirically, previous work has found that LSTM language models use 200 context words on average (Khandelwal et al., 2018), indicating room for further improvement
> Simply comparing RNNs with and RNNs without memory doesn't tell you anything about how fast the memory fades out and that it never winds up being bigger than a Transformer
I never said this, so I'm not sure what your argument is.
> we know perfectly well that Transformers make excellent use of context windows larger than 50 or 200 tokens (as my two references show)
Neither of the papers you link to (assuming you are talking about your own comment at https://www.reddit.com/r/MachineLearning/comments/1135aew/r_rwkv4_14b_release_and_chatrwkv_a_surprisingly/j8pg3g7/) make any reference to Transformers.
If your claim is that the papers indicate that RNNs have a small window (sure) and that Transformers have a longer one, you're arguing (as you seem to be in your entire post) against a strawman. Re-read what I actually wrote:
> in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.
My statement here is an empirical one around performance--which, among other things, is why I reference Dai et al, who (among others!) do a fairly extensive breakdown of empirical performance differences of RNNs- versus transformer-type architectures against long text sequences.
The whole point is that the OP said that RNNs were attractive because of the theoretically infinite context--but my response was that 1) we don't really see that in practice when we try to measure it directly (as both of our sources point out), and 2) we don't see evidence of superior long-distance behavior when testing against real-world(ish) datasets that should theoretically reward it. Both of these points are encapsulated if you follow the reference I shared (or, as I noted, most reasonable "long-distance transformer" papers).
(As with all things research...someone may come out with a small modification tomorrow that invalidates everything above--but, for now, it represents the broad public (i.e., non-private) understanding of architecture behaviors.)
Kitchen_Tower2800 t1_j8qhsuy wrote
Reply to [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms by ExponentialCookie
"It is more memory-efficient than Adam as it only keeps track of the momentum."
While this is technically true, is this a joke?
bo_peng OP t1_j8qhn5p wrote
Reply to comment by Kiseido in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Try 3.8, 3.9, or 3.10.
bo_peng OP t1_j8qhiyk wrote
Reply to comment by farmingvillein in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
RWKV is the exception. When you look at loss against token position, it is comparable with transformers.
You can tell that from the generation results too.
bo_peng OP t1_j8qhad9 wrote
Reply to comment by mz_gt in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Thank you :) Too busy for that at this moment, but I will get a paper out later this year.
afireohno t1_j8qg9eq wrote
Reply to comment by maizeq in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
There is some relevant work: "Frustratingly Short Attention Spans in Neural Language Modeling".
Kiseido t1_j8qfm0j wrote
Reply to [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
What version of Python is used for this project? I cannot find a number anywhere!
teb311 t1_j8qdjqv wrote
It’s not popular to do “online learning” for a variety of reasons; u/CabSauce gave a nice list. One reason I’d add: many models are exposed to relatively uncontrolled input, and that can backfire badly. Google “Microsoft Tay Twitter” for a cautionary tale. Garbage in, garbage out: letting your model learn in an uncontrolled environment risks ingesting (lots of) garbage, sometimes even malicious/adversarial data. Making matters worse, since the garbage affects the model in real time, the actively degrading predictions are immediately made/published/used in production.
In most cases the upside of continuous learning is small compared to batched releases, but it makes a lot of things harder and riskier.
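One common mitigation for the garbage-in risk described above is to gate each incremental update behind a validation check. A hypothetical sketch (the function name, threshold, and scikit-learn setup are my own, not from any specific production system):

```python
import copy

def guarded_update(model, X_batch, y_batch, X_val, y_val, max_drop=0.02):
    """Apply an online update only if validation accuracy does not
    drop by more than max_drop; otherwise keep the old model."""
    baseline = model.score(X_val, y_val)
    candidate = copy.deepcopy(model)
    candidate.partial_fit(X_batch, y_batch)
    if candidate.score(X_val, y_val) >= baseline - max_drop:
        return candidate  # accept the update
    return model          # reject: the batch looks like garbage
```

The model must already have been fitted once before the first call (so `score` works), and a rejected batch can additionally be logged for offline inspection, which is roughly how batched-release pipelines catch the same problems before deployment.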
rosenrotj t1_j8qxca5 wrote
Reply to [D] Simple Questions Thread by AutoModerator
Is it possible on Azure Machine Learning to run a notebook as an API? If yes, where can I find this API?