Recent comments in /f/MachineLearning
xXWarMachineRoXx t1_jb1qzsm wrote
- What's [R] and [N] in the title?
- What's dropout?
bo_peng OP t1_jb1qws0 wrote
Reply to comment by Spare_Side_5907 in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
TNN is like convolution, while RWKV can be written as a CNN too (RWKV v1 is a CNN). So there's some similarity, though not much :)
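To illustrate the general point (this is a toy sketch, not RWKV's actual formulation; the decay factor and input are made up), a linear recurrence with exponential decay can be evaluated either step by step like an RNN or all at once as a causal 1-D convolution with a precomputed kernel:

```python
import numpy as np

# Toy illustration: the same exponentially-decaying recurrence,
# computed in the RNN view and in the CNN (causal convolution) view.
T = 8                      # sequence length (arbitrary)
w = 0.9                    # decay factor, made up for the example
x = np.random.randn(T)     # toy input sequence

# RNN view: h_t = w * h_{t-1} + x_t
h_rnn = np.zeros(T)
state = 0.0
for t in range(T):
    state = w * state + x[t]
    h_rnn[t] = state

# CNN view: h_t = sum_{i<=t} w^(t-i) * x_i, i.e. a causal conv with kernel [1, w, w^2, ...]
kernel = w ** np.arange(T)
h_cnn = np.array([np.sum(kernel[:t + 1][::-1] * x[:t + 1]) for t in range(T)])

assert np.allclose(h_rnn, h_cnn)
```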
bo_peng OP t1_jb1q5fu wrote
Reply to comment by luxsteele in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
Yes a paper is coming. Meanwhile you can read https://arxiv.org/abs/2302.13939 (SpikeGPT) which is inspired by RWKV and has plenty of explanations :)
bo_peng OP t1_jb1po7i wrote
Reply to comment by _Arsenie_Boca_ in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
Will the 150 lines help? Please read the code first :)
https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py
This is ALL you need for RWKV inference.
And you can read https://arxiv.org/abs/2302.13939 (SpikeGPT) which is inspired by RWKV and has plenty of explanations :)
HillaryPutin t1_jb1ovl4 wrote
Reply to comment by Zealousideal_Low1287 in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Bro sounds like the discussion comments in some of my university courses
jobeta t1_jb1otw2 wrote
Neat! What's early s.d. in the tables in the github repo?
tonicinhibition t1_jb1ntqz wrote
Reply to comment by currentscurrents in To RL or Not to RL? [D] by vidul7498
I don't think the author of the post took a position on the original argument; rather they just presented ways to explore the latent space and make comparisons that are reasonable so that we might derive better distance metrics.
I see it as a potential way to probe for evidence of mode collapse.
ml-research t1_jb1nlad wrote
Reply to To RL or Not to RL? [D] by vidul7498
People said similar things about deep learning a long time ago.
If you can use supervised learning, then you should, because it means you have tons of data with ground-truth labels for each decision. But many real-world problems are not like that. Even humans don't know if each of their decisions is optimal or not.
rpnewc t1_jb1k781 wrote
Reply to [D] Ethics of minecraft stable diffusion by NoLifeGamer2
If you are successful in getting noticed, you may get sued. If you are just one guy (not a company), maybe not. But tread carefully. There may be a restricted licensing arrangement under which you could show your work if you want to, but I am not an expert there.
2blazen t1_jb1jk5h wrote
Reply to comment by Quazar_omega in [P] LazyShell - GPT based autocomplete for zsh by rumovoice
And lazy
Like come on at least have a landing page
deekaire t1_jb1jd2i wrote
Reply to comment by PassionatePossum in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Great comment 👍
currentscurrents t1_jb1j20n wrote
Reply to comment by tonicinhibition in To RL or Not to RL? [D] by vidul7498
>Do GANS really model the true data distribution...
I find their argument to be pretty weak. Of course these images look semantically similar; they ran a semantic similarity search to find them.
They are clearly not memorized training examples. The pose, framing, and facial expressions are very different.
royalemate357 t1_jb1h7wl wrote
Reply to comment by Art10001 in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
Hmm, I very much doubt it could have run 100x faster for the same parameter count, as you are memory-bandwidth bound (both GPT and RWKV have to load the parameters n times to generate n tokens). Also, I'm somewhat skeptical that you only need 3GB for 14B parameters *without offloading the model*, as even 4-bit quantization needs 14B/2 = 7GB. And offloading the model is slow to the point of being unusable, as you need to do CPU<->GPU transfers.
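For reference, a back-of-the-envelope estimate of weight memory for a 14B-parameter model at different precisions (weights only; activations and runtime state excluded), consistent with the 14B/2 = 7GB figure above:

```python
# Rough weight-memory estimate for a 14B-parameter model.
params = 14e9
for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: {gb:.0f} GB")
# fp16: 28 GB, int8: 14 GB, int4: 7 GB
```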
tonicinhibition t1_jb1fgpe wrote
Reply to comment by tripple13 in To RL or Not to RL? [D] by vidul7498
> people who discount GANs due to their lack of a likelihood
I was going to ask you to expand on this a little, but instead found a post that describes it pretty well for anyone else who is curious:
Do GANS really model the true data distribution...
For further nuance on this topic, Machine Learning Street Talk discussed interpolation vs. extrapolation with Yann LeCun, which Letitia Parcalabescu summarizes here.
growqx t1_jb1duv8 wrote
Reply to comment by ilyakuzovkin in To RL or Not to RL? [D] by vidul7498
>Same way as one wouldn't use RL to multiply two numbers
Zealousideal_Low1287 t1_jb1ce8q wrote
Reply to comment by szidahou in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
IDK did you read it?
luxsteele t1_jb1b68d wrote
Reply to comment by _Arsenie_Boca_ in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
Totally agree.
I have been following this for some time, but I can't fully understand it or explain it to my collaborators.
I work in ML and have quite some experience with transformers, and I still can't fully get it, let alone convince my collaborators that it is worth pursuing.
It is paramount that we have a paper explaining this in more detail if we want the community to take it seriously.
Please do it!
szidahou t1_jb19y51 wrote
How can authors be confident that this phenomenon is generally true?
[deleted] t1_jb19gqq wrote
Reply to comment by farmingvillein in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
[deleted]
farmingvillein t1_jb18evq wrote
Reply to comment by Toast119 in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Yes. In the first two lines of the abstract:
> Introduced by Hinton et al. in 2012, dropout has stood the test of time as a regularizer for preventing overfitting in neural networks. In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
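As a rough illustration of the idea (my own sketch, not the authors' code; the cutoff step and dropout rate below are placeholders, not values from the paper), one could enable dropout only for the first part of training and then switch it off:

```python
import torch.nn as nn

def set_dropout(model: nn.Module, p: float) -> None:
    # Set the drop probability of every nn.Dropout module in the model.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

early_steps = 2000   # hypothetical cutoff for "early" dropout
drop_rate = 0.1      # hypothetical drop probability

# Inside the training loop, apply dropout only during the early steps:
# for step, batch in enumerate(loader):
#     set_dropout(model, drop_rate if step < early_steps else 0.0)
#     ...forward / backward / update...
```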
rpnewc t1_jb17dvp wrote
Reply to comment by 2blazen in [D] The Sentences Computers Can't Understand, But Humans Can by New_Computer3619
For sure it can be taught. But I don't think the way to teach it is to give it a bunch of sentences from the internet and expect it to figure out advanced reasoning. It has to be explicitly tuned toward that objective. The more interesting question is then: how can we do this for all domains of knowledge in a general manner? In other words, what is the master algorithm for learning? There is one (or a collection of them) for sure, but I don't think we are anywhere close to it. ChatGPT is simply pretending to be that system, but it's not.
yannbouteiller t1_jb17aaw wrote
Reply to To RL or Not to RL? [D] by vidul7498
People will say anything in the hope of drawing attention. Reframing an unexplored MDP as a supervised learning problem makes no sense.
Art10001 t1_jb176r8 wrote
Reply to comment by ThirdMover in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
Indeed.
Art10001 t1_jb172wo wrote
Reply to comment by royalemate357 in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
It once said 100 times faster and 100 times less (V)RAM here. However, it now says that RWKV-14B can be run with only 3 GB of VRAM, which is still a massive improvement, because a 14B model normally requires roughly 30 GB of VRAM.
CatalyzeX_code_bot t1_jb1r9fn wrote
Reply to [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
Found relevant code at https://github.com/ridgerchu/SpikeGPT + all code implementations here