Recent comments in /f/MachineLearning

Loquzofaricoalaphar OP t1_j5kmiqg wrote

Yes, this is the sort of thing I am thinking about. Some percentage of people have very distinct styles, though with Ted it might have been the content that gave it away.

Yes I am familiar with amiunique and all the variables of the browser.

I wonder if this way of identifying people is ever used when Google or others get subpoenaed and hand over data. With those correlations it seems like it would be more accurate than an IP address at pinning down the individual, but I wonder whether it would be accepted by, or hold up in, a court of law.

1

trnka t1_j5kksex wrote

Hmm, you might also try feature selection. I'm not sure what you mean by not iterating, unless you mean recursive feature elimination? There are a lot of really fast correlation functions you can try for feature selection -- scikit-learn has some popular options. They run very quickly, and if you have lots of data you can probably do the feature selection part on a random subset of the training data.

Also, you could do things like dimensionality reduction learned from a subset of the training data, whether PCA or an NN-based approach.
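
Roughly what I mean, as a quick sketch with scikit-learn (the data, subset size, and k here are just placeholders, and f_classif is only one of several scoring functions you could swap in):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20_000, 500))   # placeholder: your high-dimensional features
    y = rng.integers(0, 2, size=20_000)  # placeholder: your labels

    # Fit the fast univariate selector on a random subset, then apply it to everything
    subset = rng.choice(len(X), size=5_000, replace=False)
    selector = SelectKBest(f_classif, k=50).fit(X[subset], y[subset])
    X_selected = selector.transform(X)

    # Or: dimensionality reduction (here PCA) learned from the same subset
    pca = PCA(n_components=20).fit(X[subset])
    X_pca = pca.transform(X)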

1

jpercivalhackworth t1_j5kfcw6 wrote

You are reading a lot into OP's question that I'm not seeing. Yes, COVID is anomalous; no, it's not clear that, for the purposes of modeling demand for an unidentified product, it makes sense to disregard it, adjust for it, or perform some other correction. Depending on what demand is being modeled, COVID is still a factor.

You would do well to reread what they actually wrote and what I wrote. Nowhere did I say that they should predict the next pandemic (cool if they could, but not relevant here). Considering that COVID deaths appear to be climbing in parts of the world and we don't know where the OP is modeling for, there are a lot of unknowns to address before a meaningful answer can be arrived at.

1

GinoAcknowledges t1_j5kb95p wrote

A vast amount of technological knowledge (e.g. how to create poisons, manufacture bombs) has mass destructive potential if it can be scaled. The difficulty, just like with AI, is scaling, and this mostly self-regulates (with help from the government).

For example, you can build dangerous explosive devices in your garage. That knowledge is widely available (google "Anarchists Handbook"). If you try to build thousands of them (enough to cause mass destruction), the government will notice, and most likely you aren't going to have enough money and time to do it anyway.

The exact same thing will happen for "dangerous uses of AI". The only actors that have the hardware and capital to cause mass destruction with AI are the big tech firms developing AI. Try running inference at scale on even a 30B-parameter model right now: it's extremely difficult unless you have access to multiple server-grade GPUs, which are very expensive and hard to get hold of even if you have the money.

3

trnka t1_j5k77wb wrote

The difference from application-level evaluation is a bit vague in that text. I'll use a medical example that I'm more familiar with - predicting the diagnosis from text input.

Application-level evaluation: If the output is a diagnosis code and explanation, I might measure how often doctors accept the recommended diagnosis and read the explanation without checking more information from the patient. And I'd probably want a medical quality evaluation as well, to penalize any biasing influence of the model.

Non-expert evaluation: For the same task, I might compare 2-3 different models and possibly a random baseline. I'd ask people like myself, with some exposure to medicine, which explanation is best for a particular case, and compare against the random baseline.

That said, I'm not used to seeing non-experts used as evaluators, though it makes some sense in the early stages, when explanations are still poor.

I'm more used to seeing the distinction between real and artificial evaluation. I included that in my example above -- "real" would be when we're asking users to accomplish some task that relies on the explanation and we're measuring task success. "Artificial" is more just asking for an opinion about the explanation, but the evaluators won't be as critical as they would be in a task-based evaluation.

Hope this helps! I'm not an expert in explainability though I've done some work with it in production in healthcare tech.

1

trnka t1_j5k5ndr wrote

Yeah you can use a neural network instead of linear regression if you'd like. I usually start with linear regression though, especially regularized, because it usually generalizes well and I don't need to worry about overfitting so much.

Once you're confident that you have a working linear regression model, it can be good to develop the neural network and use the linear regression model as a point of comparison. I'd also suggest a "dumb" baseline, like predicting the average car price, as another point of comparison, just to be sure the model is actually learning something.
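
As a rough sketch of that comparison (the "car price" data here is just randomly generated placeholder data, so swap in your own features and prices):

    import numpy as np
    from sklearn.dummy import DummyRegressor
    from sklearn.linear_model import Ridge
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5_000, 20))              # placeholder car features
    y = 20_000 + (X @ rng.normal(size=20)) * 1_000  # placeholder prices

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    models = {
        "mean baseline": DummyRegressor(strategy="mean"),
        "ridge regression": Ridge(alpha=1.0),
        "neural network": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, mean_absolute_error(y_test, model.predict(X_test)))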

I'm not familiar with the Levenberg–Marquardt algorithm so I can't comment on that. From the Wikipedia page it sounds like a second-order method, and those can be used if the data set is small but they're uncommon for larger data. Typically with a neural network we'd use an optimizer like plain stochastic gradient descent or a variation like Adam.
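
To make the optimizer point concrete, here's a tiny sketch with scikit-learn's MLPRegressor solvers on toy data ('lbfgs' stands in for the small-dataset quasi-Newton case; it's not Levenberg–Marquardt itself):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = X @ rng.normal(size=10)  # placeholder regression target

    # 'sgd' and 'adam' are the usual first-order choices; 'lbfgs' is a quasi-Newton
    # solver that scikit-learn suggests mainly for smaller datasets.
    for solver in ("sgd", "adam", "lbfgs"):
        model = MLPRegressor(hidden_layer_sizes=(32,), solver=solver,
                             max_iter=2000, random_state=0)
        model.fit(X, y)
        print(solver, round(model.score(X, y), 3))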

1

trnka t1_j5k4ldr wrote

It depends on the data and the problems you're having with high-dimensional data.

  • If the variables are phrases like "acute sinusitis, site not specified", you could use a one-hot encoding of the ngrams that appear in them.
  • If you have many rare values, you can just retain the top K values per feature.
  • If those don't work, the hashing trick is another great thing to try. It's just not easily interpretable (see the sketch after this list).
  • If there's any internal structure to the categories, like if they're hierarchical in some way, you can cut them off at a higher level in the hierarchy.
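
A quick sketch of the top-K and hashing-trick options (the category values here are made up; with string values, FeatureHasher hashes feature=value pairs):

    from collections import Counter
    from sklearn.feature_extraction import FeatureHasher

    rows = [{"diagnosis": "acute sinusitis"}, {"diagnosis": "chronic sinusitis"},
            {"diagnosis": "acute sinusitis"}, {"diagnosis": "rare condition xyz"}]

    # Top-K: keep only the most frequent values and map everything else to "other"
    K = 2
    top = {v for v, _ in Counter(r["diagnosis"] for r in rows).most_common(K)}
    truncated = [{"diagnosis": r["diagnosis"] if r["diagnosis"] in top else "other"}
                 for r in rows]

    # Hashing trick: fixed-size feature space, no vocabulary to store (not interpretable)
    hasher = FeatureHasher(n_features=1024, input_type="dict")
    X_hashed = hasher.transform(rows)
    print(X_hashed.shape)  # (4, 1024)
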
2

Historical-Coat5318 t1_j5k3k1o wrote

I think so, yes. In that world the dead internet theory would become true, and people would only become more dissociated from reality and society, especially once AI can generate video and audio. The political repercussions would be disastrous.

Also, I really love literature (and art in general), and a future where one cannot differentiate a human writer from an AI is, frankly, suicidally bleak to me. I can see a future where publishers use AI to read the market and write exactly the right books for maximum profit, completely cutting human authors out of the process. I am an aspiring novelist myself, and while the act of writing is intrinsically motivating, there is also a massive social component, in terms of having a career and having others read your work, that would be completely excised from creativity; so there is a personal component too, I suppose. Sharing in the creativity of other humans is the main thing that gives life meaning to me personally, and to many others, and to have that stripped from life is extremely depressing.

While this is all very speculative, I just can't see the rapid advances in AI leading anywhere except a lonelier, more isolated, and more chaotic world if it isn't seriously regulated. But all of this could be fixed if we could just identify AI text. Then nothing would change in terms of the place of human creativity in the world. It would be basically like chess: people still devote their lives to it and the community thrives, but only because we can discern AI chess playing from human chess playing. Imagine if there were no anti-cheating policies in chess tournaments; no one would ever play chess seriously again.

If we could just identify AI output, we would get all of the benefits of LLMs without any of the disastrous drawbacks. To me it is the most important issue right now, but people don't even consider it and are outright hostile to the idea; just see the downvotes on my original reply.

−1

Historical-Coat5318 t1_j5jw8o8 wrote

I just can't even begin to comprehend this view. Of course, democratizing something sounds good, but if AI has mass-destructive potential it is obviously safer if a handful of people have that power than if eight billion have it. Even if AI isn't mass-destructive, which it obviously isn't yet, it is already extremely socially disruptive, and if any given person has that power, our governing bodies have basically no hope of steering it in the right direction through regulation (which they would try to do, since it would serve their best interests as individuals). The common person would still have a say in these regulations through the vote.

−1

Historical-Coat5318 t1_j5juhb7 wrote

AI, in my view, should be controlled by very few institutions, and these institutions should be carefully managed by experts and very intelligent people, as is the case at companies like Google or OpenAI. If AI must exist, and it must, I would much rather it were in the hands of people like Sam Altman and Scott Aaronson than literally everyone with an internet connection.

Obviously terms like "open-source" and "democratised" sound good, but if you think about the repercussions you will surely realise that it would be totally disastrous for society. Looking back in history, we can see that nuclear weapons were actually quite judiciously managed when you consider all of the economic and political tensions of the time. Now imagine if anyone could have bought a nuke at Walmart; human extinction would have been assured. Open-source AI is basically democratized mass destruction, and if weapons of mass destruction must exist (including AI), then they should be in as few hands as possible.

Even ignoring existential risk, which is obviously still very speculative, LLMs should never be open-source, because that makes any regulation impossible. In that world evidence (video, images, and text), not to mention human creativity, would cease to exist, and the internet would basically be unnavigable as the chasm between people's political conception of the world and the world itself only widens. Only a few companies should be allowed to have this technology, and they should be heavily regulated. I admit I don't know how this could be implemented; I just know that it should be.

This is basically Nick Bostrom's Vulnerable World Hypothesis. Bostrom should be required reading for everyone involved in AI, in my opinion.

−2

billbobby21 t1_j5jnvmh wrote

If you spend money training a model using OpenAI's API, for example, do you actually own the model? Let's say you train it so that it gets really good at writing short stories about animals. Would you then actually own that model and have the rights to use and/or license it to others? Or would OpenAI also be able to improve their own models using the model that you created?

Basically, I'm wondering what stops the company you are using to create a model from just stealing your creation.

2

Forsaken-Indication t1_j5jniqa wrote

Either you're trolling or you need to reread what they said, bud. They know that 2020 and 2021 are anomalous due to COVID. And, as is the case across most markets, 2022 is a "new normal" year. Yes, obviously COVID continues to affect things, but no, it doesn't make sense to force a model to use anomalous data from the pandemic stage of COVID now that we're beyond that stage.

1

iLIVECSUI_741 t1_j5jmlzh wrote

Hi, I wonder how to decide *when* it is OK to submit your work to top conferences. For example, I have a model related to biological data mining; I know KDD is coming soon, but I do not like that conference and would like to wait for NeurIPS. However, I am not sure whether I will be scooped during this long wait. Thanks for your help!

1