Problem Statement: A Data Scientist in the Marketing Department at Reddit

11.10.2020 | Filed under: Uncategorized | webmaster @ 11.27

I need to find the most predictive keywords and/or phrases to accurately classify the dating advice and relationship advice subreddit pages, so we can use them to determine which advertisements should populate each page. Because this is a classification problem, we'll use Logistic Regression and Bayes models. Misclassifications in this scenario will be fairly benign, so I will use the accuracy score and set a baseline of 63.3% to rate success. Using TfidfVectorizer, I'll get the feature importances to determine which words have the highest predictive power for the target variable. If successful, this model could also be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the relationship-advice-scrape and dating-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of the repo.

Data Cleaning and EDA

  • dropped rows with a null selftext column because those rows are useless to me.
  • combined the title and selftext columns into one new all_text column
  • examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
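The cleaning steps above could look roughly like the following pandas sketch. The toy rows and the column names `title`, `selftext`, and `subreddit` are assumptions made for illustration; the real data lives in the scraped CSVs.

```python
import pandas as pd

# Toy stand-in for the scraped posts. The column names are assumptions
# about how the scraped CSVs are laid out.
posts = pd.DataFrame({
    "title": ["First date tips", "Is he ignoring me", "Moving in together"],
    "selftext": [
        "Where should we go for coffee",
        None,  # a post with no body text
        "We have been together two years",
    ],
    "subreddit": ["dating_advice", "dating_advice", "relationship_advice"],
})

# Drop rows with a null selftext.
posts = posts.dropna(subset=["selftext"]).reset_index(drop=True)

# Combine title and selftext into one new all_text column.
posts["all_text"] = posts["title"] + " " + posts["selftext"]

# Word counts per post for the title and the selftext.
posts["title_word_count"] = posts["title"].str.split().str.len()
posts["selftext_word_count"] = posts["selftext"].str.split().str.len()

# Compare the word-count distributions of the two subreddits.
summary = posts.groupby("subreddit")[
    ["title_word_count", "selftext_word_count"]
].describe()
print(summary)
```

On the real data, a histogram of each word-count column per subreddit would show the same comparison visually.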

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always choose the value that occurs most often, I will be right 63.3% of the time.
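As a minimal sketch of that baseline, the majority-class accuracy is just the share of the most frequent label. The labels below are made up (19 of one class out of 30) purely so the proportion lands near 63.3%; they are not the real post counts, and the class names are placeholders.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical label list: 19 posts from one subreddit, 11 from the other.
labels = ["subreddit_a"] * 19 + ["subreddit_b"] * 11
print(round(baseline_accuracy(labels), 3))
```

Any model worth keeping has to beat this number on held-out data.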

First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99 | test: 75 | cross val: 74.

Second attempt: tried CountVectorizer with stemming preprocessing on the first set of scrapes; pretty bad score with high variance. Train 99%, test 72%.

  • tried to decrease max features and the score got a lot worse
  • tried lemmatizer preprocessing instead and the test score went up to 74%

Simply increasing the amount of data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding two parameters to my CountVectorizer helped a lot: a min_df of 3 and an ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores disappeared.
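A rough sketch of that setup with scikit-learn is shown here: a stratified split, then a CountVectorizer with `min_df=3` and `ngram_range=(1, 2)` feeding a logistic regression. The toy sentences, the step names `cvec` and `logreg`, and the `random_state` are assumptions for illustration, so the scores it prints say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Toy posts standing in for the all_text column (made up for this sketch).
dating_texts = [
    "first date tomorrow what should i wear",
    "how do i ask her on a second date",
    "she stopped texting after our first date",
    "first date ideas for a shy guy",
] * 5
relationship_texts = [
    "my boyfriend and i keep fighting about money",
    "my girlfriend of two years wants to move in",
    "my husband forgot our anniversary again",
    "my wife and i disagree about moving in together",
] * 5

X = dating_texts + relationship_texts
y = ["dating_advice"] * len(dating_texts) + ["relationship_advice"] * len(relationship_texts)

# Stratify on y so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

pipeline = Pipeline([
    # Keep terms seen in at least 3 documents; use unigrams and bigrams.
    ("cvec", CountVectorizer(min_df=3, ngram_range=(1, 2))),
    ("logreg", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
cv_score = cross_val_score(pipeline, X_train, y_train, cv=5).mean()
print(f"train {train_score:.3f} | test {test_score:.3f} | cross val {cv_score:.3f}")
```

Keeping the vectorizer inside the pipeline means each cross-validation fold builds its vocabulary from its own training portion only.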

I think Tfidf worked the best to reduce my overfitting (a variance issue) because I customized the stop words to take away only the ones that were really too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them further to raise all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My initial list of features had plenty of gibberish words and typos. Setting the minimum number of times a word was required to show up to 2 helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped remove oversaturated words as well. Finally, setting max features to 5000 cut my columns down to about a quarter of what they were, to focus only on the most frequently used words of what was left.

Conclusion and Recommendations

Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely several words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see if different stop words produce a different or


Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post came from.
