Natural language classification with Python - best library/resource for determining sentiment associated with words?

Posted on November 5, 2020

So, I’ve been working through this tutorial and I understand it and am impressed with NLTK; however, I am somewhat more ambitious with my natural language processing goals.

What I’m doing is analyzing video game reviews to see how different games are reviewed differently. I have several large text corpuses of reviews for games that seem to have received “normal” review cycles, and one big corpus for a game that got review bombed. I want to do a couple of things:

  1. Compare the “normal” corpus with the “controversial” corpus and see what stands out most.
  2. Determine word associations, both sentiment-wise (is “graphics” more positive for some games than others?) and in terms of which words co-occur within each corpus (“graphics” and “nextgen”, maybe). The thing is, I don’t have a list of words to check; I want to sort of find this emergently.

Anyone able to advise how I should proceed? I’ve used NLTK both as shown in this tutorial and through TextBlob, but I’ve also used TensorFlow’s CPU-based classification model and spaCy, the last of which is my favorite. I’m wondering if gensim (which I played around with briefly, but couldn’t tell whether it’s the right kind of specialized for what I’m doing) might be what I need.

Relatedly: all of my data is stored as individual reviews, each associated with a particular video game, in a Neo4j instance that I can query at any time. For these big corpuses, is there any value in running text classification on individual reviews? Otherwise I’d just combine them into giant corpuses to be broken apart, but so far only TensorFlow has been able to handle even my smaller corpus of reviews (about 3 MB of text; the controversial game is almost 20 MB of raw reviews).

Any help would be deeply appreciated! Cheers, Ellie




Hi Ellie,

Your project sounds really interesting! You’ve already used some great NLP libraries, and each one has its own strengths.

Here’s my take on how you could use them, or others, to achieve your goals:

  1. Comparing normal corpus with the controversial corpus:

You could use word frequency analysis or TF-IDF (Term Frequency-Inverse Document Frequency) to identify which words or phrases stand out in each corpus. Both NLTK and scikit-learn can help you with this.

For sentiment analysis, you can use NLTK’s VADER sentiment analyzer, TextBlob’s sentiment analysis, or spaCy with a sentiment model. Each of these can provide a sentiment score that can be compared between your two corpuses.

  2. Determining word associations:

For this, you could consider using collocations (words that often appear together) or n-grams (contiguous sequences of n words). NLTK has functionalities for both of these.
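A minimal collocation sketch with NLTK (the token stream here is invented; you would tokenize the full corpus first). The frequency filter drops bigrams seen only once, and the remaining pairs are ranked by an association measure:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy token stream; in practice, tokenize the whole corpus first.
tokens = ("the graphics look nextgen and the graphics feel nextgen "
          "crisp nextgen graphics everywhere").split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep bigrams seen at least twice

# Rank remaining bigrams by likelihood ratio.
best = finder.nbest(BigramAssocMeasures.likelihood_ratio, 3)
print(best)
```

On real data you would raise the frequency filter and probably drop stopwords first, so pairs like “the graphics” don’t crowd out the informative ones.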

However, for a more semantic understanding of word relationships, you may want to consider using word embeddings. Word2Vec (available in gensim) or GloVe can help with this, providing you with vectors that can capture semantic relationships between words.

For finding emergent topics, you can also use a technique like Latent Dirichlet Allocation (LDA) for topic modeling. Gensim would be a good library for this.

  3. Your data storage:

There’s definitely value in analyzing individual reviews in addition to your overall corpus, particularly if you want to understand the range of sentiments or topics within your dataset. While TensorFlow is great for large datasets, you could also consider using PyTorch, which also has robust support for NLP tasks and can handle large datasets effectively.

If memory management is an issue while dealing with your corpus, consider using tools like Dask which can handle larger-than-memory computations by breaking them into smaller tasks.

Also, note that Deep Learning libraries (TensorFlow, PyTorch) tend to work best when you have a lot of data, as the models they use are typically larger and more complex. Traditional machine learning methods (like those implemented in scikit-learn or gensim) can work well with smaller datasets.

You might also want to check out Hugging Face's transformers library. They provide a lot of pre-trained models that you can use for different NLP tasks, including sentiment analysis and named entity recognition:

https://huggingface.co/docs/transformers/index

I hope this helps, and best of luck with your project!

Best,

Bobby

Heya,

It sounds like you have an exciting and complex project on your hands! Analyzing video game reviews to determine sentiment and word associations is a great application of natural language processing (NLP). Given your ambitions and the tools you’ve mentioned, here’s a multi-faceted approach you can take:

1. Corpus Comparison

To compare the “normal” corpus with the “controversial” one:

  • Feature Extraction: Use techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to identify unique terms in each corpus. Tools like scikit-learn in Python can help with this.

  • Topic Modeling: Use LDA (Latent Dirichlet Allocation) or NMF (Non-negative Matrix Factorization) for topic modeling. gensim is excellent for this. You can identify topics that are predominantly present in one corpus compared to the other.

  • Sentiment Analysis: Perform sentiment analysis on both corpuses using TextBlob, NLTK, or spaCy to gauge overall sentiment trends. This will help in understanding if the “controversial” reviews are more negative overall.

2. Word Associations and Sentiment Correlation

  • Word Embeddings: Use word embeddings (Word2Vec, GloVe) from gensim or spaCy. This can help you find word associations and understand context. You can see how words like “graphics” are contextually situated across different reviews.

  • Collocations and Concordance: NLTK is good for finding collocations (words that frequently appear together). This can help identify terms that are commonly associated with each other in your data.

  • Sentiment Correlation: For each game, calculate the sentiment score of sentences or reviews containing specific keywords (like “graphics”). This will help you determine if certain aspects (like graphics) are viewed more positively in some games than others.
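The sentiment-correlation idea can be sketched in a few lines of plain Python. The scorer here is a deliberately trivial stand-in (a real one would be, say, VADER’s compound score); `keyword_sentiment` and the sample reviews are hypothetical names for illustration:

```python
def keyword_sentiment(reviews, keyword, score_sentiment):
    """Average sentiment of reviews mentioning `keyword`.

    `score_sentiment` is any callable returning a float per text,
    e.g. VADER's compound score. Returns None if no review matches.
    """
    hits = [score_sentiment(r) for r in reviews if keyword in r.lower()]
    return sum(hits) / len(hits) if hits else None

# Trivial toy scorer standing in for a real sentiment model.
toy_scorer = lambda text: 1.0 if "great" in text.lower() else -1.0

reviews_a = ["Great graphics!", "graphics are great here too"]
reviews_b = ["The graphics are awful", "great story, terrible graphics"]

print(keyword_sentiment(reviews_a, "graphics", toy_scorer))
print(keyword_sentiment(reviews_b, "graphics", toy_scorer))
```

Running this per game, with the same keyword and scorer, gives directly comparable numbers for questions like “is ‘graphics’ reviewed more positively for game A than game B?”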

3. Text Classification on Individual Reviews

  • Granular Analysis: While aggregating reviews into large corpuses is useful for general trends, analyzing individual reviews allows for a more granular understanding of sentiment and can highlight outliers or particularly influential reviews.

  • Machine Learning Models: Since TensorFlow handles your data well, consider using it for individual review classification. You might experiment with different architectures like CNNs or RNNs for better performance.

4. Data Storage and Retrieval

  • Utilizing Neo4j: Since your data is stored in Neo4j, ensure efficient querying for retrieval. Depending on your analysis, you might retrieve data based on specific games, dates, or other parameters.

  • Batch Processing: For large datasets, consider batch processing. Retrieve manageable chunks of data from Neo4j, process them, and then move to the next batch.
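One way to batch the Neo4j retrieval is simple Cypher pagination with `SKIP`/`LIMIT`. The `:Review` label and `.text` property below are assumptions about your schema, and in real code each query would be run through the official `neo4j` Python driver:

```python
def review_batches(total_reviews, batch_size):
    """Yield Cypher queries that page through reviews in chunks.

    The :Review label and .text property are assumptions about the
    graph schema; adjust them to the actual data model.
    """
    for skip in range(0, total_reviews, batch_size):
        yield (f"MATCH (r:Review) RETURN r.text "
               f"ORDER BY id(r) SKIP {skip} LIMIT {batch_size}")

queries = list(review_batches(total_reviews=2500, batch_size=1000))
print(len(queries))  # three batches: 1000 + 1000 + 500 reviews
```

The stable `ORDER BY` matters: without it, pages can overlap or skip rows between queries. Process each batch, discard it, then fetch the next, so the full 20 MB corpus never has to sit in memory at once.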

5. Scalability and Performance

  • Handling Large Corpuses: For very large datasets, look into more scalable solutions like Apache Spark with its NLP capabilities, which can handle large-scale text processing more efficiently.

  • Optimize Your Code: Ensure your code is optimized for performance. Efficient data structures, parallel processing, and optimized queries can make a big difference.

6. Visualization and Reporting

  • Visual Tools: Use visualization tools like Matplotlib, Seaborn, or even dashboard tools like Tableau or Power BI to present your findings. Visualizations like word clouds, sentiment distributions, and topic prevalence can be very insightful.

7. Continuous Improvement

  • Iterative Process: NLP projects often require an iterative approach. Regularly refine your models and approaches based on the insights you gain.

  • Stay Updated: The field of NLP is rapidly evolving. Keep an eye on the latest research and tools that might offer improved methodologies or efficiencies.

Considering your preference for spaCy and experience with TensorFlow, these should be your primary tools, complemented by gensim for specific tasks like topic modeling or word embeddings. Remember, each tool has its strengths, and the best approach often involves a combination of several tools and techniques. Happy analyzing!
